An inertial algorithm for DC programming

Welington de Oliveira¹ · Michel Tcheou²

¹ MINES ParisTech, PSL Research University, CMA – Centre de Mathématiques Appliquées, France. E-mail: welington.oliveira@mines-paristech.fr
² Universidade do Estado do Rio de Janeiro, Brazil.

January 12, 2018

Abstract We consider nonsmooth optimization problems whose objective function is defined by the difference of convex (DC) functions. With the aim of computing critical points that are also d(irectional)-stationary for such a class of nonconvex programs, we propose an algorithmic scheme equipped with an inertial-force procedure. In contrast to the classical DC algorithm of P. D. Tao and L. T. H. An, the proposed inertial DC algorithm defines trial points whose sequence of functional values is not necessarily monotonically decreasing, a property that proves useful to prevent the algorithm from converging to a critical point that is not d-stationary. Moreover, our method can handle inexactness in the solution of the convex subproblems yielding trial points. This is another property of practical interest that substantially reduces the computational burden of computing d-stationary/critical points of DC programs. Convergence analysis of the proposed algorithm yields global convergence to critical points, and a convergence rate is established for the considered class of problems. Numerical experiments on large-scale (nonconvex and nonsmooth) image denoising models show that the proposed algorithm outperforms the classic one in this particular application, more specifically in the case of piecewise constant images with neat edges such as QR codes.

Keywords DC programming · Nonsmooth optimization · Variational analysis

1 Introduction

In this work we consider nonconvex nonsmooth optimization problems of the form

    min_{x ∈ R^n} f(x),   with   f(x) := f_1(x) − f_2(x),    (1)

where f_1, f_2 : R^n → R ∪ {+∞} are convex and possibly nonsmooth functions. Problems of this type are known in the literature as DC programs, with "DC" standing for difference of convex functions [16]. We assume throughout this manuscript that f_1 is a closed function and that Dom(f_1) ⊂ Ω ⊂ Dom(f_2), where Ω is an open and convex set in R^n. This assumption allows us to encompass convex constrained DC programs in formulation (1). Indeed, notice that the first component function f_1 can be, for example, the sum of a convex function ϕ with the indicator function i_X of a convex set X ⊂ Ω, i.e., f_1(x) = ϕ(x) + i_X(x).

DC programming forms an important subfield of nonconvex programming and has been receiving much attention from the mathematical programming community [16, 32, 22, 29, 20, 34, 14]. More general DC programs with DC constraints are investigated in [23, 30, 31, 27, 35]. Some applications include production-transportation planning problems [18], location planning problems [34, Chapter 5], physical-layer-based security in digital communication systems [27], chance-constrained problems [10], cluster analysis [5], engineering design problems [34], energy management problems [35], and others. We refer to [34, Part II] for a comprehensive presentation of several algorithms designed for DC optimization problems. Solving nonsmooth programs such as (1) globally is a challenging task, especially in the large-scale setting. We will, therefore, deal with this class of problems by employing local-solution approaches.

A well-known method for dealing with optimization problems such as (1) is the DC Algorithm (DCA) of [32] (see also [33, 4, 22]). The classical DCA handles problem (1) by defining a sequence of trial points according to the following rule, for a given starting point x^0 ∈ Dom(f_1):

    for all k = 0, 1, 2, ...,  compute  g_2^k ∈ ∂f_2(x^k)  and  x^{k+1} ∈ ∂f_1^*(g_2^k),    (2)

where f_1^* is the conjugate function of f_1. It can be shown (see [32, Theorem 3]) that every cluster point x̄ (if any) of the sequence {x^k} generated by rule (2) is a critical point of problem (1), i.e., x̄ satisfies

    ∂f_1(x̄) ∩ ∂f_2(x̄) ≠ ∅.    (3)

It follows from convexity of the component functions f_1 and f_2 that rule (2) yields a monotone sequence of function values, i.e., f(x^{k+1}) ≤ f(x^k) for all k = 0, 1, ... (see [32, Theorem 3(i)] for more details). While monotonicity might be seen as a quality of the method, demanding monotonically decreasing function values might not be an ideal scheme in nonconvex optimization: depending on the starting point, iterates are attracted by poor-quality critical points that prevent the (monotone) algorithm from computing a critical point of better quality. In the context of DC programming, we mean by a "critical point of better quality" a critical point x̄ that is d(irectional)-stationary, i.e., x̄ satisfies

    ∂f_2(x̄) ⊂ ∂f_1(x̄).    (4)

As shown in [27], d-stationarity is the sharpest stationarity definition for nonconvex problems of type (1) (see also additional comments in Section 2). It is clear from its definition that computing d-stationary points for (1) is not a trivial task in general. A few exceptions are the situations in which either f_1 or f_2 is differentiable [12, Section 2], and the case in which f_2 is the pointwise maximum of finitely many convex and differentiable functions. The latter case is investigated in [27], where the authors propose a proximal linearized method that solves several convex programs per iteration, and thus has a high computational burden. A less computationally demanding method is a variant of the proximal bundle algorithm proposed in [10], which nevertheless requires solving many quadratic programs per iteration; see [10, Algorithm 2].

In this work, we are concerned with algorithms of low computational cost for computing critical points. Moreover, we do not assume that f_2 is the pointwise maximum of finitely many convex and differentiable functions. In order to try to transpose critical points that are not d-stationary, we furnish the DCA represented by rule (2) with an inertial scheme that can be seen as a version of the heavy-ball method of Polyak [28].

In a differentiable and convex framework, the heavy-ball method is a two-step gradient algorithm that can be interpreted as an explicit finite-difference discretization of the so-called heavy-ball with friction dynamical system [26]. In summary, the algorithm incorporates an inertial force in the iterative process of gradient methods by appending to the negative-gradient direction the inertial term γ(x^k − x^{k−1}), where γ ≥ 0 is a given parameter. The heavy-ball method has been generalized in several manners: the authors of [38] consider differentiable but nonconvex optimization problems, and in [3, 24, 2] the heavy-ball method was extended to handle maximal monotone operators. The paper [26] deals with nonsmooth nonconvex optimization problems of the form min_{x∈R^n} f_1(x) + ξ(x), where f_1 is as above and ξ : R^n → R is a smooth nonconvex function with Lipschitz continuous gradient. The authors of [26] assume f_1 + ξ to be coercive and propose a linearized proximal method with inertial force to compute a stationary point.

In the DC context, ξ = −f_2 is a concave function. However, in this work we assume neither that f_2 is smooth nor that f = f_1 − f_2 is coercive. Our proposal follows the general lines of the DCA and replaces the subgradient g_2^k of the second component function f_2 in rule (2) by g_2^k + γ(x^k − x^{k−1}), with γ ≥ 0. This leads to the following iterative scheme, for a given point x^0 ∈ Dom(f_1):

    for all k = 0, 1, 2, ...,  compute  g_2^k ∈ ∂f_2(x^k)  and  x^{k+1} ∈ ∂f_1^*(g_2^k + γ(x^k − x^{k−1})).    (5)

As we will see in Section 3 below, the sequence of function values issued by the above scheme is not necessarily monotone, due to the inertia imposed by the term γ(x^k − x^{k−1}). As already mentioned, this property can be beneficial in the quest for computing critical points that are also d-stationary for (1). Nevertheless, it is not guaranteed that our Inertial DC Algorithm (IDCA), illustrated by rule (5), computes d-stationary points, only critical ones. Numerical experiments reported in Section 6 below show that IDCA is more robust than DCA, meaning that for the same starting points IDCA very often computes better critical points than DCA does.

Rule (5) (as well as rule (2)) is not practical when a subgradient of f_1^* is not easily computed. For this reason, the given algorithm relaxes the requirement x^{k+1} ∈ ∂f_1^*(g_2^k + γ(x^k − x^{k−1})) to a milder one, demanding x^{k+1} to belong to an approximate subdifferential of f_1^* at the point g_2^k + γ(x^k − x^{k−1}), with a vanishing approximation error. This is a second property of practical interest of our algorithm, as it substantially reduces the computational burden of computing a d-stationary/critical point of DC programs.

The remainder of this work is organized as follows: Section 2 provides some notation and preliminary results that will be employed throughout the text. In addition, the section contains two examples illustrating the benefits of incorporating an inertial force into the DC algorithm. Section 3 presents our inertial DC algorithmic pattern as well as its convergence analysis and convergence rate. In Section 4 we discuss some practical issues concerning the implementation of some variants of our algorithm, and the application of interest is discussed in Section 5: nonconvex image denoising models. Section 6 reports some preliminary numerical results comparing IDCA against DCA and the nonconvex algorithm iPiano of [26]. Finally, Section 7 closes the paper with some concluding remarks.

2 Notation, main definitions and illustrative examples

For any points x, y ∈ R^n, ⟨x, y⟩ stands for the Euclidean inner product, and ‖·‖ for the associated norm, i.e., ‖x‖ = √⟨x, x⟩. For a set X ⊂ R^n, we denote by i_X its indicator function, i.e., i_X(x) = 0 if x ∈ X and i_X(x) = +∞ otherwise. For a convex set X, its normal cone at the point x is denoted by N_X(x), which is the set {y : ⟨y, z − x⟩ ≤ 0 ∀z ∈ X} if x ∈ X and the empty set otherwise. Given a convex function f_i, we denote its subdifferential at the point x by

    ∂f_i(x) = {g_i : f_i(y) ≥ f_i(x) + ⟨g_i, y − x⟩ ∀y ∈ Ω}.

The approximate subdifferential is denoted, for ε ≥ 0, by

    ∂_ε f_i(x) := {g_i ∈ R^n : f_i(y) ≥ f_i(x) + ⟨g_i, y − x⟩ − ε ∀y ∈ Ω}.

As convex functions are directionally differentiable, the limit

    f_i′(x; d) := lim_{t↓0} [f_i(x + td) − f_i(x)] / t

is well defined for all x ∈ Ω and d ∈ R^n. Since DC functions are also locally Lipschitz continuous (because their components f_i are so), their directional derivatives are well defined as well and satisfy

    f′(x; d) = f_1′(x; d) − f_2′(x; d).

A point x̄ ∈ R^n is a d(irectional)-stationary point of problem (1) if f′(x̄; x − x̄) ≥ 0 for all x ∈ R^n, which can be shown to be equivalent to inclusion (4). Notice that verifying (4) computationally is impractical in many cases of interest. Hence, one generally employs a weaker notion of stationarity: a point x̄ ∈ R^n is called a critical point of problem (1) if x̄ satisfies (3). In summary, all local minimizers of problem (1) are d-stationary points, which in turn are critical points of (1). The reverse implications are, in general, not true, as illustrated in [27, Example 2] (see also Example 1 below).


In what follows we consider rules (2) and (5) and present practical ways to define trial points.

We recall that the conjugate of a convex function f_i is

    f_i^*(g_i) := sup_{x ∈ R^n} {⟨g_i, x⟩ − f_i(x)}.

Since f_1 is a closed and convex function, we have that x^{k+1} ∈ ∂f_1^*(g_2^k) (rule (2)) if and only if g_2^k ∈ ∂f_1(x^{k+1}) [15, Prop. 6.1.2]. The latter inclusion is satisfied if x^{k+1} is a solution of the following convex subproblem:

    min_{x ∈ R^n} f_1(x) − ⟨g_2^k, x⟩.    (6)

Analogously, the condition x^{k+1} ∈ ∂f_1^*(g_2^k + γ(x^k − x^{k−1})) of (5) is satisfied if x^{k+1} solves

    min_{x ∈ R^n} f_1(x) − ⟨g_2^k + γ(x^k − x^{k−1}), x⟩.    (7)

We illustrate the behavior of the sequences generated by these two rules in the following example.

Example 1 (Transposing critical points that are not d-stationary.) Consider the two-dimensional DC function f(x) = f_1(x) − f_2(x), with f_1(x) = ‖x‖² and f_2(x) = max(−x_1, 0) + max(−x_2, 0) + 0.5‖x‖². Its graph is plotted in Figure 1, together with the behavior of some sequences of points generated by defining the next iterate x^{k+1} as a solution of (6) (DCA, Figure 1(b)) and by solving (7) (IDCA, Figure 1(c)) with inertial factor γ = 0.49.

[Figure 1 (a) DC function; (b) DCA; (c) IDCA with γ = 0.49. Iterative process and level curves of the function f(x) = f_1(x) − f_2(x), with f_1(x) = ‖x‖² and f_2(x) = max(−x_1, 0) + max(−x_2, 0) + 0.5‖x‖². Comparison between the classical DCA and the proposed inertial DC algorithm.]

For the four different starting points, DCA determined four different critical points, all presented in Table 1. However, the global solution x̄ = (−1, −1)^T is the only d-stationary point of the problem of minimizing f over R². □

x̄            f(x̄)    ∂f_1(x̄)      ∂f_2(x̄)
(−1, −1)^T    −1       (−2, −2)^T    (−2, −2)^T
(0, −1)^T     −0.5     (0, −2)^T     (s, −2)^T with s ∈ [−1, 0]
(−1, 0)^T     −0.5     (−2, 0)^T     (−2, s)^T with s ∈ [−1, 0]
(0, 0)^T      0        (0, 0)^T      (s_1, s_2)^T with s_1, s_2 ∈ [−1, 0]

Table 1  Critical points of the function f(x) = ‖x‖² − [max(−x_1, 0) + max(−x_2, 0) + 0.5‖x‖²] determined by rule (2).

While the critical points computed by the classical DC algorithm depend strongly on the initial starting points, the inertial DC algorithm is able (in this example) to compute the d-stationary point (in this case a global solution) regardless of the initial point. This is thanks to the inertial term γ(x^k − x^{k−1}), which prevents the iterative process from stopping at critical points that are not d-stationary. In some situations it is also possible to overcome local solutions, as illustrated by Example 2 below.
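To make the comparison reproducible, note that with f_1(x) = ‖x‖² subproblems (6) and (7) have the closed-form solution x = g/2. The following minimal Python sketch (the starting point and the subgradient selection at x_i = 0 are illustrative choices, not prescribed by the paper) implements both rules on Example 1:

```python
import numpy as np

def subgrad_f2(x):
    # One valid subgradient of f2(x) = max(-x1,0) + max(-x2,0) + 0.5*||x||^2.
    # At x_i = 0 any s_i in [-1, 0] is admissible; we pick s_i = 0 there.
    return np.where(x < 0.0, -1.0, 0.0) + x

def idca(x0, gamma, iters=200):
    # Rule (5); gamma = 0 recovers the classical DCA rule (2).
    # With f1(x) = ||x||^2, subproblem (7) is solved by x = g/2.
    x_prev = x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x_next = 0.5 * (subgrad_f2(x) + gamma * (x - x_prev))
        x_prev, x = x, x_next
    return x

x0 = (2.0, 2.0)                 # illustrative starting point
print(idca(x0, gamma=0.0))      # DCA stalls at the critical point (0, 0)
print(idca(x0, gamma=0.49))     # IDCA reaches the d-stationary point (-1, -1)
```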


Example 2 (A two-variable nonconvex 1D denoising model.) We consider the following nonconvex denoising optimization problem for 1D signals:

    min_{x ∈ R^n} (μ/2)‖x − b‖² + Σ_{i=1}^{n−1} φ(|x_{i+1} − x_i|).

The concave function φ(t) := log(1 + 2t)/2 is employed to induce sparsity of the one-lag difference of the reconstructed signal x̄: one wishes to reconstruct piecewise constant signals. As will be shown later (see Proposition 1), the above nonconvex objective is indeed a DC function f = f_1 − f_2, with possible DC components given by

    f_1(x) := (μ/2)‖x − b‖² + Σ_{i=1}^{n−1} |x_{i+1} − x_i| + ‖x‖²   and
    f_2(x) := Σ_{i=1}^{n−1} |x_{i+1} − x_i| − Σ_{i=1}^{n−1} φ(|x_{i+1} − x_i|) + ‖x‖².

In order to analyze the iterative process yielded by rules (2) and (5) applied to this problem, we consider dimension n = 2 and parameters μ = 0.6, b = (0.1, 3)^T. Notice that, differently from Example 1, subproblems (6) and (7) do not have explicit solutions. We therefore compute iterates by solving these subproblems numerically up to a given tolerance. The iterative processes with five different initial points are presented in Figure 2.

[Figure 2 (a) DCA; (b) IDCA with γ = 0.7; (c) IDCA with γ = 0.9. Iterative process and level curves of the function f(x) = (μ/2)‖x − b‖² + Σ_{i=1}^{n−1} φ(|x_{i+1} − x_i|), with μ = 0.6, b = (0.1, 3)^T and φ(t) = log(1 + 2t)/2. The global solution is x̄ = (0.3970, 2.7030)^T and the optimal value is f(x̄) ≈ 0.91538. The set of critical points of f that does not contain the global solution is C = {(t, t) : max{b_1, b_2} − 1/μ ≤ t ≤ min{b_1, b_2} + 1/μ}, i.e., C ≈ {(t, t) : 1.3333 ≤ t ≤ 1.7667}.]

For three of the five initial points, the sequences generated by DCA did not converge to the global minimum x̄ = (0.3970, 2.7030)^T, but to the critical point (31/20, 31/20)^T, which is a local solution (thus a d-stationary point) of the problem. Rule (5), instead, was more successful thanks to the inertial factor γ: for γ = 0.7, only one sequence converged to the critical point that is not the global minimum, while for γ = 0.9 (a larger factor of inertia) all five sequences converged to the global solution. □

Example 2 suggests considering a larger factor of inertia γ ≥ 0. However, our analysis in Section 3 shows that γ cannot be arbitrarily large: the inertial parameter can vary along the interval [0, ρ/2), where ρ > 0 is the constant of strong convexity of the second component function f_2. We therefore assume in the remainder of this paper the following condition, which is a mild hypothesis in the DC setting:

Assumption A1. The function f_2 is strongly convex on Ω with a known parameter ρ > 0, that is, for every g_2 ∈ ∂f_2(x) one has

    f_2(y) ≥ f_2(x) + ⟨g_2, y − x⟩ + (ρ/2)‖y − x‖²   ∀x, y ∈ Ω.    (8)

We care to mention that A1, also present in [35], is not a restrictive assumption at all. In fact, if A1 does not hold for a certain DC function f = ϕ − ψ, we can obtain another DC decomposition of f satisfying A1 by adding an arbitrary strongly convex function ω : Ω → R to the component functions: note that f = f_1 − f_2 with f_1 = ϕ + ω and f_2 = ψ + ω. Since one can always take ω(·) = ‖·‖² (as we have considered in Example 2), ρ can be assumed known in A1 without loss of generality (just take ρ = 2 in this case). A minimal code sketch of this shift is given below.
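A minimal sketch of this decomposition shift in Python (varphi and psi stand for user-supplied evaluations of the original components; the names are hypothetical):

```python
import numpy as np

def enforce_A1(varphi, psi):
    # Given f = varphi - psi, return components f1, f2 with f2 strongly
    # convex: adding omega(x) = ||x||^2 to both makes A1 hold with rho = 2.
    omega = lambda x: float(np.dot(x, x))
    f1 = lambda x: varphi(x) + omega(x)
    f2 = lambda x: psi(x) + omega(x)
    return f1, f2
```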


3 An inertial DC algorithmic pattern

In this section, we formalize our inertial DC algorithm (IDCA) represented by rule (5). We care to mention that if subproblem (7) is difficult to solve (e.g. when f_1 is assessed via simulation, optimization, numerical multidimensional integration, etc.), then defining trial points by rule (5) (as well as by (2)) can be too time consuming. In order to overcome this difficulty (which arises very often in real-life applications), we follow the lead of the recent work [35] and define trial points as inexact solutions of subproblem (7). Differently from [35] (which requires x^{k+1} to be sufficiently away from x^k), we define x^{k+1} in a certain approximate subdifferential of f_1^* at the point g_2^k + γ(x^k − x^{k−1}). The inexactness involved in the iterative process is automatically controlled by the algorithm in such a manner that the errors vanish as the iterative process progresses. The IDCA is presented in the following algorithmic pattern.

Algorithm 1  Inertial DC algorithmic pattern

1: Let x^0 ∈ Dom(f_1), λ ∈ [0, 1) and γ ∈ [0, (1−λ)ρ/2) be given. Set x^{−1} = x^0
2: for k = 0, 1, 2, ... do
3:    Set d^k = γ(x^k − x^{k−1}) and find x^{k+1} ∈ R^n such that
          ∂_{ε_{k+1}} f_1(x^{k+1}) ∩ [∂f_2(x^k) + d^k] ≠ ∅   with   0 ≤ ε_{k+1} ≤ (λρ/2)‖x^{k+1} − x^k‖²    (9)
4:    Stop if x^{k+1} = x^k and d^k = 0
5: end for

The following is an alternative to condition (9) that is suitable when the first component function f_1 is smooth: compute x^{k+1} ∈ R^n such that

    ‖g_1^{k+1} − (g_2^k + d^k)‖ ≤ (λρ/2)‖x^{k+1} − x^k‖   for some g_1^{k+1} ∈ ∂f_1(x^{k+1}) and g_2^k ∈ ∂f_2(x^k).    (10)

If λ = γ = 0, then (9) is equivalent to (10) and, moreover, Algorithm 1 becomes the classic DC algorithm of [32]: the condition ∂f_1(x^{k+1}) ∩ ∂f_2(x^k) ≠ ∅ holds if x^{k+1} solves (6). If, instead, γ > 0 but λ = 0, then either (9) or (10) is equivalent to defining x^{k+1} by solving (7). In this case, if f_1(x) = ϕ(x) + ‖x‖²/2 and f_2(x) = ψ(x) + ‖x‖²/2 (for two convex functions ϕ and ψ), then Algorithm 1 becomes a linearized proximal method with inertial force (see Section 4 below).

Furthermore, the choice λ > 0 means that subproblem (7) (or (6)) can be solved inexactly. Practical details on how to implement (9) and (10) for λ > 0 are given in Section 4 below: in a particular case we will see that Algorithm 1 turns into a sort of proximal bundle method for DC programs [10].

We now proceed to show that Algorithm 1 (employing either rule (9) or (10)) converges to a critical point of (1).

3.1 Convergence analysis

As the point x^{k+1} is chosen to satisfy ∂_{ε_{k+1}} f_1(x^{k+1}) ∩ [∂f_2(x^k) + d^k] ≠ ∅ (or ∂f_1(x^{k+1}) ≠ ∅ if rule (10) is employed), we conclude that ∂f_1(x^{k+1}) ≠ ∅ and therefore x^{k+1} ∈ Dom(f_1), implying that {x^k} ⊂ Dom(f_1). By assumption, there exists an open set Ω in R^n satisfying Dom(f_1) ⊂ Ω ⊂ Dom(f_2). Hence, the sequence of points generated by Algorithm 1 is well defined. Furthermore, since f_2 is a convex function, its subdifferential ∂f_2 is locally bounded. As a result, any sequence {g^k} with g^k ∈ ∂_{ε_{k+1}} f_1(x^{k+1}) ∩ [∂f_2(x^k) + d^k] is bounded as long as {x^k} is bounded as well. With this in mind, we start the convergence analysis of Algorithm 1 with the following simple lemma.

Lemma 1 Suppose Algorithm 1 terminates at iteration k. Then x^k is a critical point of problem (1).

Proof Under these assumptions, the algorithm's stopping test yields x^{k+1} = x^k and d^k = 0. Either (9) or (10) then provides ∂f_1(x^k) ∩ ∂f_2(x^k) ≠ ∅, i.e., x^k is a critical point of problem (1). □


The following lemma plays an important role in the convergence analysis of Algorithm 1.

Lemma 2 Let {x^k} be the sequence generated by Algorithm 1. If assumption A1 holds, λ ∈ [0, 1) and γ ∈ [0, (1−λ)ρ/2), then, for all k = 0, 1, 2, ...,

    f(x^{k+1}) + [((1−λ)ρ − γ)/2]‖x^{k+1} − x^k‖²
        ≤ f(x^k) + [((1−λ)ρ − γ)/2]‖x^k − x^{k−1}‖² − [((1−λ)ρ − 2γ)/2]‖x^k − x^{k−1}‖².

Proof Suppose that x^{k+1} satisfies (9), and let g_2^k + d^k ∈ ∂_{ε_{k+1}} f_1(x^{k+1}) with g_2^k ∈ ∂f_2(x^k) and d^k = γ(x^k − x^{k−1}). Then f_1(x) ≥ f_1(x^{k+1}) + ⟨g_2^k + d^k, x − x^{k+1}⟩ − ε_{k+1} for all x ∈ Ω. In particular, by setting x = x^k and recalling that ε_{k+1} ≤ (λρ/2)‖x^{k+1} − x^k‖² by (9), we get

    f_1(x^k) ≥ f_1(x^{k+1}) + ⟨g_2^k + d^k, x^k − x^{k+1}⟩ − (λρ/2)‖x^{k+1} − x^k‖².    (11)

Suppose now that x^{k+1} satisfies (10). Then

    f_1(x^k) ≥ f_1(x^{k+1}) + ⟨g_1^{k+1}, x^k − x^{k+1}⟩
             = f_1(x^{k+1}) + ⟨g_2^k + d^k, x^k − x^{k+1}⟩ − ⟨g_2^k + d^k − g_1^{k+1}, x^k − x^{k+1}⟩
             ≥ f_1(x^{k+1}) + ⟨g_2^k + d^k, x^k − x^{k+1}⟩ − ‖g_1^{k+1} − (g_2^k + d^k)‖ ‖x^{k+1} − x^k‖,

where the last step is due to the Cauchy-Schwarz inequality. Since condition (10) ensures that ‖g_1^{k+1} − (g_2^k + d^k)‖ ≤ (λρ/2)‖x^{k+1} − x^k‖, we once again obtain inequality (11).

Assumption A1 gives the inequality f_2(x^{k+1}) ≥ f_2(x^k) + ⟨g_2^k, x^{k+1} − x^k⟩ + (ρ/2)‖x^{k+1} − x^k‖², which summed to (11) yields f_1(x^k) − f_2(x^k) ≥ f_1(x^{k+1}) − f_2(x^{k+1}) + ⟨d^k, x^k − x^{k+1}⟩ + ((1−λ)ρ/2)‖x^{k+1} − x^k‖². Therefore,

    f(x^{k+1}) + ((1−λ)ρ/2)‖x^{k+1} − x^k‖² ≤ f(x^k) + ⟨d^k, x^{k+1} − x^k⟩.

Since ⟨d^k, x^{k+1} − x^k⟩ = γ⟨x^k − x^{k−1}, x^{k+1} − x^k⟩ ≤ (γ/2)‖x^k − x^{k−1}‖² + (γ/2)‖x^{k+1} − x^k‖², it follows that

    f(x^{k+1}) + [((1−λ)ρ − γ)/2]‖x^{k+1} − x^k‖² ≤ f(x^k) + (γ/2)‖x^k − x^{k−1}‖².

The stated result thus follows from the identity γ/2 = ((1−λ)ρ − γ)/2 − ((1−λ)ρ − 2γ)/2. □

As λ ∈ [0, 1) and γ ∈ [0, (1−λ)ρ/2), we have that ((1−λ)ρ − γ)/2 ≥ ((1−λ)ρ − 2γ)/2 > 0. Hence the sequence {f(x^k) + [((1−λ)ρ − γ)/2]‖x^k − x^{k−1}‖²} is monotonically decreasing. This property is enough to prove convergence of Algorithm 1. We care to mention that the sequence of functional values {f(x^k)} is not necessarily monotone, in contrast to all other DC algorithms found in the literature (see, for instance, [32, 27, 29, 20, 12, 10]). Recall that the non-monotonicity of the function values was crucial for IDCA to escape from the local solution in Example 2.

Theorem 1 Consider Algorithm 1, assume A1, λ ∈ [0, 1), γ ∈ [0, (1−λ)ρ/2), and x^0 ∈ Dom(f_1). Assume also that the level set {x ∈ R^n : f(x) ≤ f(x^0)} is bounded and inf_{x∈R^n} f(x) > −∞. Then any cluster point x̄ of the sequence {x^k} generated by the algorithm is a critical point of problem (1).

Proof If the algorithm terminates at a certain iteration k, Lemma 1 provides the result. Let us assume the algorithm does not stop. Lemma 2 ensures that the sequence {f(x^k) + [((1−λ)ρ − γ)/2]‖x^k − x^{k−1}‖²} is monotonically decreasing. Since

    f(x^k) ≤ f(x^k) + [((1−λ)ρ − γ)/2]‖x^k − x^{k−1}‖² ≤ f(x^0) + [((1−λ)ρ − γ)/2]‖x^0 − x^{−1}‖² = f(x^0),


the assumption of a bounded level set ensures that {x^k} is a bounded sequence. By summing up the inequality in Lemma 2 we obtain

    0 ≤ [((1−λ)ρ − 2γ)/2] Σ_{k=0}^{∞} ‖x^k − x^{k+1}‖²
      ≤ Σ_{k=0}^{∞} ( [f(x^k) + (((1−λ)ρ − γ)/2)‖x^k − x^{k−1}‖²] − [f(x^{k+1}) + (((1−λ)ρ − γ)/2)‖x^{k+1} − x^k‖²] )
      = f(x^0) − lim_k [ f(x^{k+1}) + (((1−λ)ρ − γ)/2)‖x^{k+1} − x^k‖² ]
      ≤ f(x^0) − inf_{x∈R^n} f(x) < ∞.    (12)

As a result, lim_k ‖x^k − x^{k+1}‖ = 0 and thus lim_k ε_{k+1} = 0 (see (9)) and lim_k d^k = 0.

Let K ⊂ {0, 1, ...} be an index set such that lim_{k∈K} x^{k+1} = x̄. Since the bounded subsequence {x^k}_K belongs to the open set Ω ⊂ Dom(f_2) and f_2 is convex, by [15, Theorem 3.1.2] the sequence {g_2^k}_K is bounded as well and, therefore, has a cluster point ḡ_2. Let K′ ⊂ K be an index set such that lim_{k∈K′} g_2^k = ḡ_2. By recalling that g_2^k + d^k ∈ ∂_{ε_{k+1}} f_1(x^{k+1}) if (9) holds, and that the subdifferentials of convex functions are closed sets, we conclude that ḡ_2 ∈ ∂f_1(x̄) regardless of the rule (9) or (10) (because lim_{k∈K′} x^{k+1} = x̄ and lim_{k∈K′} g_2^k = ḡ_2). It remains to show that ḡ_2 ∈ ∂f_2(x̄). But this follows from the fact that g_2^k ∈ ∂f_2(x^k) and 0 ≤ lim_{k∈K′} ‖x^k − x̄‖ ≤ lim_{k∈K′} ‖x^k − x^{k+1}‖ + lim_{k∈K′} ‖x^{k+1} − x̄‖ = 0 (where we have used the triangle inequality on ‖x^k − x^{k+1} − (x̄ − x^{k+1})‖ and recalled the limits established above). Therefore, the closedness of the subdifferential of f_2 yields ḡ_2 ∈ ∂f_2(x̄), and the proof is complete. □

Once the convergence analysis of Algorithm 1 has been established, we now turn our attention to its speed of convergence. To this end, we consider a particular instance of Algorithm 1.

3.2 Rate of convergence

Throughout this section, we assume that f_1(x) = ϕ(x) + ‖x‖²/2 and f_2(x) = ψ(x) + ‖x‖²/2, yielding ρ = 1 in A1 and γ ∈ [0, 1/2) in Algorithm 1. We moreover assume that λ = 0 in condition (9) and let g_ψ^k be an arbitrary subgradient of ψ at the point x^k. In this manner, determining x^{k+1} satisfying either (9) or (10) amounts to defining (for g_2^k = g_ψ^k + x^k)

    x^{k+1} = argmin_{x∈R^n} ϕ(x) + ‖x‖²/2 − ⟨g_ψ^k + x^k + γ(x^k − x^{k−1}), x⟩
            = argmin_{x∈R^n} ϕ(x) + (1/2)‖x − (x^k + g_ψ^k + γ(x^k − x^{k−1}))‖²
            = (I + ∂ϕ)^{−1}(x^k + g_ψ^k + γ(x^k − x^{k−1})).

As a result, this choice of parameters makes Algorithm 1 a linearized proximal method with inertial force. This allows us to rely on the analysis of [26, § 4.6] to establish the rate of convergence of this method applied to the DC program (1). To this end, we denote by r(x) the residuum

    r(x) := x − (I + ∂ϕ)^{−1}(x + g_ψ),   with g_ψ ∈ ∂ψ(x).

Notice that if r(x^k) = 0, then x^k solves min_{x∈R^n} ϕ(x) + (1/2)‖x − (x^k + g_ψ^k)‖². Its optimality condition yields g_ψ^k ∈ ∂ϕ(x^k), showing that x^k is a critical point for problem (1), which reads in this particular setting as min_{x∈R^n} ϕ(x) − ψ(x). The following result shows that the rate of convergence of both sequences {‖r(x^k)‖²} and {‖x^{k+1} − x^k‖²} is O(1/k).

Theorem 2 Suppose that f̄ := inf_x f(x) > −∞ and γ ∈ [0, 1/2), and let k > 0. Then

(a)  min_{i∈{0,...,k}} ‖x^{i+1} − x^i‖² ≤ [2/(1−2γ)] · [f(x^0) − f̄]/(k+1),  and

(b)  min_{i∈{0,...,k}} ‖r(x^i)‖² ≤ [16/(1−2γ)] · [f(x^0) − f̄]/(k+1).


Proof Item (a) follows directly from the development

    (k+1) [(1−2γ)/2] min_{i∈{0,...,k}} ‖x^{i+1} − x^i‖² ≤ [(1−2γ)/2] Σ_{i=0}^{k} ‖x^i − x^{i+1}‖² ≤ f(x^0) − f̄,

where the last inequality is due to (12), recalling that λ = 0 and ρ = 1. We now proceed to show item (b). By using the non-expansiveness of the operator (I + ∂ϕ)^{−1} we have that

    ‖x^i − x^{i−1}‖ ≥ γ‖x^i − x^{i−1}‖ = ‖x^i + g_ψ^i + γ(x^i − x^{i−1}) − (x^i + g_ψ^i)‖
                    ≥ ‖(I + ∂ϕ)^{−1}(x^i + g_ψ^i + γ(x^i − x^{i−1})) − (I + ∂ϕ)^{−1}(x^i + g_ψ^i)‖
                    = ‖x^{i+1} − (I + ∂ϕ)^{−1}(x^i + g_ψ^i)‖.    (13)

As a result,

    ‖x^{i+1} − x^i‖ ≥ ‖x^{i+1} − x^i‖ + ‖x^{i+1} − (I + ∂ϕ)^{−1}(x^i + g_ψ^i)‖ − ‖x^i − x^{i−1}‖
                    ≥ ‖x^i − (I + ∂ϕ)^{−1}(x^i + g_ψ^i)‖ − ‖x^i − x^{i−1}‖
                    = ‖r(x^i)‖ − ‖x^i − x^{i−1}‖,

where the first inequality follows from (13) and the second one from the triangle inequality. Therefore, ‖r(x^i)‖ ≤ 2 max{‖x^{i+1} − x^i‖, ‖x^i − x^{i−1}‖}. Item (b) follows from the development

    (k+1) min_{i∈{0,...,k}} ‖r(x^i)‖² ≤ Σ_{i=0}^{k} ‖r(x^i)‖²
        ≤ 4 Σ_{i=0}^{k} max{‖x^{i+1} − x^i‖², ‖x^i − x^{i−1}‖²}
        ≤ 4 [ Σ_{i=0}^{k} ‖x^{i+1} − x^i‖² + Σ_{i=0}^{k} ‖x^i − x^{i−1}‖² ]
        ≤ 8 Σ_{i=0}^{k} ‖x^{i+1} − x^i‖² ≤ [16/(1−2γ)] [f(x^0) − f̄],

where the second inequality holds because the arguments in the max-function are nonnegative, and the last one is again due to (12) with λ = 0 and ρ = 1. □

The above theorem shows that the choice γ = 0 (no inertia) gives a better bound on the complexity of this specific instance of Algorithm 1. However, the numerical experiments in Section 6 indicate that the choice γ > 0 provides better performance in practice.

4 Practical details and implementation issues

In this section we comment on some implementation aspects of Algorithm 1.

4.1 Computing a point satisfying condition (9)

Let us assume the first DC component is given by f_1(x) = ϕ(x) + ω(x) + i_X(x), where ϕ, ω : Ω → R are convex functions, ω is differentiable (e.g. ω ≡ 0), and X ≠ ∅ is a convex set. First, compute (arbitrarily) a subgradient of f_2 at the point x^k (i.e., g_2^k ∈ ∂f_2(x^k)) and define d^k = γ(x^k − x^{k−1}). Second, let ν be an (inner) iteration counter and ϕ̌_ν a (cutting-plane) model approximating ϕ from below, i.e., ϕ̌_ν(x) ≤ ϕ(x) for all x ∈ Ω. Let z^{ν+1} be a solution of the following convex subproblem:

    min_{x∈X} ϕ̌_ν(x) + ω(x) − ⟨g_2^k + d^k, x⟩.    (14)


The optimality condition for this subproblem yields g_2^k + d^k ∈ ∂ϕ̌_ν(z^{ν+1}) + ∇ω(z^{ν+1}) + N_X(z^{ν+1}). Since z^{ν+1} ∈ X, we have that N_X(z^{ν+1}) = ∂i_X(z^{ν+1}) [17]. Hence, by defining ε_{ν+1} := ϕ(z^{ν+1}) − ϕ̌_ν(z^{ν+1}) ≥ 0 we get

    f_1(x) = ϕ(x) + ω(x) + i_X(x) ≥ ϕ̌_ν(x) + ω(x) + i_X(x)
           ≥ ϕ̌_ν(z^{ν+1}) + ω(z^{ν+1}) + i_X(z^{ν+1}) + ⟨g_2^k + d^k, x − z^{ν+1}⟩
           = ϕ(z^{ν+1}) + ω(z^{ν+1}) + i_X(z^{ν+1}) + ⟨g_2^k + d^k, x − z^{ν+1}⟩ − ε_{ν+1},

showing that g_2^k + d^k belongs to ∂_{ε_{ν+1}} f_1(z^{ν+1}). Then the point x^{k+1} := z^{ν+1} satisfies (9) if ε_{ν+1} = ϕ(z^{ν+1}) − ϕ̌_ν(z^{ν+1}) ≤ (λρ/2)‖z^{ν+1} − x^k‖².

We care to mention that specialized bundle algorithms (see [11]) applied to the problem min_{x∈X} ϕ(x) + ω(x) − ⟨g_2^k + d^k, x⟩ automatically generate a model ϕ̌_ν and a sequence {z^ν} satisfying lim_ν [ϕ(z^{ν+1}) − ϕ̌_ν(z^{ν+1})] = 0. As a result, if there is no point x^{k+1} = z^{ν+1} ≠ x^k satisfying (9), then {z^{ν+1}} converges to x^k faster than {ε_{ν+1}} converges to zero. Since g_2^k + d^k ∈ ∂_{ε_{ν+1}} f_1(z^{ν+1}) (= ∂_{ε_{ν+1}} ϕ(z^{ν+1}) + ∇ω(z^{ν+1})), we conclude in this case that g_2^k + d^k ∈ ∂f_1(x^k) by the closedness of the subdifferential of convex functions. This shows that x^k solves the problem min_x f_1(x) − ⟨g_2^k + d^k, x⟩. In this case, if 0 = d^k = γ(x^k − x^{k−1}) (e.g. γ = 0), then x^k is a critical point of (1). Otherwise, Algorithm 1 defines x^{k+1} = x^k, d^{k+1} = 0 and the process continues. Notice that if γ = 0 and x^k is not a critical point, condition (9) can be satisfied after finitely many steps.

This scheme can be seen as an alternative to the DC bundle algorithms [20, 12, 10] if ω(·) := ρ‖·‖²/2 and one employs the following descent test (implying (9)):

A new descent test for proximal DC bundle methods

1: if ϕ(z^{ν+1}) − ϕ̌_ν(z^{ν+1}) ≤ (λρ/2)‖z^{ν+1} − x^k‖² then
2:    perform a serious step (x^{k+1} = z^{ν+1})
3: else
4:    perform a null step (set ν = ν + 1, update the model, and solve (14) again)
5: end if

This new bundle algorithm scheme includes the "inertial procedure" not present in the methods of [20, 12, 10]. However, a downside of the above scheme compared to the algorithms in [20, 12, 10] is that it keeps the prox-parameter fixed (to ρ) along the iterative process.
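For illustration, the following one-dimensional Python sketch implements the inner loop of this subsection with the above descent test, taking ϕ(x) = |x|, ω = ρ|·|²/2 and X = [−10, 10] (all parameter values are hypothetical); null steps enrich the cutting-plane model until the model error passes the test, producing a point satisfying (9):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def phi(x):        # a simple convex test function
    return abs(x)

def sub_phi(x):    # one subgradient of |.| (0 is chosen at the kink)
    return float(np.sign(x))

def inner_loop(xk, target, rho=1.0, lam=0.5, max_inner=50):
    # Builds a cutting-plane model of phi and solves (14) with
    # omega = rho*|.|^2/2 and linear term <target, .>, target = g2^k + d^k;
    # z is accepted as x^{k+1} once the descent test (implying (9)) holds.
    bundle = [(xk, phi(xk), sub_phi(xk))]          # cuts (z_j, phi(z_j), s_j)
    for _ in range(max_inner):
        model = lambda x: max(fz + s * (x - z) for z, fz, s in bundle)
        obj = lambda x: model(x) + 0.5 * rho * x**2 - target * x
        z = minimize_scalar(obj, bounds=(-10.0, 10.0), method="bounded").x
        eps = phi(z) - model(z)                    # model error, >= 0
        if eps <= 0.5 * lam * rho * (z - xk)**2:   # serious step
            return z
        bundle.append((z, phi(z), sub_phi(z)))     # null step: add a cut
    return z

print(inner_loop(xk=2.0, target=0.3))
```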

4.2 Computing a point satisfying condition (10)

We now show how to compute x^{k+1} satisfying (10) for a particular class of problem (1) whose first component f_1 is continuously differentiable and the domains of both f_1 and f_2 are the whole space R^n. Given an arbitrary vector g_2^k ∈ ∂f_2(x^k), let y(x^k) ∈ R^n be a solution of subproblem (7). Under the given assumptions, it follows that ∇f_1(y(x^k)) − (g_2^k + d^k) = 0. Let {z^ν} be a sequence of points generated by a convergent algorithm applied to (7) (e.g. a Newtonian method, [19]) such that lim_{ν∈N_0} z^{ν+1} = y(x^k). Then, by continuity of ∇f_1, we conclude that lim_{ν∈N_0} ∇f_1(z^{ν+1}) = ∇f_1(y(x^k)) = g_2^k + d^k. If there were no index ν ∈ N_0 such that ‖∇f_1(z^{ν+1}) − (g_2^k + d^k)‖ ≤ (λρ/2)‖z^{ν+1} − x^k‖, then {z^{ν+1}}_{N_0} would converge to x^k faster than {∇f_1(z^{ν+1})}_{N_0} converges to g_2^k + d^k. This yields y(x^k) = x^k and ∇f_1(x^k) = g_2^k + d^k, proving that x^k is a critical point of (1) if d^k = 0. Otherwise, Algorithm 1 sets x^{k+1} = x^k (yielding d^{k+1} = 0) and proceeds to the next iteration. (If d^k = 0 and x^k is not a critical point, the condition in (10) can be satisfied after finitely many steps by some inner iterate z^{ν+1}.)

4.3 Related methods

It has been shown that Algorithm 1 boils down to the DC algorithm of [32] if γ = λ = 0, and to a variant of DC proximal bundle methods if condition (9) is enforced by the scheme discussed in Subsection 4.1. We now show that our algorithm extends other known optimization methods.


Suppose that ω : R^n → R_+ is a strongly convex and continuously differentiable function, and that f_1(x) = ϕ(x) + ω(x) + i_X(x) and f_2(x) = ψ(x) + ω(x), i.e., ω is a regularizing function. Under this assumption, g_2^k ∈ ∂f_2(x^k) is given by g_2^k = g_ψ^k + ∇ω(x^k), with g_ψ^k ∈ ∂ψ(x^k). Accordingly, by adding the constant term −ω(x^k) + ⟨g_ψ^k + ∇ω(x^k) + d^k, x^k⟩ to problem (7) we get

    min_{x∈X} ϕ(x) + ω(x) − ω(x^k) − ⟨g_ψ^k + ∇ω(x^k) + d^k, x − x^k⟩
    ≡ min_{x∈X} ϕ(x) − ⟨g_ψ^k + d^k, x − x^k⟩ + D(x, x^k),    (15)

with D(x, x^k) := ω(x) − ω(x^k) − ⟨∇ω(x^k), x − x^k⟩ the Bregman function induced by ω. Hence, Algorithm 1 becomes an Inertial Linearized Proximal Method for DC programming, a new variant of proximal methods. If γ = 0 (yielding d^k = 0) and ω(·) = ‖·‖²/2, then D(x, x^k) = ‖x − x^k‖²/2 and Algorithm 1 is just the linearized proximal method of [29]. If γ > 0 and ω(·) = ‖·‖²/2, then Algorithm 1 can be seen as an extension of the iPiano algorithm of [26] that deals with nonsmooth DC programs and inexact subproblem solutions (the original algorithm of [26] assumes that the nonconvex function −ψ is differentiable and its gradient is Lipschitz continuous, an assumption that is not required here). Suppose now that ψ is a concave function and ω(·) = ‖·‖²/2. Then Algorithm 1 becomes a proximal subgradient splitting method [7] applied to the (now convex) problem min_{x∈X} ϕ(x) − ψ(x).

Furthermore, suppose that ψ := 0. In this case, subproblem (15) yields an inertial iteration of a proximal bundle method applied to the convex program min_{x∈X} ϕ(x). In addition, if the function ϕ in (15) is replaced with a cutting-plane model ϕ̌_ν and the descent test of Subsection 4.1 is employed, Algorithm 1 results in an inertial proximal bundle method for convex programs. Differently from other proximal bundle methods, this is a non-monotone method due to the inertial term.

5 Application of interest: nonconvex image denoising

Image reconstruction techniques have become important tools in computer vision systems and many other applications that require sharp images obtained from noisy/corrupted ones. Convex total variation (TV) formulations have proven to provide a good mathematical basis for several basic operations in image reconstruction [8]. In order to present such formulations, let b ∈ R^n be the vectorization of a corrupted n_1 × n_2 grayscale image B (in this case, n = n_1 · n_2). A TV formulation, with regularizing function φ : R_+ → R, consists in solving the following minimization problem:

    min_{x∈R^n} (μ/2)‖x − b‖² + TV_φ(x),   with   TV_φ(x) := Σ_{i=1}^{n} φ(‖(∇x)_i‖),    (16)

where μ > 0 is a fidelity parameter and (∇x)_i ∈ R² denotes the discretization of the gradient of image x at pixel i, that is, (∇x)_i collects finite-difference approximations of the first-order vertical and horizontal partial derivatives. In the matrix representation X ∈ R^{n_1×n_2} of x ∈ R^n we have

    (∇x)_i := (X_{l+1,j} − X_{l,j}, X_{l,j+1} − X_{l,j})^T,   with i the coordinate of x where the pixel X_{l,j} is stored.

Thus ‖(∇x)_i‖ = √[(X_{l+1,j} − X_{l,j})² + (X_{l,j+1} − X_{l,j})²]. A numpy sketch of this discretization is given below.
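A minimal numpy sketch (assuming replicated boundary pixels, so that the differences vanish on the last row and column; the paper does not state its boundary convention):

```python
import numpy as np

def grad_norms(X):
    # ||(grad x)_i|| for every pixel of the n1-by-n2 image X, via forward
    # differences; appending the last row/column makes boundary differences 0.
    dv = np.diff(X, axis=0, append=X[-1:, :])   # X[l+1, j] - X[l, j]
    dh = np.diff(X, axis=1, append=X[:, -1:])   # X[l, j+1] - X[l, j]
    return np.sqrt(dv**2 + dh**2)

def tv_phi(X, phi=lambda r: r):
    # TV_phi(x) from (16); the default phi(r) = r gives the convex TV.
    return float(np.sum(phi(grad_norms(X))))
```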

If the regularizing function is chosen to be φ(r) = r, then problem (16) is a convex nonsmooth optimization problem that can be efficiently solved by several specialized algorithms, such as the ones proposed in [6, 1, 9]. This is the main benefit of using a convex formulation for image denoising. Nevertheless, nonconvex regularizations have remarkable advantages over convex ones for restoring images, in particular high-quality piecewise constant images with neat edges [25]. In order to preserve edges in the restoration process, some authors [36, 21] employ a nonconvex regularizing function φ to induce sparse image gradients (∇x)_i. This makes (16) a nonconvex and nonsmooth problem, which has recently been dealt with by local (strongly) convex approximations in [21].

In what follows we show that for a wide class of regularizing functions φ : R_+ → R, problem (16) is indeed a DC programming problem with available DC decompositions.


5.1 DC decomposition of nonconvex denoising models

Assume that φ : R_+ → R is a concave and non-decreasing function. As a result, its right derivative

    φ′_+(r) := lim_{h↓0} [φ(r + h) − φ(r)] / h

is well defined for all r ≥ 0. Our goal is to prove that, under these assumptions, the composite function TV_φ(x) = Σ_{i=1}^{n} φ(‖(∇x)_i‖) can be written as a difference of two convex functions. To this end, we will need the following useful result, whose proof can be obtained by combining some developments presented in [34, § 4]. For the sake of completeness, we provide a short proof below.

Lemma 3 Let c : R^n → R_+ be a convex function. If φ : R_+ → R is a concave and non-decreasing function such that φ′_+(0) < ∞, then τ c(x) − φ(c(x)) is a convex function for all τ ≥ φ′_+(0).

Proof The hypotheses on φ imply that φ′_+(r) ≥ 0 for all r ≥ 0 and φ′_+(r_0) ≥ φ′_+(r_1) for all 0 ≤ r_0 ≤ r_1. Let τ ≥ φ′_+(0) ≥ 0 be fixed and define the convex function w(r) = τr − φ(r) for r ≥ 0. It follows that

    w′_+(r) = τ − φ′_+(r) ≥ φ′_+(0) − φ′_+(r) ≥ 0   ∀r ≥ 0,

showing that w is non-decreasing on R_+. As a result, the composite function w(c(x)) is convex as well (because c is). Since w(c(x)) = τ c(x) − φ(c(x)), the result follows. □

Proposition 1 Let φ : R_+ → R be as in Lemma 3, and let the unregularized TV function be given by TV(x) := Σ_{i=1}^{n} ‖(∇x)_i‖. Then τ TV(x) − TV_φ(x) is a convex function for all τ ≥ φ′_+(0) ≥ 0.

Proof Fix a pixel i. Since c(x) = ‖(∇x)_i‖ is convex and non-negative, Lemma 3 ensures that τ‖(∇x)_i‖ − φ(‖(∇x)_i‖) is a convex function. The result thus follows from the fact that the sum of convex functions is convex and from the definitions of TV and TV_φ. □

Since the requirements on the penalizing function φ are mild, Proposition 1 is quite general. For instance, all the functions φ considered in [21] satisfy the assumptions of Proposition 1 with τ ≥ 1. (This ensures that the function f_2 in Example 2 is indeed convex.)

           φ_log           φ_rat            φ_atan                                  φ_exp
φ_p(r)     log(1+pr)/p     r/(1+pr/2)       [atan((1+2pr)/√3) − π/6]/(p√3/2)        (1−exp(−pr))/p
φ′_p(r)    1/(1+pr)        1/(1+pr/2)²      1/(1+pr+p²r²)                           exp(−pr)

Table 2  Some examples of concave, differentiable and non-decreasing penalizing functions φ_p : R_+ → R_+ parameterized by p > 0.
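In code, the penalizers of Table 2 read as follows (Python; the argument (1 + 2pr) in φ_atan is inferred here from its stated derivative, so that entry is a reconstruction):

```python
import numpy as np

# Penalizers of Table 2 and their derivatives; all satisfy dphi(0, p) = 1.
def phi_log(r, p):   return np.log1p(p * r) / p
def dphi_log(r, p):  return 1.0 / (1.0 + p * r)

def phi_rat(r, p):   return r / (1.0 + p * r / 2.0)
def dphi_rat(r, p):  return 1.0 / (1.0 + p * r / 2.0) ** 2

def phi_atan(r, p):
    return (np.arctan((1.0 + 2.0 * p * r) / np.sqrt(3.0)) - np.pi / 6.0) \
        / (p * np.sqrt(3.0) / 2.0)
def dphi_atan(r, p): return 1.0 / (1.0 + p * r + (p * r) ** 2)

def phi_exp(r, p):   return (1.0 - np.exp(-p * r)) / p
def dphi_exp(r, p):  return np.exp(-p * r)
```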

5.2 Specific DC models

We now focus on subproblem (7) resulting from the considered nonconvex image denoising model. Without loss of generality, we assume in the remainder of this paper that the regularizing function φ : R_+ → R satisfies φ′_+(0) = 1 (this is the case for the functions presented in Table 2). In this setting, the DC function reads as

    f(x) = (μ/2)‖x − b‖² + TV_φ(x) = f_1(x) − f_2(x),    (17)

with DC components

    f_1(x) := ((μ+ρ)/2)‖x − b‖² + TV(x)   and   f_2(x) := (ρ/2)‖x − b‖² + TV(x) − TV_φ(x).

In the above definition, ρ > 0 is an arbitrarily chosen parameter to satisfy Assumption A1. In our numerical experiments we set ρ = 1. Since f is nonnegative and its quadratic term is centered around b, f is coercive and thus has bounded level sets, satisfying the assumption of Theorem 1.

With this formulation, subproblem (7) reads as

    min_{x∈R^n} ((μ+ρ)/2)‖x − η^k‖² + TV(x),   with   η^k := b + (g_2^k + γ(x^k − x^{k−1}))/(μ + ρ).    (18)
