DOI 10.1007/s11228-016-0376-5

On Proximal Subgradient Splitting Method for Minimizing the sum of two Nonsmooth Convex Functions

José Yunier Bello Cruz¹ (yunier.bello@gmail.com)

¹ Department of Mathematical Sciences, Northern Illinois University, DeKalb, IL 60115, USA

Received: 3 May 2015 / Accepted: 16 May 2016 / Published online: 30 May 2016

© Springer Science+Business Media Dordrecht 2016

Abstract In this paper we present a variant of the proximal forward-backward splitting iteration for solving nonsmooth optimization problems in Hilbert spaces, when the objective function is the sum of two nondifferentiable convex functions. The proposed iteration, which will be called the Proximal Subgradient Splitting Method, extends the classical subgradient iteration to important classes of problems, exploiting the additive structure of the objective function. The weak convergence of the generated sequence is established using different stepsizes and under suitable assumptions. Moreover, we analyze the complexity of the iterates.

Keywords Convex problems · Nonsmooth optimization problems · Proximal forward-backward splitting iteration · Subgradient method

Mathematics Subject Classification (2010) 65K05 · 90C25 · 90C30

1 Introduction

The purpose of this paper is to study the convergence properties of a variant of the proximal forward-backward splitting method for solving the following optimization problem:

$$\min\ f(x)+g(x)\quad\text{s.t.}\quad x\in\mathcal{H}, \qquad (1)$$

where $\mathcal{H}$ is a nontrivial real Hilbert space, and $f:\mathcal{H}\to\overline{\mathbb{R}}:=\mathbb{R}\cup\{+\infty\}$ and $g:\mathcal{H}\to\overline{\mathbb{R}}$ are two proper lower semicontinuous and convex functions. We are interested in the case where both functions $f$ and $g$ are nondifferentiable, and when the domain of $f$ contains the domain of $g$. The solution set of this problem will be denoted by $S$, which is a closed and convex subset of the domain of $g$.



Problem (1) has recently received much attention from the optimization community due to its broad applications to several different areas such as control, signal processing, system identification, machine learning and image restoration; see, for instance, [18,19,24,32] and the references therein.

A special case of problem (1) is the nonsmooth constrained optimization problem, obtained by taking $g=\delta_C$, where $\delta_C$ is the indicator function of a nonempty closed and convex set $C$ in $\mathcal{H}$, defined by $\delta_C(y):=0$ if $y\in C$ and $+\infty$ otherwise. Then, problem (1) reduces to the constrained minimization problem

$$\min\ f(x)\quad\text{s.t.}\quad x\in C. \qquad (2)$$

Another important case of problem (1), which has attracted much interest in signal denoising and data mining, is the following optimization problem with $\ell_1$-regularization

$$\min\ f(x)+\lambda\|x\|_1\quad\text{s.t.}\quad x\in\mathcal{H}, \qquad (3)$$
where $\lambda>0$ and the norm $\|\cdot\|_1$ is used to induce sparsity in the solutions. Moreover, problem (3) covers the important and well-studied $\ell_1$-regularized least squares minimization problem, when $\mathcal{H}=\mathbb{R}^n$ and $f(x)=\|Ax-b\|_2^2$, where $A\in\mathbb{R}^{m\times n}$, $m\ll n$, and $b\in\mathbb{R}^m$, which is a convex approximation of the very famous $\ell_0$-minimization problem; see [12]. Recently, this problem became popular in signal processing and statistical inference; see, for instance, [23,43].

We focus here on the so-called proximal forward-backward splitting iteration [32], which contains a forward gradient step on $f$ (an explicit step) followed by a backward proximal step on $g$ (an implicit step). The main idea of our approach consists of replacing, in the forward step of the proximal forward-backward splitting iteration, the gradient of $f$ by a subgradient of $f$ (note that here $f$ is assumed nondifferentiable in general). In the particular case that $g$ is the indicator function, the proposed iteration reduces to the classical projected subgradient iteration.

To describe and motivate our iteration, we first recall the definition of the so-called proximal operator $\mathrm{prox}_g:\mathcal{H}\to\mathcal{H}$ associated with a proper lower semicontinuous convex function $g$, where $\mathrm{prox}_g(z)$, for $z\in\mathcal{H}$, is the unique solution of the following strongly convex optimization problem

$$\min\ g(y)+\frac{1}{2}\|y-z\|^2\quad\text{s.t.}\quad y\in\mathcal{H}. \qquad (4)$$

Note that the norm $\|\cdot\|$ is induced by the inner product of $\mathcal{H}$, i.e., $\|x\|:=\sqrt{\langle x,x\rangle}$ for all $x\in\mathcal{H}$. The proximal operator $\mathrm{prox}_g$ is well-defined and has many attractive properties, e.g., it is continuous and firmly nonexpansive, i.e., for all $x,y\in\mathcal{H}$, $\|\mathrm{prox}_g(x)-\mathrm{prox}_g(y)\|^2\le\|x-y\|^2-\|[x-\mathrm{prox}_g(x)]-[y-\mathrm{prox}_g(y)]\|^2$. This nice property can be used to construct algorithms to solve optimization problems [39]; for other properties and algebraic rules see [3,18,19]. If $g=\delta_C$ is the indicator function, the orthogonal projection onto $C$, $P_C(x):=\{y\in C:\|x-y\|=\mathrm{dist}(x,C)\}$, coincides with $\mathrm{prox}_{\delta_C}(x)$ for all $x\in\mathcal{H}$ [2]. For an exhaustive discussion about the evaluation of the proximity operator of a wide variety of functions see Section 6 of [32]. Now, let us recall the definition of the subdifferential operator $\partial g:\mathcal{H}\rightrightarrows\mathcal{H}$ by $\partial g(x):=\{w\in\mathcal{H}: g(y)\ge g(x)+\langle w,\,y-x\rangle,\ \forall y\in\mathcal{H}\}$. We also recall the relation of the proximal operator $\mathrm{prox}_{\alpha g}$ with the subdifferential operator $\partial g$, i.e., $\mathrm{prox}_{\alpha g}=(\mathrm{Id}+\alpha\partial g)^{-1}$, and, as a direct consequence of the first-order optimality condition of (4), we have the following useful inclusion:

$$\frac{z-\mathrm{prox}_{\alpha g}(z)}{\alpha}\in\partial g\big(\mathrm{prox}_{\alpha g}(z)\big), \qquad (5)$$
for any $z\in\mathcal{H}$ and $\alpha>0$.
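For concreteness, here is a minimal numerical sketch (assuming $\mathcal{H}=\mathbb{R}^n$; the function names are illustrative, not from the paper) of two proximal operators relevant to problems (2) and (3): soft-thresholding, which realizes $\mathrm{prox}_{t\|\cdot\|_1}$, and the projection onto a Euclidean ball, which realizes $\mathrm{prox}_{\delta_C}$ for $C=B[0;r]$.

```python
import numpy as np

def prox_l1(z, t):
    """Soft-thresholding: the proximal operator of t*||.||_1 at z (t > 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proj_ball(z, r):
    """Projection onto the closed Euclidean ball B[0; r]; equals the prox of its indicator."""
    nz = np.linalg.norm(z)
    return z if nz <= r else (r / nz) * z

if __name__ == "__main__":
    z = np.array([1.5, -0.2, 0.7])
    print(prox_l1(z, 0.5))     # componentwise shrinkage toward zero
    print(proj_ball(z, 1.0))   # rescaling onto the unit ball
```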

The iteration proposed in this paper, called Proximal Subgradient Splitting Method, is motivated by the well-known fact that $x^*\in S$ if and only if there exists $u^*\in\partial f(x^*)$ such that $x^*=\mathrm{prox}_{\alpha g}(x^*-\alpha u^*)$. Thus, the iteration generalizes the proximal forward-backward splitting iteration for the differentiable case, as a fixed point iteration of the above equation, which is defined as follows: starting at $x^0$ belonging to the domain of $g$, set

$$x^{k+1}=\mathrm{prox}_{\alpha_k g}\big(x^k-\alpha_k u^k\big), \qquad (6)$$

where $u^k\in\partial f(x^k)$ and the stepsize $\alpha_k$ is positive for all $k\in\mathbb{N}$. Iteration (6) recovers the classical subgradient iteration [38], when $g\equiv0$, and the proximal point iteration [39], when $f\equiv0$. Moreover, it covers important situations in which $f$ is nondifferentiable, and it can also be seen as a forward-backward Euler discretization of the subgradient flow differential inclusion

$$\dot x(t)\in-\big[\partial f(x(t))+\partial g(x(t))\big],$$
with variable $x:\mathbb{R}_+\to\mathcal{H}$; see [32]. Actually, if the derivative on the left side is replaced by the divided difference $(x^{k+1}-x^k)/\alpha_k$, then the discretization obtained is $(x^k-x^{k+1})/\alpha_k\in\partial f(x^k)+\partial g(x^{k+1})$, which is the proximal subgradient iteration (6).

The nondifferentiability of the function $f$ has a direct impact on the computational effort, and the importance of such problems, when $f$ is nonsmooth, is underlined because they occur frequently in applications. Nondifferentiability arises, for instance, in the problem of minimizing the total variation of a signal over a convex set, in the problem of minimizing the sum of two set-distance functions, in problems involving maxima of convex functions, in Dantzig selector-type problems, in the non-Gaussian image denoising problem, and in Tikhonov regularization problems with $L^1$ norms, among others; see, for instance, [4,13,17,26]. The iteration of the proximal subgradient splitting method, proposed in (6), can be applied in these important instances, extending the classical subgradient iteration to more general problems such as (3). In problem (1), $f$ is usually assumed to be differentiable, as in [35], which is not necessarily the case in this work. Moreover, the convergence of the iteration (6) to a solution of (1) has been established in the literature when the gradient of $f$ is globally Lipschitz continuous and the stepsizes $\alpha_k$, $k\in\mathbb{N}$, are chosen very small, i.e., for all $k$, $\alpha_k$ is less than some constant related to the Lipschitz constant of the gradient of $f$; see, for instance, [19]. Recently, when $f$ is continuously differentiable but the Lipschitz constant is not available, the steplengths can be chosen using backtracking procedures; see [6,10,32,35].

It is important to mention that the forward-backward iteration also finds applications in solving more general problems, like variational inequality and inclusion problems; see, for instance, [9,11,14,15,42] and the references therein. On the other hand, the standard convergence analysis of this iteration, for solving these general problems, requires at least a co-coercivity assumption on the operator, together with stepsizes lying within a suitable interval; see, for instance, Theorem 25.8 of [3]. Note that co-coercive operators are monotone and Lipschitz continuous, but the converse does not hold in general; see [44]. However, for gradients of lower semicontinuous, proper and convex functions, co-coercivity is equivalent to the global Lipschitz continuity assumption. This nice and surprising fact, which is strongly used in the convergence analysis of the proximal forward-backward method for problem (1) when $f$ is differentiable, is known as the Baillon-Haddad Theorem; see Corollary 18.16 of [3].

The main aim of this work is to remove the differentiability assumption on $f$ in the forward-backward splitting method, extending the classical projected subgradient method and containing, as a particular case, a new proximal subgradient iteration for more general problems.

This work is organized as follows. The next subsection provides our notation and assumptions, and some preliminary results that will be used in the remainder of this paper. The proximal subgradient splitting method and its weak convergence are analyzed, for different choices of stepsizes, in Section 2. Finally, Section 3 gives some concluding remarks.

1.1 Assumptions and Preliminaries

In this section, we present our assumptions, classical definitions and some results needed for the convergence analysis of the proposed method.

We start by recalling some definitions and notation used in this paper, which are standard and follow [3,32]. Throughout this paper, we write $p:=q$ to indicate that $p$ is defined to be equal to $q$. We write $\mathbb{N}$ for the nonnegative integers $\{0,1,2,\dots\}$ and recall that the extended real number system is $\overline{\mathbb{R}}:=\mathbb{R}\cup\{+\infty\}$. The closed ball centered at $x\in\mathcal{H}$ with radius $\gamma>0$ will be denoted by $B[x;\gamma]$, i.e., $B[x;\gamma]:=\{y\in\mathcal{H}:\|y-x\|\le\gamma\}$. The domain of any function $h:\mathcal{H}\to\overline{\mathbb{R}}$, denoted by $\mathrm{dom}(h)$, is defined as $\mathrm{dom}(h):=\{x\in\mathcal{H}: h(x)<+\infty\}$. The optimal value of problem (1) will be denoted by $s:=\inf\{(f+g)(x): x\in\mathcal{H}\}$, noting that when $S\ne\emptyset$, $s=\min\{(f+g)(x): x\in\mathcal{H}\}=(f+g)(x^*)$ for any $x^*\in S$. Finally, $\ell_1(\mathbb{N})$ denotes the set of summable sequences in $[0,+\infty)$.

Throughout this paper we assume the following:

A1. $\partial f$ is bounded on bounded sets of the domain of $g$, i.e., there exists $\zeta=\zeta(V)>0$ such that $\partial f(x)\subseteq B[0;\zeta]$ for all $x\in V$, where $V$ is any bounded and closed subset of $\mathrm{dom}(g)$.

A2. $\partial g$ has bounded elements on the domain of $g$, i.e., there exists $\rho\ge0$ such that $\partial g(x)\cap B[0;\rho]\ne\emptyset$ for all $x\in\mathrm{dom}(g)$.

In connection with Assumption A1, we recall that $\partial f$ is locally bounded on its open domain. In finite-dimensional spaces, this result implies that A1 always holds when $\mathrm{dom}(f)$ is open. A widely used sufficient condition for A1 is the Lipschitz continuity of $f$ on $\mathrm{dom}(g)$.

Furthermore, the boundedness of the subgradients is crucial for the convergence analysis of many classical subgradient methods in Hilbert spaces and it has been widely considered in the literature; see, for instance, [1,8,9,38].

Regarding Assumption A2, we emphasize that it holds trivially for important instances of problem (1), e.g., problems (2) and (3), because $\partial\delta_C(x)=N_C(x)$ and $\partial\|x\|_1=\{u\in\mathcal{H}:\|u\|_\infty\le1,\ \langle u,x\rangle=\|x\|_1\}$, respectively, or when $\mathrm{dom}(g)$ is a bounded set, or also when $\mathcal{H}$ is a finite-dimensional space. Note that Assumption A2 allows instances where $\partial g(x)$ is an unbounded set, as is the case in particular when $g$ is the indicator function. It is an existence condition, which is in general weaker than A1.
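As a small numerical illustration of Assumption A2 in the setting of problem (3) (assuming $\mathcal{H}=\mathbb{R}^n$; the helper name is hypothetical), the sign vector is a bounded element of $\partial\|x\|_1$, and the subgradient inequality can be checked directly:

```python
import numpy as np

def l1_subgradient(x):
    """Return sign(x), an element of the subdifferential of ||.||_1 at x
    (it satisfies ||u||_inf <= 1 and <u, x> = ||x||_1)."""
    return np.sign(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
u = l1_subgradient(x)
# Subgradient inequality: ||y||_1 >= ||x||_1 + <u, y - x> for every y.
for _ in range(1000):
    y = rng.standard_normal(5)
    assert np.linalg.norm(y, 1) >= np.linalg.norm(x, 1) + u @ (y - x) - 1e-12
print("sup-norm of u:", np.linalg.norm(u, np.inf))  # bounded by 1, uniformly in x
```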

Let us end this section by recalling the well-known concepts of quasi-Fejér and Fejér convergence.

Definition 1.1 Let $S$ be a nonempty subset of $\mathcal{H}$. A sequence $(x^k)_{k\in\mathbb{N}}$ in $\mathcal{H}$ is said to be quasi-Fejér convergent to $S$ if and only if for all $x\in S$ there exists a sequence $(\epsilon_k)_{k\in\mathbb{N}}$ in $\ell_1(\mathbb{N})$ such that $\|x^{k+1}-x\|^2\le\|x^k-x\|^2+\epsilon_k$ for all $k\in\mathbb{N}$. When $(\epsilon_k)_{k\in\mathbb{N}}$ is the null sequence, we say that $(x^k)_{k\in\mathbb{N}}$ is Fejér convergent to $S$.

This definition originates in [22] and has been elaborated further in [16]. In the following we present two well-known facts for quasi-Fejér convergent sequences.


Fact 1.1 If the sequence $(x^k)_{k\in\mathbb{N}}$ is quasi-Fejér convergent to $S$, then:

(a) The sequence $(x^k)_{k\in\mathbb{N}}$ is bounded.

(b) $(x^k)_{k\in\mathbb{N}}$ is weakly convergent if and only if all weak accumulation points of $(x^k)_{k\in\mathbb{N}}$ belong to $S$.

Proof Item (a) follows from Proposition 3.3(i) of [16], and Item (b) follows from Theorem 3.8 of [16].

2 The Proximal Subgradient Splitting Method

In this section we propose the proximal subgradient splitting method, extending the classical subgradient iteration. We prove that the sequence of points generated by the proposed method converges weakly to a solution of (1) under different strategies for choosing the stepsizes. Moreover, we present a complexity analysis for the generated sequence.

The method is formally stated as follows:

PSS Method. Initialization step: take $x^0\in\mathrm{dom}(g)$. Iterative step: given $x^k$, choose $u^k\in\partial f(x^k)$ and a stepsize $\alpha_k>0$, and compute $x^{k+1}$ by (6), i.e., $x^{k+1}=\mathrm{prox}_{\alpha_k g}(x^k-\alpha_k u^k)$. Stopping criterion: if $x^{k+1}=x^k$, then stop.

If PSS Method stops at step $k$, then $x^k=\mathrm{prox}_{\alpha_k g}(x^k-\alpha_k u^k)$ with $u^k\in\partial f(x^k)$, implying that $x^k$ is a solution of problem (1). Then, from now on, we assume that PSS Method generates an infinite sequence $(x^k)_{k\in\mathbb{N}}$. Moreover, it follows directly from (6) that the sequence $(x^k)_{k\in\mathbb{N}}$ belongs to $\mathrm{dom}(g)$.
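As an illustration only, the following minimal sketch implements iteration (6) in $\mathbb{R}^n$ on an assumed nonsmooth instance of problem (3), namely $f(x)=\|Ax-b\|_1$ and $g=\lambda\|\cdot\|_1$ (for which $A^\top\mathrm{sign}(Ax-b)\in\partial f(x)$), using the exogenous stepsizes analyzed in Section 2.1 below and the best-value bookkeeping of (9). The data and helper names are hypothetical, not from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pss(A, b, lam, x0, iters=500):
    """Proximal subgradient splitting iteration (6) for
       min ||Ax - b||_1 + lam*||x||_1, a nonsmooth instance of problem (3).
       Uses exogenous stepsizes alpha_k = beta_k / max(1, ||u^k||) with
       beta_k = 1/(k+1), which satisfy condition (13)."""
    x = x0.copy()
    F = lambda v: np.linalg.norm(A @ v - b, 1) + lam * np.linalg.norm(v, 1)
    best = F(x)                                  # (f+g)^0_best, recursion (9)
    for k in range(iters):
        u = A.T @ np.sign(A @ x - b)             # a subgradient of f at x^k
        beta = 1.0 / (k + 1)
        alpha = beta / max(1.0, np.linalg.norm(u))
        x = soft_threshold(x - alpha * u, alpha * lam)  # x^{k+1} = prox_{alpha g}(x^k - alpha u^k)
        best = min(best, F(x))                   # (f+g)^k_best
    return x, best

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((20, 50))
    xtrue = np.zeros(50); xtrue[:3] = [1.0, -2.0, 0.5]
    b = A @ xtrue
    x, best = pss(A, b, lam=0.1, x0=np.zeros(50))
    print("best objective value found:", best)
```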

Before the formal analysis of the convergence properties of PSS Method, we discuss below the necessity of taking a (forward) subgradient step on $f$ instead of another (backward) proximal step.

Remark 2.1 To evaluate the proximal operator of $f$, it is necessary to solve a strongly convex minimization problem of the form (4). Thus, in the context of problem (1), we assume that it is hard to evaluate the proximal operator of $f$, ruling out the possibility of using the standard and very powerful iteration known as the Douglas-Rachford splitting method, presented in [17].

Such situations appear mainly when $f$ has a complicated algebraic expression, and therefore it may be impossible to solve subproblem (4) explicitly or efficiently. Indeed, very often in applications the formula for the proximity operator is not available in closed form, and ad hoc algorithms or approximation procedures have to be used to compute $\mathrm{prox}_{\alpha f}$. This happens, for instance, when applying proximal methods to image deblurring with total variation [5], or to structured sparsity regularization problems in machine learning and inverse problems [31]. A classical problem of the form (1), in which a subgradient of $f$ is easily available while $\mathrm{prox}_f$ does not have an explicit formula, is the dual formulation of the following constrained convex problem:

$$\min\ h_0(y)\quad\text{subject to}\quad h_i(y)\le0\ \ (i=1,\dots,n), \qquad (7)$$


where $h_i:\mathbb{R}^m\to\mathbb{R}$ $(i=0,\dots,n)$ are convex. It has as dual problem
$$\min_{x\in\mathbb{R}^n}\ f(x)+\delta_{\mathbb{R}^n_+}(x),$$
with $f:\mathbb{R}^n\to\overline{\mathbb{R}}$ defined as $f(x)=-\inf_{y\in\mathbb{R}^m}\big\{h_0(y)+\sum_{i=1}^{n}x_i h_i(y)\big\}$. It is well known that
$$\partial f(x)=\mathrm{conv}\Big\{-h(y_x)\ :\ -f(x)=h_0(y_x)+\sum_{i=1}^{n}x_i h_i(y_x)\Big\},$$
where $h(y):=(h_1(y),\dots,h_n(y))$ and $\mathrm{conv}\{S\}$ denotes the convex hull of a set $S$.
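To make this concrete, the following minimal sketch uses an assumed toy instance (not from the paper) with $h_0(y)=\tfrac12\|y\|^2$ and affine $h_i(y)=a_i^\top y+c_i$, so that the inner infimum has the closed-form minimizer $y_x=-A^\top x$; the vector $-h(y_x)$ is then a subgradient (here, in fact the gradient) of the dual function $f$, obtained simply by solving the inner problem.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 6
A = rng.standard_normal((n, m))   # rows a_i of the affine constraints h_i(y) = a_i^T y + c_i
c = rng.standard_normal(n)

def inner_minimizer(x):
    """Minimizer y_x of h0(y) + sum_i x_i h_i(y), with h0(y) = 0.5*||y||^2."""
    return -A.T @ x

def dual_value_and_subgradient(x):
    """f(x) = -inf_y {h0(y) + sum_i x_i h_i(y)} and a subgradient -h(y_x) of f at x."""
    y = inner_minimizer(x)
    val = -(0.5 * y @ y + x @ (A @ y + c))
    sub = -(A @ y + c)            # -(h_1(y_x), ..., h_n(y_x))
    return val, sub

# Numerical check against the closed form f(x) = 0.5*||A^T x||^2 - c^T x,
# whose gradient is A A^T x - c (which equals -h(y_x) here).
x = rng.standard_normal(n)
val, sub = dual_value_and_subgradient(x)
assert np.isclose(val, 0.5 * np.linalg.norm(A.T @ x) ** 2 - c @ x)
assert np.allclose(sub, A @ A.T @ x - c)
print("subgradient of the dual function at x:", sub)
```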

However, computing $\mathrm{prox}_f$ does not look like an easy problem. This argument is widely used in the literature to motivate the projected subgradient method, which can be easily modified to cover problems like (1), where $g$ is not necessarily the indicator function. Indeed, consider problem (7) when $n=m$, with an additional and simple constraint $g_0$, that is,
$$\min_{y\in\mathbb{R}^n}\ h_0(y)\quad\text{subject to}\quad h_i(y)\le0\ \ (i=1,\dots,n),\ \ g_0(y)\le0, \qquad (8)$$
which can now be rewritten as $\min_{x\in\mathbb{R}^n}f(x)+\delta_{\mathbb{R}^n_+}(x)+\lambda g_0(x)$, $\lambda>0$, where
$$f(x)=-\inf_{y\in\mathbb{R}^n}\Big\{h_0(y)+\sum_{i=1}^{n}x_i h_i(y)\Big\}.$$
This last problem is a particular case of (1), by taking $g=\delta_{\mathbb{R}^n_+}+\lambda g_0$. Note that if $\mathrm{dom}(g_0)\subseteq\mathbb{R}^n_+$, then $g=\lambda g_0$.

Thus, PSS Method uses the proximal operator of $g$ and an explicit subgradient step on $f$ (i.e., the proximal operator of $f$ is never evaluated), which is, in general, much easier to implement than the proximal operator of $f+g$ or of $f$, as required by the standard proximal point iteration and the Douglas-Rachford algorithm, respectively, for solving nonsmooth problems such as (1); see, for instance, [17]. Furthermore, note that in our case the subgradient iteration applied to the sum $f+g$ is not possible, because the domains of $f$ and $g$ are not the whole space.

In the following we prove a crucial property of the iterates generated by PSS Method.

Lemma 2.1 Let $(x^k)_{k\in\mathbb{N}}$ and $(u^k)_{k\in\mathbb{N}}$ be the sequences generated by PSS Method. Then, for all $k\in\mathbb{N}$ and $x\in\mathrm{dom}(g)$,
$$\|x^{k+1}-x\|^2\le\|x^k-x\|^2+2\alpha_k\big[(f+g)(x)-(f+g)(x^k)\big]+\alpha_k^2\|u^k+w^k\|^2,$$
where $w^k\in\partial g(x^k)$ is arbitrary.

Proof Take any $x\in\mathrm{dom}(g)$. Note that (5) and (6) imply that $\bar w^{k+1}:=\frac{x^k-x^{k+1}}{\alpha_k}-u^k$, with $u^k\in\partial f(x^k)$ as defined by PSS Method, belongs to $\partial g(x^{k+1})$; in particular, $x^k-x^{k+1}=\alpha_k(u^k+\bar w^{k+1})$. Then,
$$
\begin{aligned}
\alpha_k^2\|u^k+\bar w^{k+1}\|^2+\|x^k-x\|^2-\|x^{k+1}-x\|^2
&=\|x^{k+1}-x^k\|^2+\|x^k-x\|^2-\|x^{k+1}-x\|^2\\
&=2\langle x^k-x^{k+1},\,x^k-x\rangle\\
&=2\alpha_k\langle u^k,\,x^k-x\rangle+2\langle x^k-x^{k+1}-\alpha_k u^k,\,x^k-x\rangle\\
&=2\alpha_k\langle u^k,\,x^k-x\rangle+2\alpha_k\Big\langle\tfrac{x^k-x^{k+1}}{\alpha_k}-u^k,\,x^{k+1}-x\Big\rangle
 +2\alpha_k\Big\langle\tfrac{x^k-x^{k+1}}{\alpha_k}-u^k,\,x^k-x^{k+1}\Big\rangle\\
&=2\alpha_k\langle u^k,\,x^k-x\rangle+2\alpha_k\langle\bar w^{k+1},\,x^{k+1}-x\rangle
 +2\alpha_k\langle u^k,\,x^{k+1}-x^k\rangle+2\|x^k-x^{k+1}\|^2.
\end{aligned}
$$
Now, using again that $\frac{x^k-x^{k+1}}{\alpha_k}-u^k=\bar w^{k+1}\in\partial g(x^{k+1})$ and the convexity of $g$ and $f$, the above equality leads to
$$
\begin{aligned}
2\langle x^k-x^{k+1},\,x^k-x\rangle
&\ge2\alpha_k\big[f(x^k)-f(x)+g(x^{k+1})-g(x)\big]+2\alpha_k\langle u^k,\,x^{k+1}-x^k\rangle+2\|x^k-x^{k+1}\|^2\\
&=2\alpha_k\big[(f+g)(x^k)-(f+g)(x)\big]+2\alpha_k\big[g(x^{k+1})-g(x^k)\big]
 +2\alpha_k\langle u^k,\,x^{k+1}-x^k\rangle+2\alpha_k^2\|u^k+\bar w^{k+1}\|^2\\
&\ge2\alpha_k\big[(f+g)(x^k)-(f+g)(x)\big]+2\alpha_k\langle w^k+u^k,\,x^{k+1}-x^k\rangle+2\alpha_k^2\|u^k+\bar w^{k+1}\|^2,
\end{aligned}
$$
for any $w^k\in\partial g(x^k)$. We thus have shown that
$$
\begin{aligned}
\|x^{k+1}-x\|^2
&\le\|x^k-x\|^2+2\alpha_k\big[(f+g)(x)-(f+g)(x^k)\big]+2\alpha_k^2\langle u^k+w^k,\,u^k+\bar w^{k+1}\rangle-\alpha_k^2\|u^k+\bar w^{k+1}\|^2\\
&=\|x^k-x\|^2+2\alpha_k\big[(f+g)(x)-(f+g)(x^k)\big]+\alpha_k^2\|u^k+w^k\|^2-\alpha_k^2\|w^k-\bar w^{k+1}\|^2.
\end{aligned}
$$
Note that $w^k\in\partial g(x^k)$ is arbitrary and the result follows.

Since subgradient methods, like the method proposed here, are not descent methods, it is common to keep track of the best point found so far, i.e., the one with minimum function value among the iterates. At each step, we set it recursively as $(f+g)^0_{\mathrm{best}}:=(f+g)(x^0)$ and
$$(f+g)^k_{\mathrm{best}}:=\min\big\{(f+g)^{k-1}_{\mathrm{best}},\,(f+g)(x^k)\big\}, \qquad (9)$$
for all $k$. Since $\big((f+g)^k_{\mathrm{best}}\big)_{k\in\mathbb{N}}$ is a decreasing sequence, it has a limit (which can be $-\infty$).

When the function $f$ is differentiable and its gradient is Lipschitz continuous, it is possible to prove the complexity of the iterates generated by PSS Method; see [35]. In our setting ($f$ is not necessarily differentiable) we expect, of course, slower convergence.

Next we present a convergence rate result for the sequence of best functional values $\big((f+g)^k_{\mathrm{best}}\big)_{k\in\mathbb{N}}$ toward $\min\{(f+g)(x):x\in\mathcal{H}\}$.

Lemma 2.2 Let $\big((f+g)^k_{\mathrm{best}}\big)_{k\in\mathbb{N}}$ be the sequence defined by (9). If $S\ne\emptyset$ then, for all $k\in\mathbb{N}$,
$$(f+g)^k_{\mathrm{best}}-\min_{x\in\mathcal{H}}(f+g)(x)\le\frac{[\mathrm{dist}(x^0,S)]^2+C_k\sum_{i=0}^{k}\alpha_i^2}{2\sum_{i=0}^{k}\alpha_i},$$
where $C_k:=\max\big\{\|u^i+w^i\|^2\ :\ 0\le i\le k\big\}$ with $w^i\in\partial g(x^i)$ $(i=0,\dots,k)$ arbitrary.

Proof Define $x^*:=P_S(x^0)$. Note that $x^*$ exists because $S$ is a nonempty closed and convex subset of $\mathcal{H}$. By applying Lemma 2.1 $k+1$ times, for $i\in\{0,1,\dots,k\}$, at $x^*\in S$, we get
$$
\begin{aligned}
\|x^{k+1}-x^*\|^2
&\le\|x^k-x^*\|^2+2\alpha_k\big[(f+g)(x^*)-(f+g)(x^k)\big]+\alpha_k^2\|u^k+w^k\|^2\\
&\le\|x^0-x^*\|^2+2\sum_{i=0}^{k}\alpha_i\big[(f+g)(x^*)-(f+g)(x^i)\big]+\sum_{i=0}^{k}\alpha_i^2\|u^i+w^i\|^2\\
&\le[\mathrm{dist}(x^0,S)]^2+2\Big[\min_{x\in\mathcal{H}}(f+g)(x)-(f+g)^k_{\mathrm{best}}\Big]\sum_{i=0}^{k}\alpha_i+C_k\sum_{i=0}^{k}\alpha_i^2, \qquad (10)
\end{aligned}
$$
where $(f+g)^k_{\mathrm{best}}$ is defined by (9), and the result follows after simple algebra.


Next we establish the rate of convergence of the objective values at the ergodic sequence $(\bar x^k)_{k\in\mathbb{N}}$ of $(x^k)_{k\in\mathbb{N}}$, which is defined recursively as $\bar x^0=x^0$ and, given $\sigma_0=\alpha_0$ and $\sigma_k=\sigma_{k-1}+\alpha_k$, we define
$$\bar x^k=\left(1-\frac{\alpha_k}{\sigma_k}\right)\bar x^{k-1}+\frac{\alpha_k}{\sigma_k}\,x^k.$$
After an easy induction, we have $\sigma_k=\sum_{i=0}^{k}\alpha_i$ and
$$\bar x^k=\frac{1}{\sigma_k}\sum_{i=0}^{k}\alpha_i x^i, \qquad (11)$$
for all $k\in\mathbb{N}$.
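As a quick illustration (with arbitrary synthetic data, not from the paper), the following sketch runs the recursive update above and checks numerically that it reproduces the weighted average (11):

```python
import numpy as np

rng = np.random.default_rng(3)
alphas = rng.uniform(0.1, 1.0, size=10)   # stepsizes alpha_0, ..., alpha_9
xs = rng.standard_normal((10, 3))         # iterates x^0, ..., x^9

# Recursive form: xbar^k = (1 - alpha_k/sigma_k) * xbar^{k-1} + (alpha_k/sigma_k) * x^k.
xbar = xs[0].copy()
sigma = alphas[0]
for k in range(1, len(xs)):
    sigma += alphas[k]
    xbar = (1.0 - alphas[k] / sigma) * xbar + (alphas[k] / sigma) * xs[k]

# Closed form (11): xbar^k = (1/sigma_k) * sum_i alpha_i * x^i.
xbar_direct = (alphas[:, None] * xs).sum(axis=0) / alphas.sum()
assert np.allclose(xbar, xbar_direct)
print(xbar)
```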

The following result is similar to Lemma 2.2, considering now the ergodic sequence defined by (11).

Lemma 2.3 Let $(\bar x^k)_{k\in\mathbb{N}}$ be the ergodic sequence defined by (11). If $S\ne\emptyset$, then
$$(f+g)(\bar x^k)-\min_{x\in\mathcal{H}}(f+g)(x)\le\frac{[\mathrm{dist}(x^0,S)]^2+C_k\sum_{i=0}^{k}\alpha_i^2}{2\sum_{i=0}^{k}\alpha_i},$$
where $C_k=\max\big\{\|u^i+w^i\|^2\ :\ 0\le i\le k\big\}$ with $w^i\in\partial g(x^i)$ $(i=0,\dots,k)$ arbitrary.

Proof Proceeding as in the proof of Lemma 2.2 up to inequality (10), and dividing by $2\sum_{i=0}^{k}\alpha_i$, we get
$$
\begin{aligned}
\sum_{i=0}^{k}\frac{\alpha_i}{\sigma_k}\Big[(f+g)(x^i)-\min_{x\in\mathcal{H}}(f+g)(x)\Big]
&\le\frac{1}{2\sigma_k}\Big([\mathrm{dist}(x^0,S)]^2-\|x^{k+1}-x^*\|^2\Big)+\frac{C_k}{2\sigma_k}\sum_{i=0}^{k}\alpha_i^2\\
&\le\frac{1}{2\sigma_k}\Big([\mathrm{dist}(x^0,S)]^2+C_k\sum_{i=0}^{k}\alpha_i^2\Big), \qquad (12)
\end{aligned}
$$
where $\sigma_k:=\sum_{i=0}^{k}\alpha_i$. Using the convexity of $f+g$, after noting that $\frac{\alpha_i}{\sigma_k}\in[0,1]$ for all $i\in\{0,1,\dots,k\}$ and $\sum_{i=0}^{k}\frac{\alpha_i}{\sigma_k}=1$, together with (11) in the above inequality (12), the result follows.

Next we focus on constant stepsizes, motivated by the fact that we are interested in quantifying the progress of the proposed method toward an approximate solution.

Corollary 2.4 Let $(x^k)_{k\in\mathbb{N}}$ be the sequence generated by PSS Method with the stepsizes $\alpha_k$ constant equal to $\alpha$, let $\big((f+g)^k_{\mathrm{best}}\big)_{k\in\mathbb{N}}$ be the sequence defined by (9) and let $(\bar x^k)_{k\in\mathbb{N}}$ be the ergodic sequence as in (11). Then, the iteration attains the optimal rate at $\alpha=\frac{\mathrm{dist}(x^0,S)}{\sqrt{C_k}}\cdot\frac{1}{\sqrt{k+1}}$, i.e., for all $k\in\mathbb{N}$,
$$(f+g)^k_{\mathrm{best}}-\min_{x\in\mathcal{H}}(f+g)(x)\le\frac{[\mathrm{dist}(x^0,S)]^2+\alpha^2(k+1)C_k}{2(k+1)\alpha}\le\frac{\mathrm{dist}(x^0,S)\cdot\sqrt{C_k}}{\sqrt{k+1}}$$
and
$$(f+g)(\bar x^k)-\min_{x\in\mathcal{H}}(f+g)(x)\le\frac{[\mathrm{dist}(x^0,S)]^2+\alpha^2(k+1)C_k}{2(k+1)\alpha}\le\frac{\mathrm{dist}(x^0,S)\cdot\sqrt{C_k}}{\sqrt{k+1}},$$
where $C_k=\max\big\{\|u^i+w^i\|^2\ :\ 0\le i\le k\big\}$ with $w^i\in\partial g(x^i)$ $(i=0,\dots,k)$ arbitrary.


Proof If we consider constant stepsizes, i.e., $\alpha_k=\alpha$ for all $k\in\mathbb{N}$, then the optimal rate is obtained when $\alpha=\frac{\mathrm{dist}(x^0,S)}{\sqrt{C_k}}\cdot\frac{1}{\sqrt{k+1}}$, which follows from minimizing the right-hand side of Lemmas 2.2 and 2.3.
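For completeness, the minimization behind this choice of $\alpha$ is the following routine calculus step (written out here; it is not a display from the paper). Writing $d:=\mathrm{dist}(x^0,S)$ and $C:=C_k$, the common right-hand side with $\alpha_i\equiv\alpha$ is
$$\phi(\alpha)=\frac{d^2+\alpha^2(k+1)C}{2(k+1)\alpha}=\frac{d^2}{2(k+1)\alpha}+\frac{C\alpha}{2},\qquad
\phi'(\alpha)=-\frac{d^2}{2(k+1)\alpha^2}+\frac{C}{2}=0\iff\alpha=\frac{d}{\sqrt{C}}\cdot\frac{1}{\sqrt{k+1}},$$
and substituting this $\alpha$ back gives $\phi(\alpha)=\frac{d\sqrt{C}}{\sqrt{k+1}}$, which is the bound stated in Corollary 2.4.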

Note that, under Assumption A2, $C_k\le\big(\max_{0\le i\le k}\|u^i\|+\rho\big)^2$. Hence, when $\partial f$ is bounded on $\mathrm{dom}(g)$ (which occurs, under our assumptions, when $\mathrm{dom}(g)$ is bounded), Assumption A1 implies that $C_k\le(\zeta+\rho)^2$ for all $k\in\mathbb{N}$. In this case, our analysis shows that the expected error of the iterates generated by PSS Method with constant stepsizes after $k$ iterations is $O\big((k+1)^{-1/2}\big)$. Hence, we can find an $\varepsilon$-solution of problem (1) within $O(\varepsilon^{-2})$ iterations. Of course, this is worse than the rate $O(k^{-1})$, and the $O(\varepsilon^{-1})$ iteration count, of the proximal forward-backward iteration for differentiable and convex $f$ with Lipschitz continuous gradient; see, for instance, [35]. However, as shown in Theorem 3.2.1 of [37], the worst-case error after $k$ iterations of the classical subgradient iteration is attained and equal to $O\big((k+1)^{-1/2}\big)$ for general nonsmooth problems.
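Making the iteration count explicit (a direct rearrangement of the bound above), the optimal constant stepsize yields an $\varepsilon$-solution as soon as
$$\frac{\mathrm{dist}(x^0,S)\,\sqrt{C_k}}{\sqrt{k+1}}\le\varepsilon\iff k+1\ge\frac{[\mathrm{dist}(x^0,S)]^2\,C_k}{\varepsilon^2},$$
and, since $C_k\le(\zeta+\rho)^2$ in the case discussed above, it suffices that $k+1\ge\frac{[\mathrm{dist}(x^0,S)]^2(\zeta+\rho)^2}{\varepsilon^2}$, i.e., $O(\varepsilon^{-2})$ iterations.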

2.1 Exogenous Stepsizes

In this subsection we analyze the convergence of PSS Method using exogenous stepsizes, i.e., the positive exogenous sequence of stepsizes $(\alpha_k)_{k\in\mathbb{N}}$ satisfies $\alpha_k=\frac{\beta_k}{\eta_k}$, where $\eta_k:=\max\{1,\|u^k\|\}$ for all $k$, and
$$\sum_{k=0}^{\infty}\beta_k^2<+\infty\qquad\text{and}\qquad\sum_{k=0}^{\infty}\beta_k=+\infty. \qquad (13)$$
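As a concrete admissible choice (an illustrative assumption; any sequence satisfying (13) works), $\beta_k=1/(k+1)$ gives $\sum_k\beta_k^2<+\infty$ and $\sum_k\beta_k=+\infty$. A minimal sketch of the resulting stepsize rule:

```python
import numpy as np

def exogenous_stepsize(k, u):
    """Stepsize alpha_k = beta_k / eta_k with beta_k = 1/(k+1) (so that
    sum beta_k^2 < +inf and sum beta_k = +inf, i.e. condition (13))
    and eta_k = max(1, ||u^k||)."""
    beta = 1.0 / (k + 1)
    eta = max(1.0, np.linalg.norm(u))
    return beta / eta

# Example: the stepsize shrinks like 1/(k+1), further damped by large subgradients.
u = np.array([3.0, -4.0])                             # ||u|| = 5
print([exogenous_stepsize(k, u) for k in range(5)])   # [0.2, 0.1, 0.0667, 0.05, 0.04]
```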

We begin with a useful consequence of Lemma 2.1.

Corollary 2.5 Let $x\in\mathrm{dom}(g)$. Then, for all $k\in\mathbb{N}$,
$$\|x^{k+1}-x\|^2\le\|x^k-x\|^2+\frac{2\beta_k}{\eta_k}\big[(f+g)(x)-(f+g)(x^k)\big]+\big(1+2\rho+\rho^2\big)\beta_k^2,$$
where $\rho\ge0$ is as defined in Assumption A2.

Proof The result follows by noting that $\eta_k\ge\|u^k\|$ and $\eta_k\ge1$ for all $k\in\mathbb{N}$, and choosing $w^k\in\partial g(x^k)$ such that $\|w^k\|\le\rho$ for all $k\in\mathbb{N}$, in view of Assumption A2. Then,
$$\frac{\|u^k+w^k\|^2}{\eta_k^2}\le\frac{\|u^k\|^2}{\eta_k^2}+\frac{2\|u^k\|\,\|w^k\|}{\eta_k^2}+\frac{\|w^k\|^2}{\eta_k^2}\le1+2\rho+\rho^2.$$
Now, Lemma 2.1 implies the desired result.

Now we define the auxiliary set
$$S_{\mathrm{lev}}(x^0):=\big\{x\in\mathrm{dom}(g)\ :\ (f+g)(x)\le(f+g)(x^k),\ \forall k\in\mathbb{N}\big\}. \qquad (14)$$
When the solution set of problem (1) is nonempty, $S_{\mathrm{lev}}(x^0)\ne\emptyset$ because $S\subseteq S_{\mathrm{lev}}(x^0)$.

Next, we prove the two main results of this subsection.

Theorem 2.6 Let $(x^k)_{k\in\mathbb{N}}$ be the sequence generated by PSS Method with exogenous stepsizes. If there exists $\bar x\in S_{\mathrm{lev}}(x^0)$, then:

(a) The sequence $(x^k)_{k\in\mathbb{N}}$ is quasi-Fejér convergent to $L_{f+g}(\bar x):=\{x\in\mathrm{dom}(g):(f+g)(x)\le(f+g)(\bar x)\}$.

(b) $\lim_{k\to\infty}(f+g)(x^k)=(f+g)(\bar x)$.

(c) The sequence $(x^k)_{k\in\mathbb{N}}$ is weakly convergent to some $\tilde x\in L_{f+g}(\bar x)$.

Proof By assumption there exists $\bar x\in S_{\mathrm{lev}}(x^0)$, i.e., $(f+g)(\bar x)\le(f+g)(x^k)$ for all $k\in\mathbb{N}$.

(a) To show that $(x^k)_{k\in\mathbb{N}}$ is quasi-Fejér convergent to $L_{f+g}(\bar x)$ (which is nonempty because $\bar x\in L_{f+g}(\bar x)$), we use Corollary 2.5, for any $x\in L_{f+g}(\bar x)\subseteq\mathrm{dom}(g)$, establishing that $\|x^{k+1}-x\|^2\le\|x^k-x\|^2+(1+2\rho+\rho^2)\beta_k^2$ for all $k\in\mathbb{N}$. Thus, $(x^k)_{k\in\mathbb{N}}$ is quasi-Fejér convergent to $L_{f+g}(\bar x)$.

(b) The sequence $(x^k)_{k\in\mathbb{N}}$ is bounded by Fact 1.1(a), and hence it has accumulation points in the sense of the weak topology. To prove that
$$\lim_{k\to\infty}(f+g)(x^k)=(f+g)(\bar x), \qquad (15)$$
we use Corollary 2.5, with $x=\bar x\in L_{f+g}(\bar x)\subseteq\mathrm{dom}(g)$, to get
$$\frac{\beta_k}{\eta_k}\big[(f+g)(x^k)-(f+g)(\bar x)\big]\le\frac12\big(\|x^k-\bar x\|^2-\|x^{k+1}-\bar x\|^2\big)+\frac12\big(1+2\rho+\rho^2\big)\beta_k^2.$$
Summing the above inequality from $k=0$ to $m$, we have
$$\sum_{k=0}^{m}\frac{\beta_k}{\eta_k}\big[(f+g)(x^k)-(f+g)(\bar x)\big]\le\frac12\big(\|x^0-\bar x\|^2-\|x^{m+1}-\bar x\|^2\big)+\frac12\big(1+2\rho+\rho^2\big)\sum_{k=0}^{m}\beta_k^2,$$
and taking the limit as $m$ goes to $\infty$, using (13), the left-hand series converges. Moreover, since $(x^k)_{k\in\mathbb{N}}$ is bounded, Assumption A1 provides $\zeta>0$ such that $\|u^k\|\le\zeta$ for all $k\in\mathbb{N}$, so that $\eta_k\le\max\{1,\zeta\}$ and hence
$$\sum_{k=0}^{\infty}\beta_k\big[(f+g)(x^k)-(f+g)(\bar x)\big]\le\max\{1,\zeta\}\sum_{k=0}^{\infty}\frac{\beta_k}{\eta_k}\big[(f+g)(x^k)-(f+g)(\bar x)\big]<+\infty. \qquad (16)$$
Then, (16) together with (13) implies that there exists a subsequence $\big((f+g)(x^{i_k})\big)_{k\in\mathbb{N}}$ of $\big((f+g)(x^k)\big)_{k\in\mathbb{N}}$ such that
$$\lim_{k\to\infty}\big[(f+g)(x^{i_k})-(f+g)(\bar x)\big]=0. \qquad (17)$$
Indeed, if (17) does not hold, then there exist $\sigma>0$ and $\tilde k$ such that $(f+g)(x^k)-(f+g)(\bar x)\ge\sigma$ for all $k\ge\tilde k$, and using (16), we get
$$+\infty>\sum_{k=\tilde k}^{\infty}\beta_k\big[(f+g)(x^k)-(f+g)(\bar x)\big]\ge\sigma\sum_{k=\tilde k}^{\infty}\beta_k,$$
in contradiction with (13). Next, define $\varphi_k:=(f+g)(x^k)-(f+g)(\bar x)$, which is nonnegative for all $k$ because $\bar x\in S_{\mathrm{lev}}(x^0)$. Then, for any $u^k\in\partial f(x^k)$ and $w^k\in\partial g(x^k)$, we get
$$\varphi_k-\varphi_{k+1}=(f+g)(x^k)-(f+g)(x^{k+1})\le\langle u^k+w^k,\,x^k-x^{k+1}\rangle\le\|u^k+w^k\|\,\|x^k-x^{k+1}\|\le(\zeta+\rho)\|x^k-x^{k+1}\|, \qquad (18)$$
where $\zeta>0$ is such that $\|u^k\|\le\zeta$ for all $k\in\mathbb{N}$ ($\zeta$ exists in virtue of the boundedness of $(x^k)_{k\in\mathbb{N}}$ and Assumption A1) and $\|w^k\|\le\rho$ for all $k\in\mathbb{N}$ ($\rho$ exists because $w^k\in\partial g(x^k)$ is arbitrary, in view of Assumption A2).

Using Corollary 2.5, with $x=x^k$, we have $\|x^k-x^{k+1}\|\le\sqrt{1+2\rho+\rho^2}\,\beta_k$, which together with (18) implies that
$$\varphi_k-\varphi_{k+1}\le\sqrt{1+2\rho+\rho^2}\,(\zeta+\rho)\,\beta_k=:\bar\rho\,\beta_k, \qquad (19)$$
for all $k\in\mathbb{N}$. From (17), there exists a subsequence $(\varphi_{i_k})_{k\in\mathbb{N}}$ of $(\varphi_k)_{k\in\mathbb{N}}$ such that $\lim_{k\to\infty}\varphi_{i_k}=0$. If the claim given in (15) does not hold, then there exist some $\delta>0$ and a subsequence $(\varphi_{\ell_k})_{k\in\mathbb{N}}$ of $(\varphi_k)_{k\in\mathbb{N}}$ such that $\varphi_{\ell_k}\ge\delta$ for all $k\in\mathbb{N}$. Thus, we can construct a third subsequence $(\varphi_{j_k})_{k\in\mathbb{N}}$ of $(\varphi_k)_{k\in\mathbb{N}}$, where the indices $j_k$ are chosen in the following way:
$$j_0:=\min\{m\ge0\ |\ \varphi_m\ge\delta\},\qquad j_{2k+1}:=\min\{m\ge j_{2k}\ |\ \varphi_m\le\delta/2\},\qquad j_{2k+2}:=\min\{m\ge j_{2k+1}\ |\ \varphi_m\ge\delta\},$$
for each $k$. The existence of the subsequences $(\varphi_{i_k})_{k\in\mathbb{N}}$ and $(\varphi_{\ell_k})_{k\in\mathbb{N}}$ of $(\varphi_k)_{k\in\mathbb{N}}$ guarantees that the subsequence $(\varphi_{j_k})_{k\in\mathbb{N}}$ of $(\varphi_k)_{k\in\mathbb{N}}$ is well-defined for all $k\ge0$. It follows from the definition of $j_k$ that
$$\varphi_m\ge\frac{\delta}{2}\ \ \text{for}\ \ j_{2k}\le m\le j_{2k+1}-1\qquad\text{and}\qquad\varphi_m<\delta\ \ \text{for}\ \ j_{2k+1}\le m\le j_{2k+2}-1, \qquad (20)$$
for all $k$, and hence
$$\varphi_{j_{2k}}-\varphi_{j_{2k+1}}\ge\frac{\delta}{2}, \qquad (21)$$
for all $k\in\mathbb{N}$. In view of (16), and recalling that $\varphi_k=(f+g)(x^k)-(f+g)(\bar x)\ge0$ for all $k\in\mathbb{N}$,
$$
\begin{aligned}
+\infty>\sum_{k=0}^{\infty}\beta_k\varphi_k
&\ge\sum_{k=0}^{\infty}\ \sum_{m=j_{2k}}^{j_{2k+1}-1}\beta_m\varphi_m
\ \ge\ \frac{\delta}{2}\sum_{k=0}^{\infty}\ \sum_{m=j_{2k}}^{j_{2k+1}-1}\beta_m
\ =\ \frac{\delta}{2\bar\rho}\sum_{k=0}^{\infty}\ \sum_{m=j_{2k}}^{j_{2k+1}-1}\bar\rho\,\beta_m\\
&\ge\frac{\delta}{2\bar\rho}\sum_{k=0}^{\infty}\ \sum_{m=j_{2k}}^{j_{2k+1}-1}\big(\varphi_m-\varphi_{m+1}\big)
\ =\ \frac{\delta}{2\bar\rho}\sum_{k=0}^{\infty}\big(\varphi_{j_{2k}}-\varphi_{j_{2k+1}}\big)
\ \ge\ \frac{\delta}{2\bar\rho}\sum_{k=0}^{\infty}\frac{\delta}{2}=+\infty,
\end{aligned}
$$
where we have used (20) in the second inequality, (19) in the third inequality and (21) in the last one, a contradiction. Thus, $\lim_{k\to\infty}(f+g)(x^k)=(f+g)(\bar x)$, establishing (b).

(c) Let $\tilde x$ be a weak accumulation point of $(x^k)_{k\in\mathbb{N}}$, and note that $\tilde x$ exists by Item (a) and Fact 1.1(a). From now on, we use $(x^{i_k})_{k\in\mathbb{N}}$ to denote any subsequence of $(x^k)_{k\in\mathbb{N}}$ that converges weakly to $\tilde x$. Since $f+g$ is weakly lower semicontinuous, using (15), we get
$$(f+g)(\tilde x)\le\liminf_{k\to\infty}(f+g)(x^{i_k})=\lim_{k\to\infty}(f+g)(x^k)=(f+g)(\bar x),$$
implying that $(f+g)(\tilde x)\le(f+g)(\bar x)$ and thus $\tilde x\in L_{f+g}(\bar x)$. As a consequence, all weak accumulation points of $(x^k)_{k\in\mathbb{N}}$ belong to $L_{f+g}(\bar x)$ and, since $(x^k)_{k\in\mathbb{N}}$ is quasi-Fejér convergent to $L_{f+g}(\bar x)$, we get from Fact 1.1(b) that $(x^k)_{k\in\mathbb{N}}$ converges weakly to $\tilde x\in L_{f+g}(\bar x)$.

Theorem 2.7 Let $(x^k)_{k\in\mathbb{N}}$ be the sequence generated by PSS Method with exogenous stepsizes. Then,


(a) $\liminf_{k\to\infty}(f+g)(x^k)=\inf_{x\in\mathcal{H}}(f+g)(x)=s$ (possibly $s=-\infty$).

(b) If $S\ne\emptyset$, then $\lim_{k\to\infty}(f+g)(x^k)=\min_{x\in\mathcal{H}}(f+g)(x)$ and $(x^k)_{k\in\mathbb{N}}$ converges weakly to some $\bar x\in S$.

(c) If $S=\emptyset$, then $(x^k)_{k\in\mathbb{N}}$ is unbounded.

Proof (a) Since $(x^k)_{k\in\mathbb{N}}\subset\mathrm{dom}(g)$, we get $s\le\liminf_{k\to\infty}(f+g)(x^k)$. Suppose that $s<\liminf_{k\to\infty}(f+g)(x^k)$. Hence, there exists $\hat x$ such that
$$(f+g)(\hat x)<\liminf_{k\to\infty}(f+g)(x^k). \qquad (22)$$
It follows from (22) that there exists $\bar k\in\mathbb{N}$ such that $(f+g)(\hat x)\le(f+g)(x^k)$ for all $k\ge\bar k$. Since $\bar k$ is finite, we can assume without loss of generality that $(f+g)(\hat x)\le(f+g)(x^k)$ for all $k\in\mathbb{N}$. Using the definition of $S_{\mathrm{lev}}(x^0)$, given in (14), we have that $\hat x\in S_{\mathrm{lev}}(x^0)$. By Theorem 2.6(b), $\lim_{k\to\infty}(f+g)(x^k)=(f+g)(\hat x)$, in contradiction with (22).

(b) Since $S\ne\emptyset$, take $x^*\in S$ and note that this implies $L_{f+g}(x^*)=S$. Since $(x^k)_{k\in\mathbb{N}}\subset\mathrm{dom}(g)$, we get $(f+g)(x^*)\le(f+g)(x^k)$ for all $k\in\mathbb{N}$, implying that $x^*\in S_{\mathrm{lev}}(x^0)$. By applying Items (b) and (c) of Theorem 2.6, with $\bar x=x^*$, we get that $\lim_{k\to\infty}(f+g)(x^k)=(f+g)(x^*)$ and that $(x^k)_{k\in\mathbb{N}}$ converges weakly to some $\tilde x\in S$, respectively.

(c) Assume that $S$ is empty but the sequence $(x^k)_{k\in\mathbb{N}}$ is bounded. Let $(x^{\ell_k})_{k\in\mathbb{N}}$ be a subsequence of $(x^k)_{k\in\mathbb{N}}$ such that $\lim_{k\to\infty}(f+g)(x^{\ell_k})=\liminf_{k\to\infty}(f+g)(x^k)$. Since $(x^{\ell_k})_{k\in\mathbb{N}}$ is bounded, without loss of generality (i.e., refining $(x^{\ell_k})_{k\in\mathbb{N}}$ if necessary), we may assume that $(x^{\ell_k})_{k\in\mathbb{N}}$ converges weakly to some $\bar x\in\mathrm{dom}(g)$. By the weak lower semicontinuity of $f+g$ on $\mathrm{dom}(g)$,
$$(f+g)(\bar x)\le\liminf_{k\to\infty}(f+g)(x^{\ell_k})=\lim_{k\to\infty}(f+g)(x^{\ell_k})=\liminf_{k\to\infty}(f+g)(x^k)=s, \qquad (23)$$
using Item (a) in the last equality. By (23), $\bar x\in S$, in contradiction with the hypothesis, and the result follows.

For exogenous stepsizes, Theorem 2.7(a) guarantees the convergence of $\big((f+g)(x^k)\big)_{k\in\mathbb{N}}$ to the optimal value of problem (1), in the sense that $\liminf_{k\to\infty}(f+g)(x^k)=s$, implying the convergence of $\big((f+g)^k_{\mathrm{best}}\big)_{k\in\mathbb{N}}$, defined in (9), to $s$. It is important to mention that, in the proofs of the above two crucial results, we have used a similar idea to the one recently presented in [7] for a different instance.

In the following we present a direct consequence of Lemmas 2.2 and 2.3, when the stepsizes satisfy (13).

Corollary 2.8 Let $(\bar x^k)_{k\in\mathbb{N}}$ be the ergodic sequence defined by (11) and let $(\beta_k)_{k\in\mathbb{N}}$ be as in (13). If $S\ne\emptyset$, then, for all $k\in\mathbb{N}$,
$$(f+g)^k_{\mathrm{best}}-\min_{x\in\mathcal{H}}(f+g)(x)\le\frac{\zeta\,[\mathrm{dist}(x^0,S)]^2+(1+2\rho+\rho^2)\sum_{i=0}^{k}\beta_i^2}{2\sum_{i=0}^{k}\beta_i}.$$
