• Nenhum resultado encontrado

Variational studies and replica symmetry breaking in the generalization problem of the binary perceptron

N/A
N/A
Protected

Academic year: 2021

Share "Variational studies and replica symmetry breaking in the generalization problem of the binary perceptron"

Copied!
9
0
0

Texto

(1)

Variational studies and replica symmetry breaking in the generalization problem

of the binary perceptron

Evaldo Botelho,1Cristiano R. Mattos,1,2and Nestor Caticha1

1Instituto de Fı´sica da Universidade de Sa˜o Paulo, Caixa Postal 66318, Sa˜o Paulo, SP 05315-970, Brazil 2Faculdade de Engenharia, Universidade Estadual Paulista, Caixa Postal 205, Guaratingueta´, SP 12500-000, Brazil

共Received 29 May 2000兲

We analyze the average performance of a general class of learning algorithms for the nondeterministic polynomial time complete problem of rule extraction by a binary perceptron. The examples are generated by a rule implemented by a teacher network of similar architecture. A variational approach is used in trying to identify the potential energy that leads to the largest generalization in the thermodynamic limit. We restrict our search to algorithms that always satisfy the binary constraints. A replica symmetric ansatz leads to a learning algorithm which presents a phase transition in violation of an information theoretical bound. Stability analysis shows that this is due to a failure of the replica symmetric ansatz and the first step of replica symmetry breaking共RSB兲 is studied. The variational method does not determine a unique potential but it allows con-struction of a class with a unique minimum within each first order valley. Members of this class improve on the performance of Gibbs algorithm but fail to reach the Bayesian limit in the low generalization phase. They even fail to reach the performance of the best binary, an optimal clipping of the barycenter of version space. We find a trade-off between a good low performance and early onset of perfect generalization. Although the RSB may be locally stable we discuss the possibility that it fails to be the correct saddle point globally.

PACS number共s兲: 87.10.⫹e, 84.35.⫹i, 89.70.⫹c, 05.50.⫹q

I. INTRODUCTION

One topic of interest in the study of neural networks with the tools of statistical mechanics共SM兲 is that the process of information extraction from data can be modeled by a dy-namical process of minimization of an energy function in the presence of noise. The use of techniques borrowed from the study of equilibrium disordered systems关1兴, such as, for ex-ample, replica methods, Thouless, Anderson, and Palmer equations, cavity analysis, and Monte Carlo simulations, as well as dynamical analysis techniques, has permitted a con-siderable understanding of typical properties of the thermo-dynamic limit共TL兲 of such systems as attractor and feedfor-ward neural networks. In this paper we are interested in the equilibrium properties of the student-teacher problem关2–4兴 of rule extraction by a binary Boolean perceptron. This means that the weights as well as the output are restricted to be⫾1. We look at the generalization ability for the case of a realizable rule represented by a teacher of the same archi-tecture as the student and in particular we concentrate on the determination of thermodynamic limit bounds and how to get them by the construction of an appropriate potential.

The binary perceptron has been attacked from many fronts. Studies of the computational complexity by Pitt and Valiant have shown that it belongs to the NP 共nondetermin-istic polynomial time兲 complete class 关5兴. From a SM point of view it has been studied before by Gardner and Derrida 关6兴 and Gyo¨rgyi 关7兴. Their aim was to study the learning curve 共generalization error兲 as a function of the number of examples where the training energy was chosen in the sim-plest way, i.e., error counting. For independent examples drawn from a uniform distribution, the main characteristic of this system in the TL is the first order transition to perfect generalization at ␣c⯝1.25, where as usual ␣⫽P/N

mea-sures the number of examples P in units of the number of inputs N. This shows the power of statistical mechanics methods, since the transition cannot be detected by, e.g., Vapnik-Chervonenkis analysis, which aims at a general type of result, such as those independent of the distribution of examples, and can therefore miss important features that ap-pear only for particular but important cases. For a specific comment on this, see 关8兴

A relevant question, complementary in spirit to examina-tion of the thermal equilibrium properties, delves into the dynamical aspects of actually determining the set of weights that minimize the training error. The studies of Horner 关9兴 have shown that times exponential in N are required for learning. This is in agreement with the expectation that, since this is a computationally hard problem, it should have spin-glass properties. Polynomial time algorithms, such as simu-lated annealing of the error-counting energy, with a linear decrease of the temperature schedule, fail to reach global solutions.

Different approximations have been devised in order to overcome the problems associated with glassy dynamics. Based on the fact that learning the equivalent real-weights problem is not as slow, other studies have dealt with continu-ous approximations to the binary problem. Several groups looked into clipping strategies关10–13兴 and other transforma-tions of the continuous perceptron, in particular one that op-timally incorporates the information that the teacher weight vector points in the direction of a vertex of the unit N-dimensional hypercube关14兴. Penney and Sherrington 关15兴 have looked into how to reduce the effective dimension of the problem by clipping a partial set of real couplings and posterior learning of the remaining binary weights. Copelli et al.关16–18兴 have looked at a related problem in the unsu-pervised learning scenario. They employed their methods for the teacher-student case also. By extending an argument of PRE 62

(2)

Watkin for the analogous problem with continuous weights, they looked into the generalization properties of the center of mass of the version space. Its performance should saturate the Bayes limit, but it is not itself a vertex on the hypercube. They also considered the Bayesian best binary 共BB兲 vector by clipping the real component Bayes vector for both super-vised and unsupersuper-vised learning.

On-line learning—a possibly efficient strategy to over-come slow dynamics since it makes no attempt to thermalize—has been shown to be ineffective if working in the binary coupling space directly 关19兴. Nevertheless, progress has been made in the direction of on-line learning. Solla and Winther 关20兴 have shown how to incorporate, in the on-line Bayesian spirit suggested by Opper 关21兴, infor-mation about the binary nature of the couplings. Their method leads to fast learning times but is not able to con-dense on the exact solution in finite times.

In this work we extend the work of Kinouchi and Caticha 关22兴 for the off-line variational determination of a training energy. We look into the extension of the variational ap-proach to the determination of training potentials, optimal—in the generalization sense—for the binary percep-tron. A difference from the approach of Copelli et al. 关16– 18兴 is that our construction methods never leave the binary space. Although this in the end might turn out not to be a good strategy, it is theoretically interesting to see how far one can go on such a discrete problem without leaving the allowed space. In the case of random examples,关13兴 presents a variational calculation for potentials that, using the maxi-mum stability perceptron as a teacher, aims at determining the continuous vector that on clipping determines the largest number of weights of the maximum stability binary percep-tron.

The extension of the variational program to the binary case is not at all straightforward due to the failure of the replica symmetry共RS兲 ansatz. We studied first a replica sym-metric variational calculation. The transition to perfect learn-ing occurs at a much smaller value of␣c(⯝0.5). This would have been great news if not for the fact that a simple infor-mation theoretical bound for perfect generalization is ␣c ⭓1 关23兴. Each example cannot carry more than just one bit of information and at least N examples are needed to deter-mine the N independent bits encoded in the weight vector. The stability analysis showed the surprising failure of the RS ansatz from the very start at ␣⫽0. The next step we report here is to perform a variational calculation at the level of a one-step replica symmetry breaking共RSB-1兲. The variational method does not determine a unique potential, but only fixes expected values of the modulation function. We study a re-stricted set of solutions, which form a one-parameter family. Our choice is based on physical similarities of this family to potentials already studied. This is interesting in itself as it indicates the possibility that potentials other than the type we study may be relevant. The phase diagrams are studied and discussed. The learning curves associated with this family show a trade-off between better generalization and earlier transitions to the perfect generalization phase.

This paper is organized as follows. In Sec. II A we present the general replica calculation results. In Secs. II B and II C the variational procedures under RS and RSB-1, respec-tively, are presented. Section III discusses the analysis of the

RSB-1 free energy. This requires the determination of the effective potential, which is explicitly calculated in the Ap-pendix. Section IV discusses results and presents conclu-sions.

II. OFF-LINE VARIATIONAL METHOD A. General results

The statistical mechanics approach to determining the generalization ability of a network learning from examples a rule that itself is implemented by another network has been studied in several cases; for reviews, see 关2–4兴. Call the N-dimensional binary weight vectors of the teacher and stu-dent networks B and J, respectively.

Off-line learning describes the infinite time limit of a learning algorithm that proceeds by minimization, in the presence of thermal noise, of an energy function 兺␮⫽1P V(J•S␮,␴␮) that depends on quenched data repre-sented by the P input-output example pairs (S,␴). The free energy for the replicated system presented with P ex-amples, independently chosen from a uniform distribution, is given by ⫺␤f⫽extrqab,qˆab,Ra,Rˆa关Goqab,qˆab,Ra,Rˆa其 ⫺␣Grqab,Ra其兴, 共1兲 where Go⫽ lim n→0 1 n

a

⬍b qabqˆab

a RaRˆaN1

j⫽1 N ln

a d共Jaj兲 ⫻exp

a RˆaBjJj a1 2 a

⫽b qˆabJj a Jbj

, 共2兲 Gr⫽ lim n→0 1 nln

Dt2

a dad␭ˆa 2␲

⫻exp

⫺␤

a V共␭a,t2兲

⫻exp

i

a ␭ˆa共␭a⫺Ra t2兲⫺1 2

a,b 共qab⫺Ra Rb兲␭ˆa␭ˆb

. 共3兲 The interpretation of R and q at the extreme justifies the choice of method to study this problem. This is so since R

J•B/N

, where the average is both thermal and over the disorder, is directly related to the generalization error eG in the case of equivalent replicas. For the uniform distribution, eG⫽arccos(R)/␲. The usual order parameters qab

Ja•Jb/N

describe the typical overlap between two stu-dents learning from the same data. The weight measure d(Jja) defines the particular problem. In this case d(Jaj) is the counting measure over Jaj⫽⫾1.

(3)

Two related questions are discussed in this paper: What is the largest value that the overlap R can reach for fixed ␣? That is, what are the upper bounds for the generalization ability? Note that this is different in spirit from a Bayesian approach since here we ask for best performances within the constraints of a given architecture. This immediately leads to the second question: What energy function V should be cho-sen in order to achieve those bounds? The same questions have been answered satisfactorily for this machine with real weights; the first by Opper and Haussler关24兴 and the second by Kinouchi and Caticha关22兴. In trying to answer them we have to go beyond the permutation replica symmetry that holds in that case, which we now analyze.

B. Replica symmetric analysis

It is reasonable to begin by assuming that the solution to the extremization problem is replica symmetry. The zero temperature limit can be taken by noticing that an optimal potential will most likely have a unique minimum and there-fore it is natural to take the q→1 limit 关25兴. A sensible limit can be obtained by requiring that x⬅␤(1⫺q), y⬅qˆ/␤2, and

⬅rˆ/␤ remain finite as ␤→⬁. The free energy is then the extreme of fRS⫽ 1 2xy⫹兵R⫺关1⫺2H共␩兲兴其␻⫺

2 ␲y e⫺␩ 2/2 ⫹2␣

DtH

⫺tR

1⫺R2

V共␭o兲⫹ 共␭o⫺t兲2 2x

, where␩⫽␻/

y ,Dt⫽exp(⫺t2/2)dt/

2␲, H(x)⫽兰xDt, ando(t) is defined as the minimum of V(␭)⫹(␭⫺t)2/2x:

⫺x

V共␭兲

o

⫽␭o⫺t. 共4兲

The saddle point equations lead to

R⫽1⫺2H共兲,x⫽

2 ␲ye ⫺␩2/2 , e⫺␩2⫽␲␣

DtH

⫺tR

1⫺R2

共␭o⫺t兲 2,e⫺␩2/2⫽␣

dt

2␲共1⫺R2兲e ⫺t2/2(1⫺R2)o. 共5兲

Following关22兴, we obtain the result that, in order to maxi-mize R, the learning potential should be chosen such that the RS modulation function FRS(t)⬅␭o⫺t is given by

FRS共t兲⫽C

e⫺共Rt)2/2(1⫺R2)

H共⫺Rt/

1⫺R2兲, 共6兲 which can be seen to be very similar to the optimal modula-tion funcmodula-tion for on-line learning in the real-weights percep-tron, a most interesting result from a cavity perspective 关26,27兴. It is easy to see that

Ce

⫺␩2/2

␲␩

1⫺R2 共7兲

is a function of R but not of t. From Eq.共6兲 we can determine the potential V, since FRS⫽⫺xV/⳵␭o. Thus

V共␭兲⫽E共t兲⫺ 1 2xFRS

2 共t兲, 共8兲

where E(t)⫽⫺关

2␲(1⫺R2)C/Rx兴 ln H(⫺Rt/

1⫺R2) is proportional to the on-line optimal energy function. It re-sembles the equivalent off-line potential for the real-weights perceptron. However, the post-training stability may not be all positive, i.e., ␭o can be negative. This means that this algorithm would be called inconsistent since after training there still may be memorization errors. The learning curves 共see Fig. 2 below兲 are obtained by solving the saddle point equations numerically. By comparing the free energy of the phase with incomplete learning R⬍1 with that for R⫽1, a first order phase transition at ␣c⫽0.53 is found. As men-tioned earlier, this is unacceptable and the cause of this un-physical result is the replica symmetric ansatz. A stability analysis 关28兴 confirms this suspicion and we now turn to a one-step replica symmetry breaking ansatz

C. RSB-1 variational method

We proceed in the by now standard way关1兴. Let qabbe an l⫻l block matrix and call the block indices ␮,␯ ⫽1,2, . . . ,l⫽n/m. For replicas within the same block, the overlaps are taken to have the same overlap value q1, while replicas belonging to different blocks have overlap qo. The replicas are assumed to be equivalent and thus Ra⫽␳, for all replicas. We do not use the same letter R to make explicit that it will have a different value from the previous section. The conjugated parameters共with carets兲 are assumed to have the same structure.

The entropic and energetic contributions to the free en-ergy are Go⫽ 1 2关mqoqˆo⫹共1⫺m兲q1qˆ1⫺qˆ1兴⫺␳␳ˆ⫹ 1 m

Dzoln ⫻

Dz1关2 cosh共

qˆozo

1⫺qˆoz1⫹␳ˆ兲兴m, 共9兲 Gr⫽⫺ 2 m

⫺⬁Dt1

0 ⬁ Dt2ln

⫺⬁ ⬁ Dto

d

2␲共1⫺q1兲 ⫻exp

⫺␤

V共␭兲⫹共␭⫺t兲 2 2x1

冊册

m , 共10兲 where t

qo⫺␳2t1⫹␳t2⫹

q1⫺qoto.

We are interested in the zero temperature case. A sensible limit can be found by requiring that within a valley the op-timal potential has a unique minimum and that different val-leys’ minima have the same overlap. To achieve this we

(4)

make the ansatz ␤→⬁,m→0,q1→1, and qˆo and Rˆ→⬁, such that xo⬅␤m, x1⬅␤(1⫺q1),Qˆo⬅m2qˆo, and Rˆ⬅mˆ remain finite.

To take the limit关29,30兴 we start with the entropic part

Go⫽ 1 m

共1⫺qo兲Qˆo 2 ⫺␳

Dzoln 2 cosh共

Qˆozo⫹Rˆ兲

共11兲 and the energy part

Gr⫽⫺ 2 m

⫺⬁Dt1

0 ⬁ Dt2ln

⫺⬁ ⬁ Dtoe⫺␤m[V(␭o)⫹(␭o⫺t) 2/2x 1] 共12兲 where the ␭ integral is done by a saddle method and ␭o is such that ⳵V/⳵␭⫹共␭⫺t兲 x1 ⫽0 for ␭⫽␭o. 共13兲 It follows that ⫺xof⫽⫺ 共1⫺qo兲Qˆo 2 ⫺␳

Dzoln 2 cosh共

Qˆozo⫹Rˆ兲 ⫹2␣

⫺⬁ ⬁ Dt1

0 ⬁ Dt2ln

⫺⬁ ⬁ Dto ⫻e⫺xo[V(o)⫹(␭o⫺t)2/2x1]. 共14兲

The order parameters of the incomplete generalization phase are obtained by solving the saddle point equations:

f ⫽0⇒␳⫽

Dzotanh共

Q ˆ0zo⫹Rˆ兲⬅1共Qˆo,Rˆ兲, 共15兲 ⳵f0 ⫽0⇒qo

Dzotanh2共

0zo⫹Rˆ兲⬅␾2共Qˆo,Rˆ兲, 共16兲 ⳵f ⳵␳⫽0⇒Rˆ⫽2xo x1

Du e ⫺u2/2

2␲共1⫺␥2

共␭o⫺t兲

to, 共17兲 ⳵fqo⫽0⇒Qˆo⫽2␣ xo x1 2

Du H

u

o⫺t

to 2 , 共18兲 where␥⬅␳/

qo, ⌫⬅(1⫺␥2)/␥2, and

共•••兲

to

Dtoe⫺xo[V(o)⫹(␭o⫺t) 2/2x 1]共•••兲

Dtoe⫺xo[V(o)⫹(␭o⫺t) 2/2x 1] . 共19兲

The orthogonal transformation (t1,t2)→(u,v) defined by u ⫽(

qo⫺␳2t1⫹␳t2)/

qo, v⫽(⫺␳t1⫹

qo⫺␳2t2)/

qo, and t

qou

1⫺qoto was used to simplify Eq.共18兲.

Introducing an effective modulation function

F共u兲⫽

o⫺t

to, 共20兲 the saddle point equations can be written as

⫽2␣xo x1

Du e ⫺u2/2

2␲共1⫺␥2F共u兲, Qˆo⫽2␣

xo x1

2

Du H

u

F 2共u兲.

We can now determine the F(u) that maximizes, for a fixed ␣, the overlap ␳. This is somewhat harder than in the real-weights case, where the solution to the variational prob-lem is obtained very easily, just by inspection plus knowl-edge of the equivalent solution for the on-line problem. Nev-ertheless, the solution is very similar. So we look at the variations of Eq. 共15兲 with respect to F(u) subject to the constraint 共16兲. Let ␰ be a Lagrange multiplier 共see also 关31兴兲. F(u) can be determined by maximizing

⌽⫽␾1共Qˆo关qo,␳,F兴,Rˆ关qo,␳,F兴兲 ⫺␰␾2共Qˆo关qo,␳,F兴,Rˆ关qo,␳,F兴兲, 共21兲 which yields ␦QˆoF ⫽⫺ 共⳵␾1/⳵⫺␰⳵␾2/⳵兲 共⳵␾1/⳵Qˆo⫺␰⳵␾2/⳵Qˆo兲 ␦F⬅kF, 共22兲 and by using the explicit form of Rˆ and Qˆo and the defini-tions 共15兲 and 共16兲:

Du H

u

F共u兲⫽ x1 xo k 2

Du e⫺u2/2⌫

2␲共1⫺␥2兲. 共23兲 This does not determine F(u) uniquely. A possible solution is F共u兲⬅

o⫺t

t o⫽ 1 2 x1 xo k

2␲共1⫺␥2兲 e⫺u2/2⌫ H共⫺u/

⌫兲. 共24兲 Any odd function of u could be added to the Gaussian in the numerator, but we choose to look at this form since as it stands it is proportional to the optimal modulation function that appears in off-line and on-line optimization of the real-weights perceptron and we see no physical motivation for other terms. This choice 共24兲 leads to the saddle point equa-tions

2 ␲␣␬Io共␥兲, 共25兲 Qˆo⫽2␣␬2

1⫺␥2Io共␥兲, 共26兲 where ␬⬅1 2k/

2␲(1⫺␥ 2), I o(␥)⬅兰Dv g(⫺v),␥ ⬅␳/

qo, and g(x)⬅e⫺x 2/2

(5)

we do so in order to pick the largest possible ␳ for fixed␣. 共Fig. 1 shows ␣ as a function of kvar, the value of k that leads to the smallest generalization error.兲 Note that this lib-erty is due to the fact that the value of qois not constrained; only its form is constrained by Eq.共16兲. The learning curves of the low generalization phase can be determined numeri-cally 关see Figs. 2共a兲 and 2共b兲兴. In Fig. 3 the overlap qo is shown as a function of␣ for the best choice of the potential, i.e., using kvar. The determination of the thermodynamic phase, that is, the location of the first order transition, needs analysis of the free energy in both high and low generaliza-tion phases. While the free energy of the low generalizageneraliza-tion phase can be determined without explicit knowledge of the potential, that of the high generalization phase will require detailed knowledge of the potential. In the next section we show how this can be done.

III. THE FREE ENERGY AND THE POTENTIAL A. The low generalization phase

To complete the analysis of the learning curve we must look into the behavior of the free energy. This is not straight-forward since the form of the potential has not yet been determined, but only the expected value ⳵V/⳵␭. It is quite interesting, as we now show, that the precise form of the potential is not needed to determine the free energy or the learning curve in this phase; just knowledge of the effective modulation function suffices, and even this is the same for any solution of Eq.共23兲.

The energy contribution to the free energy can be written as ⫺xof1⫽2␣

DuH

⫺u

ln

Dto ⫻exp

⫺xo

V共␭o兲⫹ 共␭o⫺t兲2 2x1

冊册

,

provided the order parameters are understood as the solution of the saddle point equations. Integrating by parts, using the definitions of␭o and F(u),

⫺xof1⫽2␣ xo x1

qo

⫺⬁ ⬁ duF共u兲

⫺⬁ u D y H

⫺y

, substituting Eq.共24兲 for F(u), and integrating by parts again,

FIG. 1. The value of ␣ as a function of hyperparameter k at which the low generalization phase performance is optimal共thick line兲. For each fixed k, we show also the value ofs, the spinodal

point共short-long dash兲, where the low generalization phase ceases to exist, and of ␣c, the first order transition point 共long dash兲,

above which the perfect generalization phase has the lowest free energy.

FIG. 2. 共a兲 Learning curves. Generalization errors as functions of ␣ for different algorithms. The Bayes 共continuous line兲 and Gibbs 共top short dashed curve兲 algorithm learning curves are in-cluded for comparison. Note that the Bayes algorithm is obtained by a student outside the binary space. BB is the best binary vector 关18兴 obtained as the result of clipping the center of mass of the version space共Bayes兲. The learning curve 共dots兲 obtained under the hypothesis of replica symmetry has a transition to perfect learning at the impossible value of ␣⫽0.53. The algorithm obtained with k⫽2 gives the Gibbs result. For k⫽3 and k⫽6 the curves show a trade-off between earlier performance and later transition. For vari-able kvar(␣) 共Fig. 1兲 the best performance for potential based learn-ing is obtained. See共b兲 for details. For small␣ the unphysical RS, the RSB-1 with kvar, and the BB curves are numerically indistin-guishable.共b兲 Details of the learning curves. The black continuous line is obtained for kvar(␣) 共Fig. 1兲; it is the envelope, within the low generalization phase, of the family of algorithms obtained for k⭓2.

(6)

⫺xof1⫽2␣␬

2␲qo

⫺⬁ ⬁ DuH

⫺u

ln H

⫺u

共27兲 is obtained.

B. The high generalization phase

We now look at the value of the free energy in the ex-treme of the allowed interval for the order parameters. The determination of the potential cannot be postponed since the free energy depends on it explicitly.

Equation共23兲, which results from the variational prescrip-tion, can be transformed into an integral equation for the effective potentialE(t)⫽V(␭o)⫹(␭o⫺t)2/2x1:

Dtoe⫺xoE(t)

H

u

b , 共28兲 where t

qou

1⫺qoto and b

k 2q o 4␥2. 共29兲

An expression for the potential, obtained in the Appendix, is

E共t兲⫽⫺x1 o ln

Dx

H

t

⌫qo ⫺ix

共1⫺qo⌫qo

b , 共30兲 where the effect of the one-step RSB is seen to be the intro-duction of a noiselike term in the stability t, which depends on the existence of the other valleys. Its influence goes to zero as qo→1.

Despite its appearance, this expression is real, as can be seen by defining C共x,t兲⫽

DK

Kt

⌫qo

cos

xK

共1⫺qo⌫qo

, 共31兲 S共x,t兲⫽

DK

Kt

⌫qo

sin

xK

共1⫺qo⌫qo

. 共32兲 Finally we obtain E共t兲⫽⫺x1 o ln

dx

2␲exp⫺

1 2

1⫺b 1⫺qo ⌫qo

x2

⫻关C2共x,t兲⫹S2共x,t兲兴b/2cos

b tan⫺1S共x,t兲 C共x,t兲

, 共33兲 which can be used for numerical evaluations. It is easy to verify that the solution is real. At this point notice the struc-ture of the potential that emerges from the calculation. If qo⫽1, then we are back to the replica symmetric calculation; the x integral decouples and we are left with ⫺xoE(t) ⫽b ln H(⫺t/

⌫), which is very similar to the optimal poten-tials that have been found for the real-weights perceptron, both on line and off line, and for the binary perceptron with-out RSB. However RSB introduces a new noiselike element. One of the most striking features of this solution is that, like other variationally determined potentials, it depends on the values of the order parameters. This has been discussed elsewhere关32兴 and can be interpreted as the inclusion of the correct annealing along the learning process, generalizing the annealing of the learning rate that has been studied in on-line learning algorithms 共e.g., 关33,34,26兴兲. These order param-eters have the role of hyperparamparam-eters and will have the value determined by the solution of the saddle point equa-tions. The fact that these values may differ from those that are determined by the thermodynamics is of fundamental importance. The learning process thus proceeds in the fol-lowing way. Determine self-consistently the potential 共func-tional form and hyperparameters兲 that will lead to maximum performance in the low generalization phase in such a man-ner that the respective values of the order parameters are equal to the hyperparameters. Then minimize the potential by some dynamical process, letting the temperature go to zero. We do not worry here about thermalization times, as these might diverge for a NP problem. The final result may land on the perfect generalization phase and therefore the order parameters will not have the same values as the hyper-parameters. We denote the hyperparameters by starred quan-tities. Then we look at the limit qo,q1→1,␤→⬁,m→0 of Eq. 共14兲 with E*(t) calculated at the starred 共saddle point equation兲 values, which gives

f⫽2␣

0 ⬁

DtE*共t兲. 共34兲

C. Learning curves

Equations共15兲, 共16兲, 共25兲, and 共26兲 are used to build the low generalization learning curves. From the comparison of

FIG. 3. qo and␳ as a function of␣. Note that qo is different

from 1 as soon as ␣⫽0 but is so close that the learning curves under the RS and RSB-1 methods关see Fig. 2共a兲兴 are not very dif-ferent for small␣.

(7)

the free energies, Eqs. 共27兲 and 共34兲, the phase transition is located. These equations have to be complemented with a choice of k. We can just look at the numerical value of k that leads to the smallest generalization error for fixed ␣, which we called kvar(␣) 共Fig. 1兲. We look also at the results ob-tained for fixed k. A Bayesian statistician will not be sur-prised that the Bayes bounds are not beaten. Neither is the 共low phase兲 generalization error improved, nor the onset of the high generalization phase anticipated. These equations just lead in general to smaller errors than the Gibbs algo-rithm. However for small ␣ the RS, the RSB-1 关kvar(␣)兴, and the Bayesian best binary of Copelli et al. 关18兴 are very close. At this point this agreement is only numerical, but it is possible that these algorithms have the exact same optimal 共Bayesian兲 performance in the limit ␣→0, which is similar

to the results of 关18兴 for unsupervised learning.

Figure 4 shows the potential关Eq. 33兴 for different values of b. At b⫽1 the potential turns into the error-counting po-tential, which gives the Gibbs performance. Then replica symmetry is restored. Below the value b⫽1, the potential cannot be determined.

IV. DISCUSSION AND CONCLUSIONS

The learning curves shown in Fig. 2 show that the poten-tials obtained variationally fail to reach the Bayes bound. This is in contrast to the continuous weight perceptron, where the Bayes limit is obtained by a network with the same architecture as the teacher. As shown by Copelli et al. 关18兴 the Bayes algorithm is equivalent to a network with a weight vector given by the center of mass of the version space, which is not itself a binary vector. It follows that no method constrained to the hypercube will reach the Bayes limit. A similar failure to reach the Bayes limit was also reported by Winther et al. 关35兴 for multilayer networks, where again the Bayes algorithm cannot be matched within the space of students with the same architecture as the teacher.

The variational method probes potentials from within a restricted class and it is therefore natural not to expect to find Bayesian performances if the Bayes algorithm does not be-long to it. The performance of Bayesian inference restricted

to binary vectors is saturated only for small␣, but then even the simple Hebb algorithm has optimal generalization in this regime.

Within the low generalization phase the variational method is able to identify a class of potentials that lead to better performance than the Gibbs algorithm. There is a trade-off within the class, as can be seen in Figs. 2共a兲 and 2共b兲, between earlier transition to perfect generalization and better performance. A potential that leads to better generali-zation will have a delayed transition.

The potentials of this class do not work by imposing a zero memorization error, not even by minimization of the total error count. They tend to minimize the average overlap with the teacher, and since this is typically near the border 共just by the geometry of the N-dimensional space兲 the ap-proach to the teacher weight vector can be made through vectors outside the version space. The version space col-lapses to only one element at typically ␣⬃1.25, but the variational class, not considering the version space, will not detect this until later.

The replica symmetric variational study incorrectly pre-dicts a transition at␣⬍1. A stability analysis indicates that the replica symmetric ansatz is not adequate for all ␣⬎0. This is a little surprising, and the origin of the instability can be traced back to the requirement that ␤→⬁ implies q→1 for ␣⫽0. This is in contrast with the case of real weights where, even in the presence of multiplicative noise, learning with the variational potential is replica symmetric. The effect of noise can break phase space into disconnected regions, which are essentially ignored by the robust learning of the optimal potential, which disregards outliers. A method that insists on minimizing the memorization error will certainly lead in such conditions to a replica symmetry breaking situ-ation. The one-step broken replica symmetry leads to appar-ently consistent physical behavior. The stability analysis will be presented elsewhere 关28兴. A preliminary picture that emerges from such analysis is that, while one-step RSB is enough to give locally stable results and suggests a reason-able physical picture it may fail to be globally correct. We think that the RSB-1 calculation describes qualitatively the main features of this difficult problem, but a full continuous RSB scheme 关28兴 will be necessary to understand the ther-modynamic equilibrium bounds obtainable from the minimi-zation of a potential. Even this will not tell the complete story, however, since issues dealing with effective learning times will still remain. This work, together with the results of 关16–18,13兴, suggests that in the optimization of computa-tionally hard discrete problems it might be a better strategy to first leave the space of configurations, in this case the vertices of the hypercube, then optimize in the hypersphere, and finally go back to the original space, instead of striving to respect the discreteness constraints at every step.

ACKNOWLEDGMENTS

The authors thank M. Copelli and O. Kinouchi for impor-tant discussions and suggestions and also the participants of the Workshop on Statistical Mechanics, Max Planck Insti-tute, Dresden共1999兲 for discussions on this subject. E.B. and N.C. were partially sponsored by the Conselho Nacional de Desenvolvimento Cientı´fico e Tecnologico 共CNPq兲 and

FIG. 4. The effective potentialE(t) as a function of the stability

(8)

C.R.M. by the Fundac¸a˜o para o Desenvolvimento da Unesp 共FUNDUNESP兲.

APPENDIX

Equation共23兲 leads to an integral equation for the poten-tial:

Dtoexp关⫺xoE共t兲兴⫽

H

u

b ⬅A共u兲, 共A1兲 where t

qou

1⫺qoto, b⫽kˆ

qo/4␥2, and Dto ⫽(dto/

2␲)e⫺to 2

/2. To obtain the effective potential, note that the integral on the left side is a convolution, so it seems natural to perform a Fourier transformation. The fact that A(u) is not square integrable is bypassed by defining a new problem:

Dtoexp关⫺xoE共t兲兴⫽

H

u

b

⬅A共u兲 共A2兲 Here H

u

⫺⬁ ⬁ D yi⌰␰

yiu

where ⌰

yu

⫽exp

⫺␰

yu

2

if y⬎⫺u/

⌫ and 0 otherwise. Once E(t) is found, we will take the regularizing parameter␰to zero. After Fourier trans-forming, dividing by the Gaussian on the left, and Fourier transforming back, we get

E⬅exp关⫺xoE共t兲兴

qo

dk 2␲

e (1⫺qo)k2/2

eiqokuA共u兲du

e⫺ikt. 共A3兲 To be able to perform the above integrals we use a simple replica trick, which consists in considering b as an integer,

H

u

b

i⫽1 b

⫺u/冑⌫D yiexp

⫺␰

yu

2

i⫽1 b

⫺⬁ ⬁ D yi⌰␰

yiu

. 共A4兲 Introduce the Fourier transforms

共yi⫺x兲⫽

f共ri兲e iri(yi⫺x)dr i, e⫺yi 2/2

g共vi兲e⫺iviyidv i. The Fourier transform of A(u) is

eiqoku

H

u

b du

eiqoku

i⫽1 b

dyi

2␲dridvif共ri兲g共vi

⫻exp

i

ri

yiu

⫺i

viyi

册冎

du. The u and yiintegrals are now automatic. Going back to the equation for the potential

E

qodk 共2␲兲⫺b/2e (1⫺qo)k2⫺ikt/2

⫺⬁ ⬁

i⫽1 b drif共ri兲g共ri

qok

ri

, we now integrate over k:

E

⫺⬁ ⬁

i⫽1 b

2␲drif共ri兲g共ri

⫻exp

12 共1⫺q⌫q oo

ri

2

exp⫺i

ri

⌫qo t.

The linearization of the兺ri is done by a standard trick. Use the fact that g(r) is a Gaussian and that f(r) ⫽(1/2␲)兰⌰( y )e⫺iyrd y ; then E

dx

2␲e ⫺x2/2

dy

2␲⌰␰共y兲

⫺⬁ ⬁ dre⫺r2/2e⫺x[冑(1⫺qo)/⌫qo]re⫺iKr

b ,

where K⫽t/

⌫qo⫹y. Integrating over r, E

dx

2␲e ⫺x2/2

d y

2␲⌰␰共y兲exp 1 2

共1⫺qo⌫qo x2⫺K2 ⫹2ixK

共1⫺qo⌫qo

b

and b can be again taken to be real. Changing the integration variable to K and taking␰ to zero,

E0⫽

dx

2␲exp⫺

1 2

1⫺ 共1⫺qo⌫qo b

x2

DK

Kt

⌫qo

exp⫹ixK

共1⫺qo⌫qo

b .

Extending the definition of the H function to complex argu-ments, the potential can be written formally as Eq.共30兲.

(9)

关1兴 M. Mezard, G. Parisi, and M. Virassoro, Spin Glass Theory and Beyond共World Scientific, Singapore, 1987兲.

关2兴 T.L.H Watkin, A. Rau, and M. Biehl, Rev. Mod. Phys., 65, 499共1993兲.

关3兴 M. Opper and W. Kinzel, in Models of Neural Networks III, edited by E. Domany, J. L. van Hemmen, and K. Schulten 共Springer-Verlag, Berlin, 1996兲.

关4兴 A. Engels and C. Van den Broeck, Statistical Mechanics of Learning共Cambridge University Press, Cambridge, 2000兲. 关5兴 L. Pitt and L. Valiant, J. Assoc. Comput. Mach. 35, 965

共1988兲.

关6兴 E. Gardner and B. Derrida, J. Phys. A 21, 271 共1988兲. 关7兴 G. Gyo¨rgyi, Phys. Rev. Lett. 64, 2957 共1990兲.

关8兴 D. Haussler, M. Kearns, H.S. Seung, and N. Tishby, Mac. Learning 25, 195共1996兲.

关9兴 H. Horner, Z. Phys. B: Condens. Matter 86, 291 共1992兲; 87, 371共1992兲.

关10兴 M. Golea and M. Marchand, J. Phys. A 26, 5751 共1993兲. 关11兴 C. Van den Broeck and M. Bouten, Europhys. Lett. 22, 223

共1993兲.

关12兴 D. Bolle´ and G.M. Shim, Network Comput. Neural Syst. 6, 619共1995兲.

关13兴 L. Reimers, M. Bouten, and B. Van Rompaey, J. Phys. A 29, 6247共1996兲.

关14兴 M. Bouten, L. Reimers, and B. Van Rompaey, Phys. Rev. E

58, 2378共1998兲.

关15兴 R.W. Penney and D. Sherrington, J. Phys. A 26, 6173 共1993兲. 关16兴 M. Copelli and C. Van den Broeck, Phys. Rev. E 61, 6971

共2000兲.

关17兴 M. Copelli, M. Bouten, C. Van den Broeck, and B. Van Rom-paey, Europhys. Lett. 47, 139共1999兲.

关18兴 M. Copelli, C. Van den Broeck, and M. Opper, J. Phys. A 32, L555共1999兲.

关19兴 W. Kinzel and R. Urbanczik, J. Phys. A 31, L27 共1998兲. 关20兴 S. Solla and O. Winther, in On-Line Learning in Neural

Net-works, edited by D. Saad共Cambridge University Press, Cam-bridge, 1998兲.

关21兴 M. Opper, Phys. Rev. Lett. 77, 4671 共1996兲; in On-Line Learn-ing in Neural Networks共Ref. 关20兴兲.

关22兴 O. Kinouchi and N. Caticha, Phys. Rev. E 54, R54 共1996兲. 关23兴 G.J. Bex, R. Seneerls, and C. Van den Broeck, Phys. Rev. E

51, 6309共1995兲.

关24兴 M. Opper and D. Haussler, Phys. Rev. Lett. 66, 2677 共1991兲; in Proceedings of the IVth Annual Workshop on Computa-tional Learning Theory (COLT91) 共Morgan Kauffman, San Mateo CA, 1991兲.

关25兴 M. Bouten, J. Schieste, and C. Van den Broeck, Phys. Rev. E

52, 1958共1995兲.

关26兴 O. Kinouchi and N. Caticha, J. Phys. A 25, 6243 共1992兲. 关27兴 N. Caticha and E. Arau´jo de Oliveira, Universidade de Sa˜o

Paulo report共unpublished兲.

关28兴 E. Botelho, C. Mattos, and N. Caticha 共unpublished兲. 关29兴 W. Krauth and M. Mezzard, J. Phys. 共Paris兲 50, 3057 共1989兲. 关30兴 H.S. Seung, H. Sompolinsky, and N. Tishby, Phys. Rev. A 45,

6056共1992兲.

关31兴 A. Buhot, J.M. Torres Moreno, and M.B. Gordon, Phys. Rev. E 55, 7434共1997兲.

关32兴 N. Caticha and O. Kinouchi, Philos. Mag. B 77, 1565 共1998兲. 关33兴 S. Amari, Neural Networks 6, 161 共1993兲.

关34兴 N. Barkai, H.S. Seung, and H. Sompolinsky, Phys. Rev. Lett.

75, 1415共1995兲.

关35兴 O. Winther, B. Lautrup, and J-B. Zhang, Phys. Rev. E 55, 836 共1997兲.

Referências

Documentos relacionados

i) A condutividade da matriz vítrea diminui com o aumento do tempo de tratamento térmico (Fig.. 241 pequena quantidade de cristais existentes na amostra já provoca um efeito

Peça de mão de alta rotação pneumática com sistema Push Button (botão para remoção de broca), podendo apresentar passagem dupla de ar e acoplamento para engate rápido

Alguns ensaios desse tipo de modelos têm sido tentados, tendo conduzido lentamente à compreensão das alterações mentais (ou psicológicas) experienciadas pelos doentes

Extinction with social support is blocked by the protein synthesis inhibitors anisomycin and rapamycin and by the inhibitor of gene expression 5,6-dichloro-1- β-

didático e resolva as ​listas de exercícios (disponíveis no ​Classroom​) referentes às obras de Carlos Drummond de Andrade, João Guimarães Rosa, Machado de Assis,

Ao Dr Oliver Duenisch pelos contatos feitos e orientação de língua estrangeira Ao Dr Agenor Maccari pela ajuda na viabilização da área do experimento de campo Ao Dr Rudi Arno

Neste trabalho o objetivo central foi a ampliação e adequação do procedimento e programa computacional baseado no programa comercial MSC.PATRAN, para a geração automática de modelos

Ousasse apontar algumas hipóteses para a solução desse problema público a partir do exposto dos autores usados como base para fundamentação teórica, da análise dos dados