
Two approximations to learning from examples

L. Diambra (e-mail: diambra@uspif1.if.usp.br)

Instituto de Física, Universidade de São Paulo

C.P. 66318, cep 05315-970, São Paulo, Brazil

Received 22 June, 1998

Abstract

We investigate the learning of a rule from examples in the case of the boolean perceptron. Previous studies of this problem have relied on the full quenched theory. We consider here two alternative approaches that can be applied more easily. The two-replica interaction approach considerably improves upon the well-known first-order approach. The mean field approach reproduces some results that had previously been obtained only with the more involved full quenched theory. Both approximations are applied to perceptrons with continuous weights and with discrete weights.

I Introduction

The replica formulation of the statistical mechanics of disordered systems has become a major research tool in the study of complex systems [1]. In recent years, learning from examples in feedforward neural networks has been studied exhaustively within the framework of statistical mechanics [2, 3, 4, 5]. In this field, the replica trick (RT) has become a very useful tool for the investigation of learning and generalization processes in neural networks. The popularity of the replica method is based mainly on the elegance and simplicity of the formulation of the so-called quenched theory.

In the context of learning in single-layer feedforward neural networks, the full quenched theory has been applied successfully within the framework of replica symmetry (RS) and of one-step replica symmetry breaking [4, 6, 7]. However, it is possible to extract some information in a simpler way, borrowing techniques from other fields of physics and applying them to the replicated Hamiltonian. Recent studies using the high-temperature-limit approximation and the annealed approximation have been developed and applied to feedforward neural networks. These approaches are able to predict the right behavior in some cases, but cannot capture the disorder effects produced by the randomness of the examples. These effects become essential with decreasing temperature.

In the present effort we wish to study the generalization process in the perceptron with boolean output, the so-called boolean perceptron (BP), using two different approximations: the two-replica approximation (TRA), obtained by recourse to a perturbative expansion, and the mean field approximation (MFA). Our results can be extended to a perceptron with linear output.

We consider learning by a single-layered perceptron [8] within a statistical mechanics environment [9, 10, 11]. Our neural network (NN) has $N$ input units $S_i$ connected to a single output unit $\sigma$, whose state is given by $\sigma(\mathbf{S},\mathbf{W}) = g(N^{-1}\mathbf{W}\cdot\mathbf{S})$, where $g(x)$ is the transfer function. For each set $\mathbf{W}$ of weights, the NN maps $\mathbf{S}$ onto $\sigma$. Learning is said to take place whenever the $W_i$ are chosen so that $\sigma$ closely approaches the desired, correct map $\sigma_0(\mathbf{S}) = g(N^{-1}\mathbf{W}^0\cdot\mathbf{S})$. Within the supervised learning scheme [12] one reaches this goal by recourse to a cost function constructed on the basis of $P$ examples $\{\mathbf{S}^l, \sigma_0(\mathbf{S}^l)\}$, with $l = 1,\ldots,P$. Here we assume that the inputs $\mathbf{S}^l$ are selected at random from the input space according to the probability distribution $D(\mathbf{S})$ (taken here to be Gaussian).

The learning process has been regarded as a stochastic dynamics, associated with the minimization of an energy function $E_t$, where the NN weights evolve according to a Langevin-like relaxation prescription that leads to a Gibbsian probability distribution for the weights [13, 14, 15]
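(in standard form, a sketch assuming the usual Gibbs weight built on the cost function)
$$P(\mathbf{W}) \;=\; \frac{1}{Z}\, e^{-\beta E_t(\mathbf{W})},$$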

with $\beta = 1/T$ and $T$ a "temperature" characterizing the noise level in the learning process. The normalization factor $Z$ is the partition function. The training energy $E_t$ is defined by
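(a natural reading, summing the mistake function over the $P$ training examples)
$$E_t(\mathbf{W}) \;=\; \sum_{l=1}^{P} \varepsilon(\mathbf{W}, \mathbf{S}^l),$$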

where $\varepsilon(\mathbf{W},\mathbf{S})$ is the mistake function, a measure of the deviation between the actual and the correct outputs. Here we focus our attention upon perceptrons with binary output, for which $\varepsilon(\mathbf{W},\mathbf{S}) = \theta\big(-N^{-1/2}(\mathbf{W}\cdot\mathbf{S})(\mathbf{W}^0\cdot\mathbf{S})\big)$, where $\theta$ stands for the Heaviside step function.

The remainder of the paper is organized as follows: Section II is devoted to a brief recapitulation of basic concepts concerning the replica formulation for the BP. The two approximations that interest us here (TRA and MFA) are derived in Section III. The thermodynamics of the BP, with Ising weights and with continuous weights, is analyzed in Section IV. Finally, some conclusions are drawn in Section V.

II The Replica Method

The energy of the system depends upon the particular training examples selected. Therefore, the associated "macroscopic" observables are evaluated by a double averaging procedure involving two spaces: a thermal average over the weight space with probability distribution $P(\mathbf{W})$, to be denoted by $\langle \cdots \rangle_T$, and a so-called "quenched average" over all possible inputs, to be represented by $\ll \cdots \gg \;\equiv\; \int \prod_l d\mu(\mathbf{S}^l)\,(\cdots)$, where $d\mu(\mathbf{S}^l)$ is some measure. Here we are using the standard Gaussian measure, $d\mu(\mathbf{S}) = D(\mathbf{S})\, d\mathbf{S}$.

The NN free energy F is given in terms of the latter type of average by
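(a sketch of the standard definition in the notation above)
$$F \;=\; -\,T \ll \ln Z \gg,$$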

where
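(a sketch, assuming an a priori weight measure $d\mu(\mathbf{W})$)
$$Z \;=\; \int d\mu(\mathbf{W})\; e^{-\beta E_t(\mathbf{W})}$$
is the partition function already introduced above.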

The NN performance over the space of examples is characterized by the average generalization error $\epsilon_g$, while the performance related to the training set is given by the average training error $\epsilon_t$, i.e.
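(a sketch consistent with the double average defined above)
$$\epsilon_g(T,P) \;=\; \ll \langle \varepsilon(\mathbf{W}) \rangle_T \gg, \qquad \epsilon_t(T,P) \;=\; \frac{1}{P}\, \ll \langle E_t(\mathbf{W}) \rangle_T \gg,$$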

where $\varepsilon(\mathbf{W}) = \int d\mu(\mathbf{S})\, \varepsilon(\mathbf{W},\mathbf{S})$ is the generalization function. Graphs of $\epsilon_g(T,P)$ or $\epsilon_t(T,P)$ versus $\alpha = P/N$ are called learning curves.

The RT is the usual tool employed to evaluate the average over the examples [13, 14], and it originated within the context of spin glasses [15, 16]. The RT is recommended whenever it is feasible to evaluate averages of $Z$, but not of $\ln Z$. The RT exploits the identity
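(in one of its standard forms)
$$\ll \ln Z \gg \;=\; \lim_{n \to 0} \frac{\ll Z^n \gg \,-\, 1}{n},$$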

where $Z^n$ can be regarded as the partition function of $n$ identical non-interacting systems, copies of the original one. They are identified by the label $\gamma = 1, \ldots, n$. In performing the averaging process over the examples, coupling arises between the distinct copies.

From (3), (4) and (7) the free energy F becomes
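(schematically; a sketch assuming the examples are drawn independently, so that the quenched average factorizes over the $P$ patterns)
$$F \;=\; -\,T \lim_{n \to 0} \frac{1}{n} \left[\, \int \prod_{\gamma=1}^{n} d\mu(\mathbf{W}^\gamma)\; e^{-P H} \;-\; 1 \right],$$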

where the replicated Hamiltonian $H$ is an intensive quantity that does not scale with the system size $N$, and it is given by
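(a sketch consistent with the factorization above)
$$H \;=\; -\ln \int d\mu(\mathbf{S}) \prod_{\gamma=1}^{n} e^{-\beta\, \varepsilon(\mathbf{W}^\gamma, \mathbf{S})}.$$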

The evaluation of $H$ for the boolean perceptron is by now standard [4], so we present only the results

This replicated Hamiltonian depends on the weights through the order parameters $R_\gamma$ and $Q_{\gamma\delta}$ given by
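(the usual teacher-student and student-student overlaps, written here as a sketch)
$$R_\gamma \;=\; \frac{1}{N}\, \mathbf{W}^\gamma \cdot \mathbf{W}^0, \qquad Q_{\gamma\delta} \;=\; \frac{1}{N}\, \mathbf{W}^\gamma \cdot \mathbf{W}^\delta.$$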

Since $H(R_\gamma, Q_{\gamma\delta})$ depends on the weights only through the parameters $R_\gamma$ and $Q_{\gamma\delta}$ defined above, the replicated partition function can be written as an integral over these order parameters by introducing conjugate auxiliary parameters $\hat{R}_\gamma$ and $\hat{Q}_{\gamma\delta}$,

where
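(a sketch of the standard constrained-integral form)
$$s(R_\gamma, Q_{\gamma\delta}) \;=\; \frac{1}{N} \ln \int \prod_{\gamma} d\mu(\mathbf{W}^\gamma)\; \prod_{\gamma} \delta\big(\mathbf{W}^\gamma \cdot \mathbf{W}^0 - N R_\gamma\big) \prod_{\gamma < \delta} \delta\big(\mathbf{W}^\gamma \cdot \mathbf{W}^\delta - N Q_{\gamma\delta}\big)$$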

is the logarithm of the density of replicated networks with the overlaps $R_\gamma$ and $Q_{\gamma\delta}$.

In the thermodynamic limit $N \to \infty$ the integral (8) receives an overwhelming contribution from the saddle point with respect to the variables $R_\gamma$ and $Q_{\gamma\delta}$.

Here some physical reasoning is needed in order to simplify things. Since the replicas have no a priori physical meaning, it is reasonable to assume that all replicas have the same overlap with the teacher NN and that, further, the overlaps between two of them are symmetric under permutation of the replica indices. This assumption constitutes the RS ansatz. Therefore, we have
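(a sketch of the standard parametrization)
$$R_\gamma \;=\; R \quad \forall \gamma, \qquad Q_{\gamma\delta} \;=\; q \quad \forall\, \gamma \neq \delta,$$
with analogous substitutions, $\hat{R}_\gamma = \hat{R}$ and $\hat{Q}_{\gamma\delta} = \hat{q}$, for the conjugate parameters.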

Using these substitutions for $R_\gamma$ and $Q_{\gamma\delta}$ (and, in a similar way, for the conjugate parameters $\hat{R}_\gamma$ and $\hat{Q}_{\gamma\delta}$) in (10), performing the integrals over $\hat{R}$ and $\hat{q}$, and then evaluating the limit $n \to 0$, the training error for the boolean perceptron takes the form

where erf$(x)$ is the standard error function. The full thermodynamical study of the problem would involve considerable effort, so it is useful to study a simpler approach. Borrowing ideas from other fields of physics, we consider two approximations. Of course, we pay the customary price: each approximation is valid only within some appropriate range of $\beta$.

III Two approximations for the BP

A. Two-replica approximation

It is our goal here to introduce a perturbative treatment that enables one to incorporate the disorder effects produced by the randomness in the examples. We shall expand $H$, given by (9), in powers of $\beta$ and then retain the Hamiltonian that incorporates the two-replica interactions [17]
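Schematically, keeping terms through second order in $\beta$ (a sketch of the structure; the bookkeeping of the single-replica corrections follows ref. [17]),
$$H \;\simeq\; \beta \sum_{\gamma} H_1(\mathbf{W}^\gamma) \;-\; \frac{\beta^2}{2} \sum_{\gamma \neq \delta} H_2(\mathbf{W}^\gamma, \mathbf{W}^\delta),$$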

with
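(a sketch in the notation above)
$$H_1(\mathbf{W}) \;=\; \int d\mu(\mathbf{S})\, \varepsilon(\mathbf{W}, \mathbf{S}) \;\equiv\; \varepsilon(\mathbf{W}).$$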

$H_1$ represents the "non-random" part of the training energy and coincides with the generalization function $\varepsilon(\mathbf{W})$, which depends only upon the overlap $R$ and, for a BP, is given by $(1/\pi)\cos^{-1}(R)$ [4, 5]. On the other hand, $H_2$ represents the two-replica coupling arising from the randomness of the training examples. As $T$ diminishes, this coupling becomes more and more important, so that one needs to take the $H_2$ contributions into account. One has
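(a plausible second-cumulant form, consistent with the correlation integral evaluated in the Appendix)
$$H_2(\mathbf{W}^\gamma, \mathbf{W}^\delta) \;=\; \int d\mu(\mathbf{S})\, \varepsilon(\mathbf{W}^\gamma, \mathbf{S})\, \varepsilon(\mathbf{W}^\delta, \mathbf{S}) \;-\; \varepsilon(\mathbf{W}^\gamma)\, \varepsilon(\mathbf{W}^\delta).$$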

Of course, higher-order terms in $\beta$ are associated with three-replica couplings, four-replica couplings, and so on. Replicas can be regarded as particles with $N$ degrees of freedom. The first term in (15) describes the coupling of the particles to an external field, while the second one represents two-particle interactions via an effective potential depending upon the Hamming distance between the replicas.

The $H_2$ contributions lead to the consideration of the correlation integral
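which, spelling out the mistake functions (a sketch in the notation of the Introduction), is the probability that a random example is misclassified by both replicas $\gamma$ and $\delta$,
$$\int d\mu(\mathbf{S})\; \theta\big(-N^{-1/2}(\mathbf{W}^\gamma \cdot \mathbf{S})(\mathbf{W}^0 \cdot \mathbf{S})\big)\; \theta\big(-N^{-1/2}(\mathbf{W}^\delta \cdot \mathbf{S})(\mathbf{W}^0 \cdot \mathbf{S})\big).$$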

To second order, the replicated Hamiltonian in the TRA for the BP reads (see details in the Appendix)

where terms of second order in the total number of replicas $n$ have been eliminated. The relevant parameter here is $Q_{\gamma\delta}$, which does not appear in the high-temperature limit. The inverse temperature $\beta$ plays the role of a coupling constant, so it is reasonable to expect our expansion to yield an adequate treatment for $T > 1$. Using the RS approximation for $R_\gamma$, $Q_{\gamma\delta}$, and their conjugates, and passing to the limit $n \to 0$, we are in a position to write

where

B. Mean field approximation

In some cases, the coupling between replicas produces only minor changes in the learning curves and the phase diagrams. In other cases, such terms can lead to the appearance of qualitatively different phases at low temperatures. These phases are conveniently described by the properties of the matrix $Q_{\gamma\delta}$, which measures the overlap of the weights of two copies of the system. Since the replicated Hamiltonian is invariant under permutation of the replica indices, one would naively expect that $Q_{\gamma\delta} = q$ for all $\gamma \neq \delta$, where $q$ is given by
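(the standard thermal-overlap form, a sketch consistent with the teacher overlap written below)
$$q \;=\; \frac{1}{N} \ll \langle \mathbf{W} \rangle_T \cdot \langle \mathbf{W} \rangle_T \gg.$$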

This characterizes the typical overlap of the solutions to the constraints posed by the examples. As $\alpha$ increases, more and more correlations are found between the different solutions, and $q$ approaches unity. For $\alpha = \alpha_{cr}$ we have $q = 1$, and the concomitant degeneracy is broken. This parameter is known as the Edwards-Anderson parameter in spin glass theory, and reflects the degeneracy of the ground states. On the other hand, the expected value of the overlap with the teacher is given by $R = N^{-1} \ll \langle \mathbf{W} \rangle_T \gg \cdot\, \mathbf{W}^0$. Keeping this correspondence in mind, we consider a mean-field-like approximation: we substitute $q = R^2$ in (14), and the training error in the MFA becomes

We can see that this approximation takes into account the degeneracy of the ground states because it preserves the structure of the matrix $Q_{\gamma\delta}$ within the framework of replica symmetry.

IV Analysis of the Results

A. Ising-weights perceptron

Now, the expression (22) must be evaluated with the appropriate constraint on the weight space. First we consider a BP with Ising weights. In this case, the appropriate a priori measure for the weights is $d\mu(\mathbf{W}) = \prod_i dW_i\, [\,\delta(W_i - 1) + \delta(W_i + 1)\,]$, and the expression (22) becomes

The free energy function in the TRA is given by (20), with $\epsilon_t$ and $s$ given by (21) and (24), respectively. Extremizing the free energy with respect to the parameters $R$, $\hat{R}$, $q$, and $\hat{q}$, and eliminating $\hat{R}$ and $\hat{q}$, we obtain the pertinent saddle-point equations

In the limit $\beta \to 0$ we recover the high-temperature results and, as a bonus, the mean field ansatz $q = R^2$. This relationship cannot be obtained in the first-order treatment, which does not involve the parameter $q$. This result indicates that the mean field relationship is exact in the high-temperature limit.

On the other hand, since the MFA Hamiltonian depends only on the overlap $R$, the free energy function is given by (20), where $\epsilon_t$ is now given by (23) and $s$ is the logarithm of the density of perceptrons with overlap $R$, given by
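(a sketch of the standard binary-entropy form for Ising weights at overlap $R$ with the teacher)
$$s(R) \;=\; -\,\frac{1+R}{2}\, \ln \frac{1+R}{2} \;-\; \frac{1-R}{2}\, \ln \frac{1-R}{2}.$$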

The corresponding saddle-point equation is obtained by extremizing with respect to the variables $R$ and $\hat{R}$. Eliminating $\hat{R}$, we obtain the thermodynamical equilibrium state

Both equations (25) and (27) describe the first-order transition from a state of poor generalization to a state of perfect generalization with $R = 1$. Fig. 1 depicts the phase diagrams for both approximations. At any fixed $T$, to the left of the thermodynamic transition line ($\alpha < \alpha_{th}$) there are two solutions, one with $R = 1$ and one with $R < 1$. The state of poor generalization ($R < 1$) is the equilibrium state, while the state of perfect generalization ($R = 1$) is metastable. In the region between the thermodynamic transition line and the spinodal line (first-order transition), the situation reverses: $R = 1$ becomes the equilibrium state and $R < 1$ the metastable one. To the right of the spinodal line ($\alpha > \alpha_{sp}$) there is only one solution, with $R = 1$; there is no metastable state in this phase. Anomalies in the phase diagram arise at low temperatures ($T = 0.5$), an effect of the approximation made in the TRA. The phases of poor generalization, metastability, and perfect generalization are indicated in Fig. 1 by I, II, and III, respectively.

Figure 1.
Phase diagram obtained with the TRA and with the MFA. The full lines correspond to the TRA and the dashed lines to the MFA. The thermodynamical (Th.) and spinodal (Sp.) curves are indicated.

The training error in the TRA is given by (21), while in the MFA it is given by (23). On the other hand, the generalization error is given by $\epsilon_g = (1/\pi)\cos^{-1}(R)$. Fig. 2 displays the learning curves for both approximations.

Figure 2.
Learning curves for the Ising perceptron computed with TRA, MFA, and complete quenched theory (CQT) at T = 1. The full lines correspond to the generalization errors, the dashed lines correspond to the training errors.

Some features of our approach deserve particular mention. In Fig. 2, the spinodal transition at $T = 1$ takes place at $\alpha_{sp} = 2.25$ in the MFA, while in the TRA it takes place at $\alpha_{sp} = 2.95$, which agrees with the more elaborate complete quenched theory (CQT) [4]. Our results considerably improve upon the first-order approximation, for which $\alpha_{sp} = 2.08$. In addition, unlike the high-temperature-limit approximation, $\epsilon_t$ and $\epsilon_g$ are different in both the TRA and the MFA.

B. Spherical-weights perceptron

Now, we derive the equilibrium properties for the BP with spherical weights. In this case, the evaluation is somewhat more complicated. We write the a priori distribution as
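(a sketch of the usual normalization-constrained measure)
$$d\mu(\mathbf{W}) \;\propto\; d\mathbf{W}\; \delta\big(\mathbf{W} \cdot \mathbf{W} - N\big),$$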

and the entropy (22) is now given by

The additional parameter $\lambda$ is the Lagrange multiplier associated with the spherical constraint. Following the basic ideas presented in the previous sections, we derive the equilibrium states by extremizing the free energy (with the entropy now given by (29)) with respect to the parameters $R$, $\hat{R}$, $q$, $\hat{q}$, and $\lambda$. Eliminating $\hat{R}$, $\hat{q}$, and $\lambda$, we obtain the pertinent saddle-point equations for the TRA

In the limit $\beta \to 0$ the parameter $q$ vanishes. This result indicates that the coupling between replicas in the perceptron with spherical weights is weaker than in the case with Ising weights.

Similarly, the free energy function in the MFA can be written as (20), where $\epsilon_t$ is given by (23) and the entropy can be computed from the fraction of the weight space with overlap $R$, which is simply the volume of the $(N-2)$-dimensional sphere of radius $\sqrt{N(1-R^2)}$. In the thermodynamical limit we have $s = \frac{1}{2}\ln(1-R^2)$, and the thermodynamical equilibrium is given by the concomitant saddle-point equation

Unlike the Ising-weights perceptron, equations (30) and (31) do not exhibit a transition to perfect learning for any values of $T$ and $\alpha$. The learning curves fall off with a $1/\alpha$ tail for all $T$, in agreement with the correct power law. The asymptotic behavior of the generalization error in the MFA is

Note that at $T = 0$ the prefactor is 0.5, slightly below the correct prefactor 0.625, while the annealed approximation predicts 1.

The training error and the generalization error are displayed in Fig. 3 at T = 1 for both approximations.

Figure 3.
Learning curves for the spherical-weights perceptron computed with TRA, MFA, and the complete quenched theory (CQT) at T = 1. The full lines correspond to the generalization errors, the dashed lines correspond to the training errors.

V Conclusions

We have presented two approximations for the boolean perceptron. These approaches introduce, in different ways, the disorder effects produced by the random examples. They are able to reproduce the behavior of the system within the range of $\beta$ where the RS ansatz is valid. In addition, the TRA allows us to establish that the coupling between replicas in the perceptron with continuous weights is weaker than in the one with discrete weights.

In any case, we hope to have convinced the reader that these techniques are satisfactory tools for investigating, at not too low temperatures, the thermodynamics of the learning process.

VI Acknowledgments

The author wishes to thank Coraci P. Malta for helpful conversations. The work of L. D. was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq, Brazil).

VII Appendix

We undertake here the calculation of the correlations (18). We recast the first of them in the form

i.e.,

By recourse to the integral representation $\delta(x) = (2\pi)^{-1}\int dx'\, e^{i x x'}$ of the delta function, and remembering that $D(\mathbf{S}) = \prod_{i}^{N} (2\pi)^{-1/2} \exp(-S_i^2/2)$, the integration over $d\mathbf{S}$ leads to the (intermediate) result

Performing the integrations over the variables r and r', we obtain

Assuming the RS ansatz $R_\gamma = R$, $Q_{\gamma\delta} = q$, we find, on one hand, for the $n$ diagonal terms,

and, on the other hand, for the $n^2 - n$ off-diagonal terms (terms of second order in $n$ being neglected),

Therefore, within the RS ansatz the correlations are given by

These contributions come from the randomness in the examples.

References

[1] M. Mezard, G. Parisi, and M. A. Virasoro, Spin Glass Theory and Beyond (World Scientific, Singapore, 1987).

[2] E. Gardner, J. Phys. A 21, 257 (1988).

[3] E. Gardner and B. Derrida, J. Phys. A 21, 271 (1988).

[4] H.S. Seung, H. Sompolinsky, and N. Tishby, Phys. Rev. A 45, 6056 (1992); H. Sompolinsky, N. Tishby, and H.S. Seung, Phys. Rev. Lett. 65, 1683 (1990).

[5] T. Watkin, A. Rau, and M. Biehl, Rev. Mod. Phys. 65, 499 (1993).

[6] W. Krauth and M. Mezard, J. Phys. (Paris) 50, 3057 (1989).

[7] J.F. Fontanari and R. Meir, J. Phys. A 26, 1077 (1993).

[8] D.E. Rumelhart and J.L. McClelland, Parallel Distributed Processing (MIT Press, Cambridge, MA, 1986).

[9] N. Tishby, E. Levin, and S. Solla, in Proceedings of the International Joint Conference on Neural Networks (IEEE, New York, 1989), Vol. 2, pp. 403-409.

[10] E. Levin, N. Tishby, and S. Solla, Proc. IEEE 78, 1568 (1990).

[11] J.A. Hertz, in Statistical Mechanics of Neural Networks: Proceedings of the Eleventh Sitges Conference, edited by L. Garrido (Springer, Berlin, 1990).

[12] J. Schrager, T. Hogg, and B.A. Huberman, Science 242, 414 (1988).

[13] S.F. Edwards and P.W. Anderson, J. Phys. F 5, 965 (1975).

[14] G. Parisi, J. Phys. A 13,1101 (1980).

[15] D. Sherrington and S. Kirkpatrick, Phys. Rev. Lett. 35, 1792 (1975).

[16] S. Kirkpatrick and D. Sherrington, Phys. Rev. B 17, 4384 (1978).

[17] L. Diambra and A. Plastino, Phys. Rev. E 53, 3970 (1996).
