# Bayesian Neural Networks

Christopher M. Bishop
Neural Computing Research Group
Department of Computer Science and Applied Mathematics
Aston University, Birmingham, B4 7ET, U.K

Abstract: Bayesian techniques have been developed over many years in a range of different fields, but have only recently been applied to the problem of learning in neural networks. As well as providing a consistent framework for statistical pattern recognition, the Bayesian approach offers a number of practical advantages including a solution to the problem of over-fitting. This article provides an introductory overview of the application of Bayesian methods to neural networks. It assumes the reader is familiar with standard feed-forward network models and how to train them using conventional techniques.
Keywords: Bayesian techniques, statistical pattern recognition, feedforward networks

1 Introduction

Conventional approaches to network training are based on the minimization of an error function, and are often motivated by some underlying principle such as maximum likelihood. Such approaches can suffer from a number of deficiencies, including the problem of determining the appropriate level of model complexity. More complex models (e.g. ones with more hidden units or with smaller values of regularization parameters) give better fits to the training data, but if the model is too complex it may give poor generalization (the phenomenon of over-fitting). The usual approach is to set aside data to form a validation set and to optimize the model complexity to give the best validation set performance.

The Bayesian viewpoint provides a general and consistent framework for statistical pattern recognition and data analysis. In the context of neural networks, a Bayesian approach offers several important features including the following:

• The conventional approach to network training, based on the minimization of an error function, can be seen as a specific approximation to a full Bayesian treatment.
• Similarly, the technique of regularization arises in a natural way in the Bayesian framework. The corresponding regularization parameters can be treated consistently within the Bayesian setting, without the need for techniques such as cross-validation.
• For regression problems, error bars, or confidence intervals, can be assigned to the predictions generated by a network.
• For classification problems, the tendency of conventional approache s to make overconfident predictions in regions of sparse data can be avoided.
• Bayesian methods provide an objective and principled framework for dealing with the issue of model complexity (for example, how to select the number of hidden units in a feed-forward network), and avoids many of the problems of over-fitting which arise when using maximum likelihood.

In this paper we give an introductory account of Bayesian methods and their application to neural networks, with the focus on underlying principles rather than mathematical details. A more comprehensive introduction to the Bayesian treatment of neural networks can be found in Chapter 10 of [3].

1.1 Bayes' Theorem

We are quite used to the idea of dealing with uncertainty in our everyday lives. For example, we might believe that it is unlikely to rain tomorrow if the last few days have been sunny. However, if we then discover that a cold front is about to arrive, we might revise our views and decide that it is in fact quite likely to rain. Here we are discussing subjective beliefs, and the way they are modified when we obtain more information. We might seek to put such reasoning on a more formal footing, and to quantify our uncertainty by encoding the degrees of belief as real numbers. In a key paper, [4] showed that, provided we impose some simple consistency requirements, then these numbers obey the rules of conventional probability theory. In other words, if we use a value of 1 to denote complete certainty that an event will occur, and 0 to denote complete certainty that the event will not occur, with intermediate values representing corresponding degrees of belief, then these real values behave exactly like conventional (frequentist) probabilities.

Once our beliefs have been represented as probabilities they can be manipulated using two simple rules. Consider a pair of random variables A and B each of which can take on a number of discrete values. We denote by P(a,b) the joint probability that A = 3Da and B = 3Db . Using the product rule this joint probability can be expressed in the form

 P(a,b) = 3DP(b|a)P(a). (1)

Here P(b|a) denotes the conditional probability, in other words the probability that B = 3Db given that A = 3Da . We can similarly consider a conditional probability of the form P(a|b) . The quantity P(a) in (1) denotes the marginal probability, in other words the probability that A = 3Da irrespective of the value of B . The second relation between probabilities that we need to consider is the sum rule given by

 (2)

where the sum is over all possible values of b . From the product rule we obtain the following relation

 (3)

which is known as Bayes' theorem . Using the sum rule, we see that the denominator in (3) is given by

 (4)

and plays the role of a normalizing factor, ensuring that the probabilities on the left hand side of (3) sum to one. For continuous rather than discrete variables, the probabilities are replaced by probability densities, and summations are replaced by integrations.

We can consider P(a) to be the prior probability of A = 3Da before we observe the value of B , and P(a|b) to be the corresponding posterior probability after we have observed the value of B . Posterior probabilities play a central role in pattern recognition, and Bayes' theorem allows them to be re-expressed in terms of quantities which may be more easily calculated.

As we shall see, we can treat the problem of learning in neural networks from a Bayesian perspective simply by application of the above rules of probability. This leads to a unique formalism which in principle is simple to apply, and which can lead to some very powerful results. We shall also see, however, that the application of Bayesian inference to realistic problems presents many difficulties which require careful analytical approximations or sophisticated numerical approaches to resolve.

1.2 Model Comparison

As we have already indicated, a Bayesian approach allows the model complexity issue to be tackled using only the training data. To gain some insight into how this comes about, consider a hypothetical example of three different models, H 1 , H 2 and H 3 , which we suppose have steadily increasing flexibility, corresponding for instance to a steadily increasing number of hidden units. Thus, each model consists of a specification of the network architecture (number of units, type of activation function, etc.) and is governed by a number of adaptive parameters. By varying the values of these parameters, each model can represent a range of input-output functions. The more complex models, with a greater number of hidden units for instance, can represent a greater variety of such functions. We can find the relative probabilities of the different models, given a data set D , using Bayes' theorem in the form

 (5)

The quantity p(H i) represents a prior probability for model H i. If we have no particular reason to prefer one model over another, then we would assign equal priors to all of the models. Since the denominator p(D) does not depend on the model, we see that different models can be compared by evaluating p(D|H i) , which is sometimes called the evidence for the model H i [6]. This is illustrated schematically in Figure 1, where we see that the evidence favours models which are neither too simple nor too complex.

Figure 1: Schematic example of three models, Hi, H2 and H3 which have successively greater complexity, showing the probability (known as the evidence) of different data sets D given each model Hi. We see that more complex models can describe a greater range of data sets. Note, however, that the distributions are normalized. Thus, when a particular data set D0 is observed, the model H2 has a greater evidence than either the simpler model H1 or the more complex model H3.

1.3 Marginalization

An important concept in Bayesian inference is that of marginalization , which involves integrating out unwanted variables. Suppose we are discussing a model with two variables w and t . Then the most complete description of these variables is in terms of the joint distribution p(t,w) . If we are interested only in the distribution of t then we should integrate out w as follows:

(6)

Thus the predictive distribution for t is obtained by averaging the conditional distribution p(t|w) with a weighting factor given by the distribution p(w) . We shall encounter several examples of marginalization later in this paper.

2 Regression Problems

In this section we discuss the application of Bayesian methods to "regression", i.e. the prediction of the values of continuous output variables given the values of some input variables. The application of Bayesian methods to classification problems is described in [7] and [3]. We consider a feed-forward network model (for example a multi-layer perceptron) which maps an input vector x to an output value1 y , and which is governed by a vector w of adaptive parameters (weights and biases). The observed dataset D consists of N input vectors x n and corresponding targets tn , where n = 3D1,...,N .

2.1 Distribution of Weights

We begin by finding the posterior distribution p(w |D) of the weight vector w once we have observed the dataset D . Note that this description of our state of knowledge of the weights, in terms of a probability distribution, is in contrast to the conventional approach in which the weights in a trained network take specific values. We shall see shortly that this conventional description corresponds to a particular approximation to the Bayesian framework.

We can find the posterior distribution of weights through the use of Bayes' theorem in the form

 (7)

The conditional distribution of the data p(D | w ) can be regarded as a function of w in which case it is called the likelihood . We shall encounter a specific example shortly. The conventional approach to network training involves seeking a single weight vector w* which maximizes the likelihood function.

The picture of learning provided by the Bayesian formalism is as follows. We start with some prior distribution over the weights given by p(w ) . Since we generally have little idea at this stage of what the weight values should be, the prior might express some rather general properties such as smoothness of the network function, but will otherwise leave the weight values relatively unconstrained. The prior will therefore typically be a rather broad distribution, as indicated schematically in Figure 2. Once we have observed the data, this prior distribution can be converted to a posterior distribution using Bayes' theorem in the form (2.1). This posterior distribution will be more compact, as indicated in Figure 2, expressing the fact that we have learned something about the extent to which different weight values are consistent with the observed data.

Figure 2: Schematic plot of the prior distribution of weights p(w) and the posterior distribution p(w | D) which arise in the Bayesian inference of network parameters. The most probable weight vector wMP corresponds to the maximum of the posterior distribution. In practice the posterior distribution will typically have a complex structure with many local maxima.

In order to evaluate the posterior distribution we need to provide expressions for the prior distribution p(10#10w ) and for the likelihood function p(D | w ). One of the simplest choices for the prior is to assume that it is a zero-mean Gaussian function of the weights, of the form

 (8)

in which the normalization factor ZW(a) is given by

 (9)

where W represents the total number of weight parameters. Since the inverse variance 16#16 of the Gaussian controls the distribution of other parameters (weights and biases), it is called a hyperparameter . For the moment we shall assume that an appropriate value for a is known, and we shall return to the problem of how to treat a in Section 2.3. In practice, more complicated priors are often used, which may contain multiple hyperparameters.

Next we turn to the choice of likelihood function. This can be written down once we have specified a model for the distribution of target values for a given input vector. Again we consider a very simple example, namely a Gaussian with mean given by the output y(x ;w ) of the network, and variance governed by a parameter b-1 so that

 (10)

Again we will assume for the moment that the value of b is known. For the data set D we assume that the patterns are drawn independently from this distribution, and hence that the probabilities are multiplicative, so that

(11)

where the normalization factor ZD(b) is given by

(12)

Our main interest in using neural networks is to predict the values of the output variable for new values of the input variables. From the discussion of Section 1.3, we see that such predictions should be made by integrating over the weight parameters, so that

 (13)

If the posterior distribution p(w |D) is sharply peaked about a maximum w MP , as indicated schematically in Figure 2, then we can approximate the integral on the right hand side of (13) by p(t|x , w MP) . This corresponds to a conventional approach in which predictions are made with the weight vector set to a specific value. To make use of this in practice we need to determine the most probable weight vector w MP . Instead of finding a maximum of the posterior probability, it is usually more convenient to seek instead a minimum of the negative logarithm of the probability which is generally called an error function (these two procedures are equivalent since the negative logarithm is a monotonic function). For the particular prior distribution (8) and likelihood function (11) we see that, neglecting additive constant terms, the negative log probability is given by

 (14)

Up to an overall multiplicative factor, this is just the usual sum-of-squares error function with a "weight decay" regularization term. Thus we have derived the conventional error minimization approach to learning in neural networks as a particular approximation to the Bayesian framework.

2.2 Error Bars

As we have indicated, the Bayesian approach should take account not just of the most probable weight vector, but of the complete posterior distribution of weight vectors. In practice, the required integration over w given by (2.1) is analytically intractable, and so we either need to use numerical techniques, as discussed in Section 4, or to make approximations. Here we consider an approach based on assuming that the posterior can be represented as a Gaussian centred on wMP [8]. This will allow us to predict not only the most probable value of the output vector, but also to assign error bars to this prediction.

We can make a Gaussian approximation to the posterior distribution by representing the error function E(w) in (2.1) by a Taylor expansion around wMP and keeping terms up to second order, so that

 (15)

where A is called the Hessian matrix and consists of the second derivatives of the error function with respect to the weights. Note that the first derivative term is absent from (15) since we are expanding around a local minimum of E(w) . The Hessian matrix can be evaluated exactly and efficiently for a network having an arbitrary feed-forward topology using an extension of the back-propagation procedure [2].

Some partial justification for this approximation comes from the result of [15], which says that, under very general circumstances, a posterior distribution will tend to a Gaussian, and the variance of the Gaussian will tend to zero, in the limit where the number of data points goes to infinity. For very large data sets we might then expect the Gaussian approximation to be a good one. However, the primary motivation for the Gaussian approximation is that it greatly simplifies the analysis.

Even with this Gaussian approximation we still cannot evaluate (2.1) analytically. We therefore assume that the posterior distribution is relatively narrow and hence that the network function does not vary too much over the region of significant probability density. This allows us to approximate the network function y(x ; w) by its linear expansion around w MP

 (16)

where Dw = w - w MP and

 (17)

The integration in (2.1) now becomes Gaussian and can be evaluated analytically with the result

 (18)

This distribution has a mean given by , and a variance given by

 (19)

The first term in (19) arises from the intrinsic noise on the data, and is governed by the parameter b. The second term arises from the uncertainty in the weights, and would go to zero in the limit of an infinite data set. We can use (19) to assign error bars to network predictions, as illustrated for a toy problem in Figure 3.

Figure 3: A simple example of the application of Bayesian methods to a "regression" problem. Here 30 data points have been generated by sampling the function h(x) = 0.5 + 0.4 sin (2p x), and the network consists of a multi-layer perceptron with four hidden units having "yanh" activation functions, and one linear output unit. The solid curve shows the network function with the weight vector set to wMP corresponding to the maximum of the posterior distribution, and the dashed curves represent the ± 2st error bars calculated using (19).

It can be seen from Figure 3 that the error bars are larger in regions where there is little data [16], as we might expect.

2.3 Hyperparameters

So far we have assumed that the hyperparameters a and b are known2. In practice we will generally have little idea of what values these parameters should take. The treatment of hyperparameters is not trivial since a straightforward maximum likelihood approach would give over-fitted models which have poor generalization. For example, the best fit to the data is obtained with a very small value of a allowing the network function to give over-fitted solutions.

As we have discussed already, the correct Bayesian treatment for parameters such as a and b, whose values are unknown, is to integrate them out of any predictions. For example, the posterior distribution of network weights is given by

(20)

Note that we have extended our notation to include dependencies on a and b explicitly in the posterior weight distribution.

In general the integrations required by (20) will be intractable. One analytical approach, known as the evidence approximation [6,8], involves fixing the hyper-parameters to specific values determined by maximization of the evidence p(D | a, b) which itself is found by integration over the weights using the Gaussian approximation. This approach, and its limitations, are discussed in detail in [3]. It is computationally equivalent to the type II maximum likelihood (ML-II) method of conventional statistics [1].

3 Model Comparison Revisited

As discussed in Section 1.2, the Bayesian approach automatically penalizes highly complex models and so is able to pick out an optimal model without resorting to the use of independent data as in methods such as cross-validation. Models can be compared using the evidence (the probability of the observed data set under the given model), which can be evaluated by using

 (21)

The quantity p(D | a, b, H i) is just the evidence for a and b, with the dependence on the model again made explicit. The integration in (21) can be performed by making a Gaussian approximation for p(D | a, b, H i) as a function of a and b around their most probable values. This leads to an expression for the model evidence which involves the determinant of the Hessian matrix.

In practice the accurate evaluation of the evidence can prove to be very difficult. One of the reasons for this is that the determinant of the Hessian is given by the product of the eigenvalues and so is very sensitive to errors in the small eigenvalues. This is in contrast to the determination of the hyperparameters, within the evidence approximation, which effectively involves the trace of the Hessian given by the sum of the eigenvalues.

In our discussions of regression problems in Section 2, we approximated the posterior distribution of weights by a single Gaussian centred on a maximum of the distribution. In practice, we know that the posterior distribution will be multi-modal and that there will often be many local maxima (corresponding to local minima of the error function). To take account of this we can consider an approximation consisting of a Gaussian centred on each of the local maxima found by our optimization algorithm. The posterior distribution of the weights can be represented as

(22)

where mi denotes one of the local maxima. This distribution is used to determine other quantities by integration over the whole of weight space. For instance, the mean network output is given by

(23)

where Gi denotes the region of weight space surrounding the ith local maximum, and is the corresponding network prediction averaged over this region. Thus we see that the overall prediction is given by a linear combination of the predictions of each of the local solutions separately. Such combinations of trained networks are known as committees . In practice, the coefficients are difficult to evaluate accurately (since they correspond to model evidences) and so the committee coefficients may be simply set equal to 1/L where L is the total number of minima, or may be chosen using cross-validation. It is well known that committees often have improved generalization performance, compared with that of a single model.

Similarly, we may wish to consider a number of different models, for example feed-forward networks with different numbers of hidden units. The correct Bayesian treatment again involves a summation over the predictions of all of the models, weighted by the posterior probabilities of the individual models. In practice, if some of the posterior probabilities are very small, the corresponding models can be omitted from the committee.

4 Monte Carlo Methods

In the conventional (maximum likelihood) approach to network training, the bulk of the computational effort is concerned with optimization, in order to find the minimum of an error function. By contrast, in the Bayesian approach, the central operations require integration over multi-dimensional spaces. For example, the evaluation of the distribution of network outputs involves an integral over weight space given by (13). Similarly, the treatment of hyperparameters also involves integrations, as indicated in (20). So far in this paper, we have concentrated on the use of a Gaussian approximation for the posterior distribution of the weights, which allows these integrals to be performed analytically. This also allows the problem of integration to be replaced again with one of optimization (needed to find the most probable weight vector wMP). If we wish to avoid the Gaussian approximation then we might seek numerical techniques for evaluating the required integrals directly.

Many standard numerical integration techniques, which can be used successfully for integrations over a small number of variables, are totally unsuitable for integrals of the kind we are considering, which involve integration over spaces of hundreds or thousands of weight parameters. For instance, if we try to sample weight space on some regular grid then, since the number of grid points grows exponentially with the dimensionality, the computational effort would be prohibitive. We resort instead to various forms of random sampling of points in weight space. Such methods are called Monte Carlo techniques.

The integrals we wish to evaluate take the form

 (24)

where p(w | D) represents posterior distribution of the weights, and F(w) is some integrand. The basic idea is to approximate (24) with the finite sum

 (25)

where {w i} represents a sample of weight vectors generated from the distribution p(w | D).

We must therefore face the task of generating a sample of vectors w representative of the distribution p(w | D) , which in general will not be easy. To do it effectively, we must search through weight space to find regions where p(w | D) is reasonably large. This can be done by considering a sequence of vectors, where each successive vector depends on the previous vector as well as having a random component. Such techniques are called Markov chain Monte Carlo methods, and are reviewed in [11]. The simplest example is a random walk in which at successive steps we have

 (26)

where e is some small random vector, chosen for instance from a spherical Gaussian distribution having a small variance parameter. Note that successive vectors generated in this way will no longer be independent. As a result of this dependence, the number of vectors needed to achieve a given accuracy in approximating an integral using (25) may be much larger than if the vectors had been independent. We can arrange for the distribution of weight vectors to correspond to p(w | D) by making use of the Metropolis algorithm [10], which was developed to study the statistical mechanics of physical systems. The idea is to make candidate steps of the form (26), but to reject a proportion of the steps which lead to a reduction in the value of p(w | D) . This must be done with great care, however, in order to ensure that resulting sample of weight vectors represents the required distribution. In the Metropolis algorithm this is achieved by using the following criterion:

if accept

if accept with

probability (27)

In the case of the integrals needed for Bayesian neural networks, however, this approach can still prove to be deficient due to the strong correlations in the posterior distribution, as illustrated in Figure 4.

Figure 4: When the standard Metropolis algorithm is applied to the evaluation of integrals in the Bayesian treatment of neural networks, a large proportion of the candidate steps are rejected due to the high correlations in the posterior distribution. Starting from the point wold, almost all potential steps (shown by the arrows) will lead to a decrease in p(w | D). This problem bacomes more severe in spaces of higher dimensionality.

This problem can be tackled by taking account of information concerning the gradient of p(w | D) and using this to choose search directions which favour regions of high posterior probability. For neural networks, the gradient information is easily obtained using standard back-propagation (recall that - lnp(w | D) is an error function). Great care must be taken to ensure that the gradient information is used in such a way that the distribution of weight vectors which is generated corresponds to the required distribution. A procedure for achieving this, known as hybrid Monte Carlo, was developed by [5], and was applied to the Bayesian treatment of neural networks by Neal [12].

By using the hybrid Monte Carlo algorithm it is possible to generate a suitable sample of weight vectors wi for practical applications of neural networks in reasonable computational time. For a given test input vector x , the corresponding network predictions y(x; w i) represent a sample from the distribution p(y | x ,D) . This allows the uncertainties on the network outputs, associated with a new input vector, to be assessed. Hyperparameters can be integrated over at the same time as the weights, using the technique of Gibbs sampling .

Estimation of the model evidence, however, remains a challenging problem. Another significant problem with Monte Carlo methods is the difficulty in defining a suitable termination criterion. Despite these drawbacks, Monte Carlo techniques offer a promising approach to Bayesian inference in the context of neural networks.

5 Conclusions

In this paper we have shown how the problem of learning in neural networks can be treated in a consistent, probabilistic framework using the techniques of Bayesian inference. We have also indicated how conventional approaches based on error minimization, represent a particular approximation to the Bayesian framework. Careful comparison of Bayesian methods with conventional techniques [12,13] show that Bayesian methods generally outperform conventional approaches on difficult tasks. Applications of these methods to large scale problems are beginning to emerge [9,14] and there can be little doubt that the Bayesian framework will play an increasingly important role in neural computing.

References

[1] J. O. Berger. Statistical Decision Theory and Bayesian Analysis (Second ed.) New York : Springer-Verlag, 1985.

[2] C. M. Bishop. Exact calculation of the Hessian matrix for the multilayer perceptron. Neural Computation 4(4):494-501, 1992         [ Links ]

[3] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995         [ Links ]

[4] R. T. Cox. Probability, frequency and reasonable expectation. American Jornal of Physics 14(1):1-13, 1946.         [ Links ]

[5] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Rowet. Hybrid Monte Carlo. Physics Letters B 195(2):216-222, 1987.

[6] D. J. C. MacKay. Bayesian interpolation. Neural Computation 4 (3): 415-447, 1992         [ Links ]

[7] D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation 4(5):720-736, 1992.         [ Links ]

[8] D. J. C. MacKay. A practical Bayesian framework for back-propagation networks. Neural Computation 4(3):448-472, 1992.         [ Links ]

[9] D. J. C. MacKay. Bayesian non-linear modelling for the 1993 energy prediction competition. In G. Heidbreder (Ed.), Maximum Entropy and Bayesian Methods, Santa Barbara 1993. Dordrecht: Kluwer, 1995         [ Links ]

[10] N. A. Metropolis, W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Jornal of Chemical Physics 21(6):1087-1092, 1953.         [ Links ]

[11] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRC-TR-93-1, Department of Computer Science, University of Toronto, Canada, 1993         [ Links ]

[12] R. M. Neal. Bayesian Learning for Neural Networks. Springer. Lecture Notes in Statistics 118, 1996         [ Links ]

[13] C. E. Rasmussen. A pratical Monte Carlo implementation of Bayesian learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems, Volume 8: 598-604. Cambridge, MIT Press, 1996         [ Links ]

[14] H. H. Thodberg. Ace of Bayes: application of neural network with pruning. Technical Report 1132E, The Danish Meat Research Institute, Roskilde, Denmark, 1993         [ Links ]

[15] A. M. Walker. On the asymptotic behaviour of posterior distributions. Journal of the Royal Statistical Society, B 31 (1): 80-88, 1969         [ Links ]

[16] C. K. I. Williams, C. Qazaz, C. M. Bishop, and H. Zhu. On the relationship between Bayesian error bars and the input data density. In Proceedings Fourth IEE International Conference on Artificial Neural Networks, pp. 160-165. 1995         [ Links ]

1 The extension to multiple output variables is straightforward.

2 Note that we will refer to b as a hyperparameter since, although it does not control the distribution of other parameters in the way that a does, it can be treated by similar techniques.