
Summary and Discussion

Both regularization and early stopping can ensure network generalization when properly applied.

When using Bayesian regularization, it is important to train the network until it reaches convergence. The sum squared error, the sum squared weights, and the effective number of parameters should reach constant values when the network has converged.
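As a minimal sketch of such a run (with made-up data and an arbitrary network size, not one of the benchmark networks described below), Bayesian regularization is selected by specifying trainbr as the training function; the progress report printed during training lists the sum squared error (SSE), the sum squared weights (SSW), and the effective number of parameters, all of which should level off at convergence.

    % Minimal Bayesian regularization sketch (illustrative data only)
    p = -1:0.1:1;                                % example inputs
    t = p.^3 - 0.5*p + 0.02*randn(size(p));      % example noisy targets
    net = newff(minmax(p),[10 1],{'tansig','purelin'},'trainbr');
    net.trainParam.epochs = 1000;                % allow enough epochs to reach convergence
    net.trainParam.show = 50;                    % report SSE, SSW, and #Par every 50 epochs
    [net,tr] = train(net,p,t);                   % SSE, SSW, and #Par should become constant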

For early stopping, you must be careful not to use an algorithm that converges too rapidly. If you are using a fast algorithm (like trainlm), you want to set the training parameters so that the convergence is relatively slow (e.g., set mu to a relatively large value, such as 1, and set mu_dec and mu_inc to values close to 1, such as 0.8 and 1.5, respectively). The training functions trainscg and trainrp usually work well with early stopping.

With early stopping, the choice of the validation set is also important. The validation set should be representative of all points in the training set.
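For example, an early-stopping setup along these lines might look like the following sketch, which uses placeholder data and network size, holds out every fourth point so that the validation set spans the same range as the training data, and passes the validation data through the train(net,p,t,[],[],val) syntax of this toolbox.

    % Illustrative early-stopping setup with trainlm slowed down (placeholder data)
    p = -1:0.025:1;
    t = p.^2 + 0.1*randn(size(p));            % example noisy targets
    iVal = 2:4:length(p);                     % every 4th point: representative validation set
    iTrn = setdiff(1:length(p),iVal);
    val.P = p(iVal);  val.T = t(iVal);        % validation structure expected by train
    net = newff(minmax(p),[15 1],{'tansig','purelin'},'trainlm');
    net.trainParam.mu = 1;                    % large initial mu slows convergence
    net.trainParam.mu_dec = 0.8;              % decrease/increase factors close to 1
    net.trainParam.mu_inc = 1.5;
    [net,tr] = train(net,p(iTrn),t(iTrn),[],[],val);   % stops when validation error rises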

With both regularization and early stopping, it is a good idea to train the network starting from several different initial conditions. It is possible for either method to fail in certain circumstances. By testing several different initial conditions, you can verify robust network performance.
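A simple way to do this, sketched below under the assumption that the inputs p and targets t are already defined (for example, as in the sketches above), is to reinitialize the weights and retrain several times, keeping the network that achieves the best final performance.

    % Retrain from several random initial weight sets and keep the best network
    bestPerf = Inf;
    for k = 1:5
        net = newff(minmax(p),[10 1],{'tansig','purelin'},'trainbr');
        net = init(net);                      % fresh random initial weights
        [net,tr] = train(net,p,t);
        if tr.perf(end) < bestPerf            % compare final training performance
            bestPerf = tr.perf(end);
            bestNet = net;
        end
    end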

Based on our experience, Bayesian regularization generally provides better generalization performance than early stopping when training function approximation networks. This is because Bayesian regularization does not require that a validation data set be set aside from the training data; it uses all of the data. This advantage is especially noticeable when the data set is small.

To provide you with some insight into the performance of the algorithms, we tested both early stopping and Bayesian regularization on several benchmark data sets, which are listed in the following table.

Data Set Title   No. Pts.  Network  Description
BALL                   67  2-10-1   Dual-sensor calibration for a ball position measurement
SINE (5% N)            41  1-15-1   Single-cycle sine wave with Gaussian noise at the 5% level
SINE (2% N)            41  1-15-1   Single-cycle sine wave with Gaussian noise at the 2% level
ENGINE (ALL)         1199  2-30-2   Engine sensor data, full data set
ENGINE (1/4)          300  2-30-2   Engine sensor data, 1/4 of the data set
CHOLEST (ALL)         264  5-15-3   Cholesterol measurement, full data set
CHOLEST (1/2)         132  5-15-3   Cholesterol measurement, 1/2 of the data set

These data sets are of various sizes, with different numbers of inputs and targets. With two of the data sets, we trained the networks once using all of the data and then retrained them using only a fraction of the data. This illustrates how the advantage of Bayesian regularization becomes more noticeable as the data set shrinks. All of the data sets were obtained from physical systems except for the SINE data sets, which were created artificially by adding different levels of Gaussian noise to a single cycle of a sine wave. The performance of the algorithms on these two data sets illustrates the effect of noise.
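Although the exact construction of the benchmark sets is not reproduced here, a noisy single-cycle sine data set of this kind can be generated along the following lines (41 points, with the noise level chosen to suit the 5% or 2% case).

    % Illustrative noisy single-cycle sine data (41 points)
    p = -1:0.05:1;                            % 41 input points over one cycle
    noiseLevel = 0.05;                        % 5% noise; use 0.02 for the 2% case
    t = sin(2*pi*p) + noiseLevel*randn(size(p));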

The following table summarizes the performance of Early Stopping (ES) and Bayesian Regularization (BR) on the seven test sets. (The trainscg algorithm was used for the early stopping tests. Other algorithms provide similar performance.)

Mean Squared Test Set Error

Method   Ball     Engine (All)  Engine (1/4)  Choles (All)  Choles (1/2)  Sine (5% N)  Sine (2% N)
ES       1.2e-1   1.3e-2        1.9e-2        1.2e-1        1.4e-1        1.7e-1       1.3e-1
BR       1.3e-3   2.6e-3        4.7e-3        1.2e-1        9.3e-2        3.0e-2       6.3e-3
ES/BR    92       5             4             1             1.5           5.7          21

We can see that Bayesian regularization performs better than early stopping in most cases. The performance improvement is most noticeable when the data set is small or when there is little noise in the data. The BALL data set, for example, was obtained from sensors that had very little noise.

Although the generalization performance of Bayesian regularization is often better than early stopping, this is not always the case. In addition, the form of Bayesian regularization implemented in the toolbox does not perform as well on pattern recognition problems as it does on function approximation problems. This is because the approximation to the Hessian that is used in the Levenberg-Marquardt algorithm is not as accurate when the network output is saturated, as would be the case in pattern recognition problems. Another disadvantage of the Bayesian regularization method is that it generally takes longer to converge than early stopping.

