Hi all, this blog is an English archive of my PhD experience at Imperial College London, mainly logging my research and day-to-day work, along with some visual records.

Wednesday, 28 November 2007

What is overfitting and how can I avoid it?


The critical issue in developing a neural network is generalization: how
well will the network make predictions for cases that are not in the
training set? NNs, like other flexible nonlinear estimation methods such as
kernel regression and smoothing splines, can suffer from either underfitting
or overfitting. A network that is not sufficiently complex can fail to
detect fully the signal in a complicated data set, leading to underfitting.
A network that is too complex may fit the noise, not just the signal,
leading to overfitting. Overfitting is especially dangerous because it can
easily lead to predictions that are far beyond the range of the training
data with many of the common types of NNs. Overfitting can also produce wild
predictions in multilayer perceptrons even with noise-free data.
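To make the contrast concrete, here is a minimal sketch (my own illustration,
not part of the original FAQ) using polynomial regression as a stand-in for
networks of increasing complexity: a degree-1 fit underfits a noisy sine
signal, while a degree-9 fit starts chasing the noise.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.sort(rng.uniform(0, 1, 20))
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
    x_test = np.linspace(0, 1, 200)
    y_true = np.sin(2 * np.pi * x_test)             # noise-free signal

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
        y_pred = np.polyval(coeffs, x_test)
        mse = np.mean((y_pred - y_true) ** 2)
        print(f"degree {degree}: test MSE = {mse:.3f}")  # 1 underfits, 9 overfits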

For an elementary discussion of overfitting, see Smith (1996). For a more
rigorous approach, see the article by Geman, Bienenstock, and Doursat (1992)
on the bias/variance trade-off (it's not really a dilemma). We are talking
about statistical bias here: the difference between the average value of an
estimator and the correct value. Underfitting produces excessive bias in the
outputs, whereas overfitting produces excessive variance. There are
graphical examples of overfitting and underfitting in Sarle (1995, 1999).
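The trade-off can also be checked numerically. The following small simulation
(my own sketch, with arbitrary choices of true function and noise level)
refits a model on many fresh training sets and measures, at a single test
point, the squared bias (average prediction minus truth) and the variance of
the predictions: the simple model has high bias and low variance, the
flexible model the reverse.

    import numpy as np

    rng = np.random.default_rng(1)

    def f(x):                                    # true underlying function
        return np.sin(2 * np.pi * x)

    def predict_at(x0, degree):
        """Fit on one fresh noisy training set, then predict at x0."""
        x = rng.uniform(0, 1, 20)
        y = f(x) + rng.normal(0, 0.2, 20)
        return np.polyval(np.polyfit(x, y, degree), x0)

    x0 = 0.25
    for degree in (1, 9):
        preds = np.array([predict_at(x0, degree) for _ in range(500)])
        bias = preds.mean() - f(x0)              # statistical bias at x0
        print(f"degree {degree}: bias^2 = {bias ** 2:.4f}, "
              f"variance = {preds.var():.4f}")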

The best way to avoid overfitting is to use lots of training data. If you
have at least 30 times as many training cases as there are weights in the
network, you are unlikely to suffer from much overfitting, although you may
get some slight overfitting no matter how large the training set is. For
noise-free data, 5 times as many training cases as weights may be
sufficient. But you can't arbitrarily reduce the number of weights for fear
of underfitting.
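These rules of thumb are easy to apply once you count the weights. Here is a
back-of-the-envelope helper (my own sketch; the 30x and 5x factors are the
ones quoted above):

    def mlp_weight_count(layer_sizes):
        """Weights plus biases in a fully connected MLP, e.g. [10, 20, 1]."""
        return sum((fan_in + 1) * fan_out        # +1 for the bias of each unit
                   for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

    w = mlp_weight_count([10, 20, 1])            # 10 inputs, 20 hidden, 1 output
    print(f"{w} weights -> ~{30 * w} cases (noisy targets), "
          f"~{5 * w} cases (noise-free targets)")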

Given a fixed amount of training data, there are at least six approaches to
avoiding underfitting and overfitting, and hence getting good
generalization:

o Model selection
o Jittering
o Early stopping
o Weight decay
o Bayesian learning
o Combining networks

The first five approaches are based on well-understood theory. Methods for
combining networks do not have such a sound theoretical basis but are the
subject of current research. These six approaches are discussed in more
detail under subsequent questions.
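As a concrete illustration of one of them, here is a minimal sketch of early
stopping (my own code, using a gradient-descent linear model as a stand-in
for a network): training halts once the error on a held-out validation set
has not improved for a fixed number of steps, and the best weights seen so
far are kept.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 5))
    y = X @ rng.normal(size=5) + rng.normal(0, 0.5, 100)
    X_tr, y_tr, X_va, y_va = X[:70], y[:70], X[70:], y[70:]

    w = np.zeros(5)
    best_w, best_err, patience, wait = w.copy(), np.inf, 10, 0
    for step in range(10_000):
        grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)  # squared-error gradient
        w -= 0.01 * grad
        val_err = np.mean((X_va @ w - y_va) ** 2)
        if val_err < best_err:                   # validation still improving
            best_w, best_err, wait = w.copy(), val_err, 0
        else:
            wait += 1
            if wait >= patience:                 # stalled: stop, keep best_w
                break
    print(f"stopped at step {step}, best validation MSE = {best_err:.3f}")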

The complexity of a network is related to both the number of weights and the
size of the weights. Model selection is concerned with the number of
weights, and hence the number of hidden units and layers. The more weights
there are, relative to the number of training cases, the more overfitting
amplifies noise in the targets (Moody 1992). The other approaches listed
above are concerned, directly or indirectly, with the size of the weights.
Reducing the size of the weights reduces the "effective" number of
weights--see Moody (1992) regarding weight decay and Weigend (1994)
regarding early stopping. Bartlett (1997) obtained learning-theory results
in which generalization error is related to the L_1 norm of the weights
instead of the VC dimension.
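A minimal sketch of weight decay (my own code) shows how the size of the
weights is controlled directly: the L2 penalty lambda*||w||^2 contributes
lambda*w to the gradient, continuously shrinking every weight toward zero, so
a larger lambda gives a smaller weight norm and hence a smaller "effective"
number of weights in Moody's sense.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 10))
    y = X[:, 0] + rng.normal(0, 0.3, 50)         # only the first input matters

    for lam in (0.0, 0.1, 1.0):
        w = np.zeros(10)
        for _ in range(5000):
            grad = X.T @ (X @ w - y) / len(y) + lam * w   # data term + decay
            w -= 0.05 * grad
        print(f"lambda = {lam:4.1f}: ||w|| = {np.linalg.norm(w):.3f}")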

Overfitting is not confined to NNs with hidden units. Overfitting can occur
in generalized linear models (networks with no hidden units) if either or
both of the following conditions hold:

1. The number of input variables (and hence the number of weights) is large
with respect to the number of training cases. Typically you would want at
least 10 times as many training cases as input variables, but with
noise-free targets, twice as many training cases as input variables would
be more than adequate. These requirements are smaller than those stated
above for networks with hidden layers, because hidden layers are prone to
creating ill-conditioning and other pathologies.

2. The input variables are highly correlated with each other. This condition
is called "multicollinearity" in the statistical literature.
Multicollinearity can cause the weights to become extremely large because
of numerical ill-conditioning--see "How does ill-conditioning affect NN
training?"

Methods for dealing with these problems in the statistical literature
include ridge regression (similar to weight decay), partial least squares
(similar to early stopping), and various methods with even stranger names,
such as the lasso and garotte (van Houwelingen and le Cessie, ????).
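Ridge regression is easy to demonstrate (my own sketch, not from the FAQ):
with two nearly identical inputs, ordinary least squares produces large,
unstable weights of opposite sign, while the ridge penalty term lambda*I
keeps them small and sensible.

    import numpy as np

    rng = np.random.default_rng(4)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(0, 0.001, 100)          # almost a copy of x1
    X = np.column_stack([x1, x2])
    y = x1 + rng.normal(0, 0.1, 100)

    for lam in (0.0, 1.0):
        # closed form: w = (X'X + lam*I)^(-1) X'y; lam = 0 is plain OLS
        w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
        print(f"lambda = {lam}: weights = {np.round(w, 2)}")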

References:

Bartlett, P.L. (1997), "For valid generalization, the size of the weights
is more important than the size of the network," in Mozer, M.C., Jordan,
M.I., and Petsche, T., (eds.) Advances in Neural Information Processing
Systems 9, Cambridge, MA: The MIT Press, pp. 134-140.

Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
the Bias/Variance Dilemma", Neural Computation, 4, 1-58.

Moody, J.E. (1992), "The Effective Number of Parameters: An Analysis of
Generalization and Regularization in Nonlinear Learning Systems", in
Moody, J.E., Hanson, S.J., and Lippmann, R.P., (eds.) Advances in Neural
Information Processing Systems 4, 847-854.

Sarle, W.S. (1995), "Stopped Training and Other Remedies for
Overfitting," Proceedings of the 27th Symposium on the Interface of
Computing Science and Statistics, 352-360,
ftp://ftp.sas.com/pub/neural/inter95.ps.Z (a large compressed
PostScript file, 747 KB, 10 pages)

Sarle, W.S. (1999), "Donoho-Johnstone Benchmarks: Neural Net Results,"
ftp://ftp.sas.com/pub/neural/dojo/dojo.html

Smith, M. (1996). Neural Networks for Statistical Modeling, Boston:
International Thomson Computer Press, ISBN 1-850-32842-0.

van Houwelingen, H.C., and le Cessie, S. (????), "Shrinkage and penalized
likelihood as methods to improve predictive accuracy,"
http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/shrinkage.pdf and
http://www.medstat.medfac.leidenuniv.nl/ms/HH/Files/figures.pdf

Weigend, A. (1994), "On overfitting and the effective number of hidden
units," Proceedings of the 1993 Connectionist Models Summer School,
335-342.

Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC,
USA. Answers provided by other authors as cited below are copyrighted by
those authors, who by submitting the answers for the FAQ give permission for
the answer to be reproduced as part of the FAQ in any of the ways specified
in part 1 of the FAQ.

1 comment:

Will Dwinnell said...

While this is good material, it appears that this "post" is nothing more than a direct copy of the Usenet comp.ai.neural-nets FAQ. ???