Johnny Deng's Column: K-fold Cross Validation

Cross validation is a method for estimating the true error of a model. When a model is built from training data, the error on the training data is a rather optimistic estimate of the error rates the model will achieve on unseen data. The aim of building a model is usually to apply the model to new, unseen data--we expect the model to generalise to data other than the training data on which it was built. Thus, we would like to have some method for better approximating the error that might occur in general. Cross validation provides such a method.

Cross validation is also used to evaluate a model in deciding which algorithm to deploy for learning, when choosing from amongst a number of learning algorithms. It can also provide a guide as to the effect of parameter tuning in building a model from a specific algorithm.

Test sample cross-validation is often a preferred method when there is plenty of data available. A model is built from a training set and its predictive accuracy is measured by applying the model a test set. A good rule of thumb is that a dataset is partitioned into a training set (66%) and a test set (33%).

To measure error rates you might build multiple models with the one algorithm, using variations of the same training data for each model. The average performance is then the measure of how well this algorithm works in building models from the data.

The basic idea is to use, say, 90% of the dataset to build a model. The data that was removed (the 10%) is then used to test the performance of the model on ``new'' data (usually by calculating the mean squared error). This simplest of cross validation approaches is referred to as the holdout method.

For the holdout method the two datasets are referred to as the training set and the test set. With just a single evaluation though there can be a high variance since the evaluation is dependent on the data points which happen to end up in the training set and the test set. Different partitions might lead to different results.

A solution to this problem is to have multiple subsets, and each time build the model based on all but one of these subsets. This is repeated for all possible combinations and the result is reported as the average error over all models.

In k-fold cross validation, sometimes called rotation estimation, the dataset D is randomly split into k mutually exclusive subsets (the folds) D1 ; D2 ; ...; Dk of approximately equal size. The inducer is trained and tested k times; each time t from {1; 2;... ; k}, it is trained on D - Dt and tested on Dt . The crossvalidation estimate of accuracy is the overall number of correct classifica
tions, divided by the number of instances in the dataset. Formally, let D (i) be the test set that includes instance x i = (vi ; yi), then the crossvalidation estimate of accuracy

The crossvalidation estimate is a random number that depends on the division into folds. Complete cross validation is the average of all

possibilities for choosing m-k instances out of m, but it is usually too expensive. Except for leaveoneone (nfold crossvalidation), which is always complete, kfold cross validation is estimating complete kfold crossvalidation using a single split of the data into the folds. Repeating crossvalidation multiple times using different splits into folds provides a better MonteCarlo estimate to the complete crossvalidation at an added cost. In stratified crossvalidation, the folds are stratified so that they contain approximately the same proportions of labels as the original dataset.

Johnny Deng's Column

Thursday, 4 October 2007

K-fold Cross Validation

No comments:

Site Search

Blog Archive

Who am I?

Access History