Hi all, This Blog is an English archive of my PhD experience in Imperial College London, mainly logging my research and working process, as well as some visual records.

Thursday 4 October 2007

K-fold Cross Validation

Cross validation is a method for estimating the true error of a model. When a model is built from training data, the error on the training data is a rather optimistic estimate of the error rates the model will achieve on unseen data. The aim of building a model is usually to apply the model to new, unseen data--we expect the model to generalise to data other than the training data on which it was built. Thus, we would like to have some method for better approximating the error that might occur in general. Cross validation provides such a method.

Cross validation is also used to evaluate a model in deciding which algorithm to deploy for learning, when choosing from amongst a number of learning algorithms. It can also provide a guide as to the effect of parameter tuning in building a model from a specific algorithm.

Test sample cross-validation is often a preferred method when there is plenty of data available. A model is built from a training set and its predictive accuracy is measured by applying the model a test set. A good rule of thumb is that a dataset is partitioned into a training set (66%) and a test set (33%).

To measure error rates you might build multiple models with the one algorithm, using variations of the same training data for each model. The average performance is then the measure of how well this algorithm works in building models from the data.

The basic idea is to use, say, 90% of the dataset to build a model. The data that was removed (the 10%) is then used to test the performance of the model on ``new'' data (usually by calculating the mean squared error). This simplest of cross validation approaches is referred to as the holdout method.

For the holdout method the two datasets are referred to as the training set and the test set. With just a single evaluation though there can be a high variance since the evaluation is dependent on the data points which happen to end up in the training set and the test set. Different partitions might lead to different results.

A solution to this problem is to have multiple subsets, and each time build the model based on all but one of these subsets. This is repeated for all possible combinations and the result is reported as the average error over all models.


In k-­fold cross ­validation, sometimes called rotation estimation, the dataset D is randomly split into k mutually exclusive subsets (the folds) D1 ; D2 ; ...; Dk of approximately equal size. The inducer is trained and tested k times; each time t from {1; 2;... ; k}, it is trained on D - Dt and tested on Dt . The cross­validation estimate of accuracy is the overall number of correct classifica­
tions, divided by the number of instances in the dataset. Formally, let D (i) be the test set that includes instance x i = (vi ; yi), then the cross­validation estimate of accuracy
The cross­validation estimate is a random number that depends on the division into folds. Complete cross ­validation is the average of all

possibilities for choosing m-k instances out of m, but it is usually too expensive. Except for leave­one­one (n­fold cross­validation), which is always complete, k­fold cross­ validation is estimating complete k­fold cross­validation using a single split of the data into the folds. Repeat­ing cross­validation multiple times using different splits into folds provides a better Monte­Carlo estimate to the complete cross­validation at an added cost. In strati­fied cross­validation, the folds are stratified so that they contain approximately the same proportions of la­bels as the original dataset.

No comments: