Cross-Validation
Cross-validation is a resampling technique that is often used for model selection and for estimating the prediction error of a classification or regression function. We have already seen that squared error is a natural measure of prediction error for regression functions:
$\mathrm{PE} = E\,(y - \hat{f})^2$
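In practice PE is unknown and has to be estimated from data. As a minimal sketch (assuming Python with NumPy; the function name squared_error is only illustrative), the natural sample analogue is the mean squared error of the predictions:

```python
import numpy as np

def squared_error(y, y_hat):
    """Sample analogue of PE = E(y - f_hat)^2: the mean squared
    error of the predictions y_hat for the observed responses y."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)
```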
Estimating the prediction error on the same data that were used for model estimation tends to give downward-biased estimates, because the parameter estimates are "fine-tuned" to the peculiarities of the sample. For very flexible methods, e.g. neural networks or tree-based models, the error on the training sample can usually be driven close to zero. The true error of such a model will usually be much higher, however: the model has been "overfitted" to the training sample. One way of dealing with this problem is to include a penalty term for model complexity (e.g. AIC, BIC). An alternative is to divide the available data into a training sample and a test sample and to estimate the prediction error on the test sample. If the available sample is rather small, this is wasteful, because the observations in the test sample cannot be used for model estimation. Cross-validation ensures that every data point is used for training as well as for testing. The general K-fold cross-validation procedure works as follows (a code sketch is given after the list):
1. Split the data into K roughly equal-sized parts.
2. For the kth part, estimate the model on the other K − 1 parts, and calculate its prediction error on the kth part of the data.
3. Do the above for k = 1, 2, ..., K and combine the K estimates of prediction error.
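As a rough sketch of these three steps (assuming Python with NumPy arrays; the names kfold_cv_mse, fit and predict are illustrative placeholders, not prescribed by the procedure above), K-fold cross-validation for a regression function could look like this:

```python
import numpy as np

def kfold_cv_mse(X, y, fit, predict, K=5, seed=0):
    """K-fold cross-validation estimate of the squared prediction error.

    fit(X_train, y_train) should return a fitted model;
    predict(model, X_test) should return predictions for X_test.
    Both are user-supplied callables; X and y are NumPy arrays.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                  # shuffle before splitting
    folds = np.array_split(idx, K)            # step 1: K roughly equal parts
    fold_errors = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train_idx], y[train_idx])   # step 2: fit on the other K-1 parts
        y_hat = predict(model, X[test_idx])       # ...and predict the kth part
        fold_errors.append(np.mean((y[test_idx] - y_hat) ** 2))
    return np.mean(fold_errors)               # step 3: combine the K estimates
```

For example, with a cubic polynomial fitted by np.polyfit (x and y being 1-D data arrays):

```python
fit = lambda X, y: np.polyfit(X, y, deg=3)
predict = lambda coef, X: np.polyval(coef, X)
cv_mse = kfold_cv_mse(x, y, fit, predict, K=10)
```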
If K = n, we have the so-called leave-one-out cross-validation: one observation is left out at a time, and $\hat{f}$ is computed on the remaining n − 1 observations.
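In terms of the illustrative sketch above, leave-one-out cross-validation simply corresponds to setting K equal to the sample size, e.g. kfold_cv_mse(X, y, fit, predict, K=len(y)).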
Now let k(i) be the part containing observation i. Denote by $\hat{f}^{\,-k(i)}_i$ the value predicted for observation i by the model estimated from the data with the k(i)th part removed.
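With this notation, the cross-validation estimate of the prediction error can be written as follows (this is the standard squared-error form, stated here for completeness):

$$\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}^{\,-k(i)}_i \right)^2$$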