
Cross Validation in machine learning


Cross Validation is a basic method for estimating the predictive performance of a supervised learning model on unseen data, i.e. for assessing the generalization power of a model.

Why we need Cross Validation

The primary goal of Cross Validation is to avoid over-fitting. Empirical risk decreases as the complexity of the learning model rises, but the generalization power decreases too; that is, the true performance on unknown data may get worse. So empirical risk alone is not an effective way to evaluate a practical learning model. We need to estimate the actual risk on a large independent data set that does not overlap with the training data. Sometimes such test data is not available at hand, or is costly to collect. Cross Validation then provides a way to test the model during the training phase.

How to perform Cross Validation

The basic idea of Cross Validation is to split a sample of data into complementary subsets, perform the analysis on one subset (called the training set), and validate the analysis on the other subset (called the validation set). Different Cross Validation methods are defined according to how the data is partitioned.

  • Leave-p-out cross validation

As the name suggests, leave-p-out cross validation uses p observations as the validation set and the remaining observations as the training set. This is repeated for every way of cutting the original sample into a validation set of p observations and a training set, so with n observations the model has to be learned and validated C(n, p) times.
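As a rough illustration of the cost, the short sketch below counts the number of train/validate rounds leave-p-out requires; the sample size and p are made-up values.

    from math import comb

    n = 20   # hypothetical number of observations
    p = 2    # hypothetical size of the validation set

    # Leave-p-out trains and validates once per way of choosing
    # p validation observations out of n, i.e. C(n, p) times.
    rounds = comb(n, p)
    print(rounds)  # 190 rounds even for this tiny example

The count grows combinatorially, which is why leave-p-out is rarely practical beyond small p and small samples.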

  • k-fold cross validation

In k-fold cross validation, the original sample is randomly partitioned into k equal-sized sub-samples. Of the k sub-samples, a single sub-sample is retained as the validation data for testing the model, and the remaining k-1 sub-samples are used as training data. The cross validation process is then repeated k times, with each of the k sub-samples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate (a code sketch follows this list).

In stratified k-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the
same proportions of the two types of class labels.

  • Repeated random sub-sampling validation

This method randomly splits the data set into training and validation data; the split is repeated several times and the results are averaged. Unlike k-fold cross validation, some observations may never be selected for validation while others may be selected more than once.
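As a minimal sketch of these partitioning schemes, the snippet below uses scikit-learn's splitter classes on a toy array; this is one convenient implementation, not the only way to produce the folds.

    import numpy as np
    from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit

    X = np.arange(20).reshape(10, 2)               # 10 toy samples, 2 features
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # dichotomous labels

    # k-fold: every sample appears in the validation set exactly once.
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        print("k-fold val:", val_idx)

    # Stratified k-fold: each fold keeps roughly the same class proportions.
    for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
        print("stratified val:", y[val_idx])

    # Repeated random sub-sampling: independent random train/validation splits.
    for train_idx, val_idx in ShuffleSplit(n_splits=3, test_size=0.3, random_state=0).split(X):
        print("shuffle-split val:", val_idx)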

Practical application of Cross Validation

Cross Validation is commonly used in model selection, parameter tuning and feature selection.

  • Model Selection

Cross Validation can be used to compare the performance of different predictive modeling procedures. For example, both KNN and SVM can be used for classification problems. Using cross validation, we can objectively compare the two methods in terms of their respective average scores over the folds.
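A minimal sketch of such a comparison, assuming scikit-learn and a stand-in labelled dataset; accuracy is used here as the metric, but average precision or any other scorer could be substituted.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration

    # Evaluate each classifier with the same 5-fold cross validation.
    for name, model in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                        ("SVM", SVC(kernel="rbf", C=1.0))]:
        scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
        print(name, scores.mean())   # average score across folds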

  • Parameter Tuning

Usually there are one or more hyper-parameters left to set manually, such as K for KNN or the regularization parameter C for SVM. We can tune these parameters by grid search, performing Cross Validation for each parameter setting. The setting with the best predictive performance is then used to build the final model.
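A minimal grid-search sketch along these lines, assuming scikit-learn; the parameter grid values below are illustrative, not recommendations.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)   # stand-in dataset

    # Each (C, gamma) combination is scored with 5-fold cross validation.
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_)   # setting with the best mean CV score
    print(search.best_score_)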

There are two common methods for hyper-parameter tuning:

    a) nested cross validation

The inner cross validation tunes the hyper-parameters using grid search, and the outer cross validation estimates predictive performance. Each inner cross validation run yields an optimal value for the parameter; if the model is stable, the optimal parameters from the different inner runs will be close to each other (a sketch is given at the end of this subsection).

    b) prior hyper-parameter setting by experience

Hyper-parameters are fixed by experience, and then we perform the cross validation to get an unbiased estimate of the performance of a possibly sub-optimal model.

For more details on parameter tuning, please refer to [4].
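One possible sketch of method a), assuming scikit-learn: because GridSearchCV behaves like an ordinary estimator, it can itself be passed to cross_val_score, so the outer folds score the whole "tune then fit" procedure. The grid and fold counts are illustrative.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)   # stand-in dataset

    # Inner loop: 3-fold grid search picks C and gamma on each training portion.
    inner = GridSearchCV(SVC(kernel="rbf"),
                         {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                         cv=3)

    # Outer loop: 5-fold cross validation estimates the performance of the
    # whole "tune then fit" procedure, which is less optimistic than simply
    # reporting the inner search's own best score.
    outer_scores = cross_val_score(inner, X, y, cv=5)
    print(outer_scores.mean())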

  • Feature Selection

Split the data into k equal-sized folds

for i = 1:k

    select the i-th fold as the validation set, the rest as the training set

    find the TOP N informative features (e.g. by correlation with the response label) using only the training set

    train the model with the selected features on the training set

    evaluate the result on the validation set

end

calculate the average error rate over the folds, and treat the mean error rate as the estimate of the model's performance
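A runnable sketch of the loop above, assuming scikit-learn; SelectKBest with a univariate score stands in for "TOP N informative features", and logistic regression is an arbitrary choice of model.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
    errors = []

    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        # Select the top-N features using ONLY the training fold,
        # so no information from the validation fold leaks in.
        selector = SelectKBest(f_classif, k=10).fit(X[train_idx], y[train_idx])
        X_train, X_val = selector.transform(X[train_idx]), selector.transform(X[val_idx])

        model = LogisticRegression(max_iter=5000).fit(X_train, y[train_idx])
        errors.append(1 - model.score(X_val, y[val_idx]))   # error rate on this fold

    # The mean error rate across folds estimates the performance of the whole
    # "select features, then fit" procedure.
    print(np.mean(errors))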


Note: the cross validation iteration must be the outermost loop; any supervised feature selection (using correlation with the class labels) performed outside of the cross validation may result in over-fitting.

Over-fitting is a potential problem whenever you minimize any statistic based on a finite sample of data, and cross validation is no exception [5].

It is best to view cross validation as a procedure for assessing the performance of the model-fitting procedure, rather than of the model itself. To build the final model, you can apply the same procedure used in each fold of the cross validation to the entire data set.

For more details on how to perform feature selection using cross validation, please refer to [3].

Limitations

The problem with using Cross Validation is that the training and test data are not independent samples (they share data), which means the estimate of the variance of the performance estimate and of the hyper-parameters is likely to be biased (i.e. smaller than it would be for genuinely independent samples of data in each fold).

Bootstrapping can be used as an alternative to repeated Cross Validation.
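As a hedged sketch of that alternative, the snippet below repeatedly resamples the data with replacement, fits on the bootstrap sample and evaluates on the out-of-bag observations; the number of repetitions and the choice of model are arbitrary.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)   # stand-in dataset
    rng = np.random.default_rng(0)
    scores = []

    for _ in range(100):   # arbitrary number of bootstrap repetitions
        # Sample n observations with replacement; unsampled rows are "out of bag".
        boot = rng.integers(0, len(X), size=len(X))
        oob = np.setdiff1d(np.arange(len(X)), boot)
        if len(oob) == 0:
            continue
        model = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
        scores.append(model.score(X[oob], y[oob]))   # out-of-bag accuracy

    print(np.mean(scores))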

Reference

[1] http://en.wikipedia.org/wiki/Cross-validation_(statistics)

[2] http://stats.stackexchange.com/questions/2306/feature-selection-for-final-model-when-performing-cross-validation-in-machine?rq=1

[3] http://stats.stackexchange.com/questions/27750/feature-selection-and-cross-validation

[4] http://stats.stackexchange.com/questions/34652/grid-search-on-k-fold-cross-validation

[5] http://stats.stackexchange.com/questions/38038/can-i-perform-an-exhaustive-search-with-cross-validation-for-feature-selection
