Cross-Validation

Short Definition

Cross-validation is a statistical technique used to evaluate machine learning models by partitioning data into multiple subsets, training on some and testing on others. It provides a robust estimate of model performance on unseen data and helps prevent overfitting during model selection.

Full Definition

Cross-validation is one of the most important techniques in machine learning for reliably estimating how well a model will perform on new, unseen data. The fundamental idea is simple: instead of evaluating a model on a single train-test split, you evaluate it on multiple different splits and average the results.

The most common form is k-fold cross-validation, in which the dataset is divided into k roughly equal parts (folds). The model is trained k times, each time using k-1 folds for training and the remaining fold for validation; the final performance estimate is the average across all k iterations. This gives a far more reliable estimate of generalization performance than a single train-test split, which can be optimistic or pessimistic depending on which examples land in which set. Common choices for k are 5 and 10, with 10-fold being the most widely used.

Several variants exist. Leave-one-out cross-validation (LOOCV) sets k equal to the number of data points, yielding a nearly unbiased estimate at high computational cost. Stratified cross-validation ensures each fold preserves the class distribution of the full dataset, which is crucial for imbalanced data. Cross-validation underpins hyperparameter tuning, model selection, and fair comparison of different algorithms.
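The k-fold procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the "model" here is a trivial predict-the-training-mean baseline standing in for a real learner, and the function names (`k_fold_indices`, `cross_val_mse`) are made up for this example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(ys, k=5):
    """k-fold CV of a trivial 'predict the training mean' model.

    Each fold is held out once; the model is 'trained' (the mean is
    computed) on the other k-1 folds, then scored on the held-out fold.
    """
    folds = k_fold_indices(len(ys), k)
    scores = []
    for test_idx in folds:
        train_idx = [j for f in folds if f is not test_idx for j in f]
        mean_y = sum(ys[j] for j in train_idx) / len(train_idx)   # "training"
        mse = sum((ys[j] - mean_y) ** 2 for j in test_idx) / len(test_idx)
        scores.append(mse)
    return sum(scores) / len(scores)   # average over the k folds
```

In practice a library routine such as scikit-learn's `KFold` or `cross_val_score` would replace this by-hand loop, but the structure is the same: partition, hold one fold out, train on the rest, average the scores.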

Technical Explanation

In k-fold CV, the dataset D is split into k disjoint subsets D_1, …, D_k. For each fold i, the model is trained on D \ D_i and evaluated on D_i. The CV estimate is: CV(k) = (1/k) * sum_{i=1}^{k} L(f_{-i}, D_i), where f_{-i} is the model trained without fold i and L is the evaluation loss. For hyperparameter optimization, nested cross-validation uses an outer loop for performance estimation and an inner loop for hyperparameter selection, preventing information leakage from the evaluation folds into model choice. Time series data requires special handling, such as walk-forward validation, so that models are always evaluated on data that comes after their training window. Increasing k reduces the bias of the CV estimate, but computational cost grows linearly with k.
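The walk-forward scheme mentioned above can be sketched as an expanding-window split generator. This is an illustrative sketch under simple assumptions (observations indexed 0..n-1 in time order, equal-sized test blocks); the function name and parameters are hypothetical, not a library API.

```python
def walk_forward_splits(n, n_splits, min_train=1):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Each test block strictly follows all of its training data, so the
    model never sees the future, and the training window expands as
    the splits advance.
    """
    test_size = (n - min_train) // n_splits
    for i in range(n_splits):
        train_end = min_train + i * test_size
        test_end = min(train_end + test_size, n)
        yield list(range(train_end)), list(range(train_end, test_end))
```

Unlike standard k-fold, every observation in a test block has a later timestamp than every training observation, which is what "respecting temporal ordering" requires; scikit-learn's `TimeSeriesSplit` implements the same idea.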

Use Cases

Model selection and comparison | Hyperparameter tuning | Estimating generalization error | Feature selection validation | Ensemble method training | Algorithm benchmarking | Clinical trial analysis | Financial model validation

Advantages

More reliable than single train-test split | Reduces variance of performance estimate | Uses all data for both training and testing | Essential for small datasets | Standard practice in research | Helps detect overfitting early

Disadvantages

Computationally expensive for large datasets | Requires k times more training | Can be pessimistic for very small datasets | Assumes data points are independent and identically distributed | Not directly applicable to time series without modification | Variance between folds can be high

Schema Type

DefinedTerm

Difficulty Level

Beginner