Ensemble Methods
Short Definition
Full Definition
Ensemble methods are among the most practical and powerful strategies in machine learning, consistently delivering top performance across a wide range of tasks. The fundamental insight is that combining multiple diverse models averages out their individual errors: depending on the technique, an ensemble can reduce variance (bagging), bias (boosting), or both, yielding more accurate and stable predictions.

There are three main families of ensemble methods. Bagging (Bootstrap Aggregating) trains multiple models on random subsets of the data and averages their predictions; Random Forest is the best-known example. Boosting trains models sequentially, with each new model focusing on the mistakes of the previous ones; Gradient Boosting (XGBoost, LightGBM, CatBoost) and AdaBoost are the key examples. Stacking trains several different algorithms and uses a meta-learner to combine their predictions optimally.

The success of ensemble methods is well established both theoretically and empirically. They consistently dominate machine learning competitions: most winning Kaggle solutions use some form of ensembling, and the Netflix Prize, one of the most famous ML competitions, was won by a blend of more than a hundred models. In production systems, ensembles are the tool of choice when prediction accuracy is paramount and the extra computational cost is acceptable. Even in the age of deep learning, ensemble techniques remain relevant: test-time augmentation, model averaging, and mixture-of-experts are all ensemble strategies used with neural networks.
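The three families above can be sketched in a few lines with scikit-learn. This is a minimal illustration on a synthetic dataset, not a tuned setup; it assumes scikit-learn is installed, and the model parameters shown are arbitrary defaults:

```python
# Sketch of the three ensemble families on a synthetic dataset (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,      # bagging: bootstrap samples + averaged trees
    GradientBoostingClassifier,  # boosting: sequential error correction
    StackingClassifier,          # stacking: meta-learner over base models
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bagging = RandomForestClassifier(n_estimators=100, random_state=0)
boosting = GradientBoostingClassifier(random_state=0)
stacking = StackingClassifier(
    estimators=[("rf", bagging), ("gb", boosting)],
    final_estimator=LogisticRegression(),  # simple linear meta-learner
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```

Note that `StackingClassifier` trains the meta-learner on cross-validated (out-of-fold) predictions of the base estimators by default, which is the standard guard against the meta-learner overfitting to the base models' training-set outputs.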
Technical Explanation
Bagging reduces variance by averaging predictions: f_bag(x) = (1/B) * sum_{b=1..B} f_b(x), where each f_b is trained on an independent bootstrap sample of the data.
Boosting reduces bias by sequential correction: F_m(x) = F_{m-1}(x) + eta * h_m(x), where h_m is fit to the residuals (or gradients) of the current model and eta is the learning rate.
AdaBoost increases the weights of misclassified samples: w_i(m+1) = w_i(m) * exp(alpha_m * I(y_i != h_m(x_i))), so later learners concentrate on the hard examples.
Stacking uses a meta-learner: f_stack(x) = g(f_1(x), f_2(x), ..., f_K(x)), where g is typically a simple linear model or logistic regression trained on out-of-fold predictions to avoid overfitting.
The bias-variance decomposition explains why averaging helps: for M uncorrelated models each with error variance sigma^2, the equally weighted average has variance sigma^2 / M; correlation between the models raises this floor, which is why diversity among base models matters.
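The sigma^2 / M variance claim can be checked numerically. The simulation below (a NumPy sketch, not any particular model) draws M uncorrelated zero-mean prediction errors with variance sigma^2 and shows that their equally weighted average has variance close to sigma^2 / M:

```python
import numpy as np

rng = np.random.default_rng(0)
M, sigma = 25, 2.0  # 25 uncorrelated "models", each with error variance sigma^2 = 4

# Each row is one trial; each column is one model's (zero-mean) prediction error.
errors = rng.normal(0.0, sigma, size=(100_000, M))
ensemble_error = errors.mean(axis=1)  # equally weighted average, as in bagging

print(np.var(errors[:, 0]))    # close to sigma^2 = 4.0
print(np.var(ensemble_error))  # close to sigma^2 / M = 0.16
```

In practice bootstrap samples overlap, so bagged models are positively correlated and the realized variance reduction is smaller than this idealized 1/M factor.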
Use Cases
Advantages
Disadvantages
Schema Type
Featured Snippet Candidate
Difficulty Level