Data Augmentation
Short Definition
Full Definition
Data augmentation has become an indispensable technique in modern machine learning, particularly in computer vision and natural language processing. The core principle is simple: by applying various transformations to existing training samples, you create new, slightly different examples that help the model learn more robust and generalizable representations. In computer vision, common augmentations include random cropping, horizontal flipping, rotation, color jittering, scaling, and more advanced techniques like cutout (randomly masking rectangular regions), mixup (blending two images and their labels), and CutMix (cutting and pasting patches between images). These transformations teach the model to be invariant to irrelevant variations in the input.

In NLP, augmentation techniques include synonym replacement, random insertion, random swap, random deletion, back-translation (translating to another language and back), and paraphrasing using language models. For tabular data, techniques like SMOTE generate synthetic minority class samples to address class imbalance.

The impact of data augmentation on model performance is often dramatic: it can be equivalent to having several times more training data. Modern approaches include learned augmentation policies (AutoAugment, RandAugment) that automatically discover optimal augmentation strategies, and generative augmentation using diffusion models or GANs to create entirely new synthetic training examples.
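To make the core principle concrete, here is a minimal sketch of two classic image augmentations (random horizontal flip and random crop) using plain NumPy on a toy array standing in for an image; the function names and the fixed seed are illustrative choices, not part of any particular library's API:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def random_flip(img, p=0.5):
    """Flip the image left-right with probability p (img is H x W x C)."""
    return img[:, ::-1] if rng.random() < p else img

def random_crop(img, size):
    """Cut a random size x size window out of the image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

img = rng.random((32, 32, 3))            # toy 32x32 RGB "image"
aug = random_flip(random_crop(img, 28))  # one augmented view per call
print(aug.shape)                         # (28, 28, 3)
```

Each call produces a different view of the same underlying sample, which is what lets one labeled image stand in for many during training.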
Technical Explanation
Geometric transformations for images: a transform T(x) applies random rotation, flipping, cropping, or scaling; color augmentations modify brightness, contrast, saturation, and hue. Mixup creates virtual training examples: x_new = lambda*x_i + (1-lambda)*x_j and y_new = lambda*y_i + (1-lambda)*y_j, where lambda is drawn from Beta(alpha, alpha). CutMix replaces a rectangular region of one image with a patch from another, mixing the labels in proportion to the patch area. RandAugment simplifies augmentation policy search to just two parameters: the number of transformations N and a global magnitude M. For NLP, back-translation (e.g., English -> French -> English) produces paraphrases of the original text. SMOTE generates synthetic minority-class samples for tabular data by interpolating between a sample and one of its nearest minority-class neighbors in feature space.
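The mixup formula and the SMOTE interpolation above can both be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation of either paper: `mixup` blends two samples and their one-hot labels with lambda ~ Beta(alpha, alpha), and `smote_sample` (a hypothetical helper name) builds one synthetic point by interpolating a minority-class sample toward one of its k nearest minority-class neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x_i, y_i, x_j, y_j, alpha=0.2):
    """Blend two samples and their one-hot labels; lam ~ Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x_new = lam * x_i + (1 - lam) * x_j
    y_new = lam * y_i + (1 - lam) * y_j
    return x_new, y_new

def smote_sample(x, k=5):
    """SMOTE-style synthetic point: pick a random sample from the minority-class
    matrix x (n x d), then interpolate toward one of its k nearest neighbors."""
    i = rng.integers(len(x))
    dists = np.linalg.norm(x - x[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]   # skip index i itself
    j = rng.choice(neighbors)
    gap = rng.random()                       # random position on the segment
    return x[i] + gap * (x[j] - x[i])

x = rng.random((20, 2))   # toy minority-class feature matrix
y = np.eye(2)             # two one-hot labels
xm, ym = mixup(x[0], y[0], x[1], y[1], alpha=0.4)
syn = smote_sample(x, k=5)
```

Because mixup labels are a convex combination of one-hot vectors, `ym` still sums to 1, and the SMOTE point always lies on the segment between the two real samples, so it stays inside the region the minority class already occupies.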
Use Cases
Advantages
Disadvantages
Schema Type
Featured Snippet Candidate
Difficulty Level