Gradient Descent

Short Definition

Gradient descent is the fundamental optimization algorithm used to train machine learning models by iteratively adjusting parameters in the direction that minimizes the loss function. It computes the gradient of the loss with respect to each parameter and takes steps proportional to the negative gradient.

Full Definition

Gradient descent is the workhorse optimization algorithm of machine learning, responsible for training virtually every neural network and many other model types. The algorithm works by iteratively computing the gradient (the vector of partial derivatives) of the loss function with respect to the model parameters, then updating the parameters in the direction that reduces the loss. Imagine standing on a hilly landscape in dense fog — gradient descent is like always taking a step in the steepest downhill direction to find the valley. The concept dates back to Cauchy in 1847, but its application to machine learning became practical with the development of backpropagation.

There are three main variants: batch gradient descent (uses the entire dataset for each update), stochastic gradient descent or SGD (uses a single random sample), and mini-batch gradient descent (uses a small random subset). Mini-batch SGD is the most commonly used in practice because it balances computational efficiency with stable convergence.

The learning rate is the most critical hyperparameter, controlling the step size. Too large a learning rate causes divergence, while too small a rate leads to extremely slow convergence. Modern optimizers like Adam, AdaGrad, and RMSprop adaptively adjust learning rates per parameter, improving convergence on complex loss landscapes. Understanding gradient descent is essential for anyone working with machine learning, as it underlies the training of everything from simple linear regression to the largest language models with trillions of parameters.
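The mini-batch SGD variant described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the synthetic data (y = 2x + 1), learning rate, batch size, and step count are all arbitrary choices for the example.

```python
import numpy as np

# Minimal sketch of mini-batch SGD fitting y = 2x + 1 with squared-error loss.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + 1.0

w, b = 0.0, 0.0      # parameters to learn
alpha = 0.1          # learning rate (step size)
batch_size = 32

for step in range(500):
    # Draw a random mini-batch
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    err = (w * xb + b) - yb
    # Gradients of mean squared error w.r.t. w and b,
    # estimated from the mini-batch only
    grad_w = 2.0 * np.mean(err * xb)
    grad_b = 2.0 * np.mean(err)
    # Step in the direction of the negative gradient
    w -= alpha * grad_w
    b -= alpha * grad_b

print(round(w, 2), round(b, 2))  # w approaches 2.0, b approaches 1.0
```

Because each update uses only 32 of the 200 samples, the gradient estimate is noisy, yet the parameters still converge to the true values — this is the efficiency/stability trade-off the mini-batch variant exploits.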

Technical Explanation

The gradient descent update rule is: w_new = w_old - alpha * grad L(w), where alpha is the learning rate and L is the loss function. For mini-batch SGD, the gradient is estimated from a random subset B: grad L(w) ≈ (1/|B|) * sum_{i in B} grad L_i(w). Momentum adds a velocity term: v = beta * v - alpha * grad L(w); w = w + v.

The Adam optimizer combines momentum with adaptive per-parameter learning rates: m_t = beta1 * m_{t-1} + (1 - beta1) * g_t (first moment) and v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 (second moment). Because m and v are initialized at zero, they are biased toward zero early in training, so updates use bias-corrected estimates m_hat_t = m_t / (1 - beta1^t) and v_hat_t = v_t / (1 - beta2^t), giving w_t = w_{t-1} - alpha * m_hat_t / (sqrt(v_hat_t) + epsilon).

Learning rate scheduling strategies include step decay, cosine annealing, and warm-up followed by decay. Gradient clipping prevents exploding gradients by capping the gradient norm. Convergence guarantees exist for convex functions, but neural network loss surfaces are highly non-convex.

Use Cases

Training neural networks | Linear and logistic regression | Support vector machines | Recommendation system optimization | Reinforcement learning policy updates | Hyperparameter optimization | Scientific model fitting | Financial model calibration

Advantages

Simple and intuitive algorithm | Works with any differentiable loss function | Scales to billions of parameters | Well-understood convergence properties for convex functions | Many efficient variants available | Foundation of deep learning training

Disadvantages

Can converge to local minima in non-convex landscapes | Sensitive to learning rate selection | Slow convergence on ill-conditioned problems | Saddle points can stall optimization | Requires differentiable loss functions | Mini-batch noise can hinder fine convergence

Schema Type

DefinedTerm

Difficulty Level

Beginner