Diffusion Model
Short Definition
A diffusion model is a generative model that creates data (such as images) by learning to reverse a gradual noising process: it starts from random noise and iteratively denoises it into a coherent sample.
Full Definition
Diffusion models have rapidly become the dominant approach for generative image and media synthesis, surpassing GANs in quality, diversity, and training stability. The concept is elegantly simple: during training, the model learns to reverse a diffusion process that gradually adds Gaussian noise to data until it becomes pure noise. During generation, the model starts with random noise and iteratively denoises it, step by step, to produce a coherent output. This approach was formalized by Sohl-Dickstein et al. in 2015 but gained practical significance with the Denoising Diffusion Probabilistic Model (DDPM) by Ho et al. in 2020. The breakthrough into mainstream awareness came with DALL-E 2, Stable Diffusion, and Midjourney in 2022, which demonstrated unprecedented image generation quality from text descriptions. Diffusion models can generate photorealistic images, artwork in any style, 3D objects, music, speech, video, and even molecular structures for drug design. The architecture typically uses a U-Net or Transformer backbone with cross-attention for conditioning on text or other inputs. Latent diffusion models (like Stable Diffusion) operate in a compressed latent space rather than pixel space, dramatically reducing computational requirements. Current research focuses on faster sampling methods, higher resolution generation, video generation, 3D content creation, and improving controllability and consistency.
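The forward noising process described above can be sketched in a few lines. This is a toy illustration only: the linear beta schedule and T = 1000 steps are common DDPM-style defaults, the 1-D "image" is a stand-in for real data, and no trained network is involved.

```python
import numpy as np

# Toy sketch of the forward diffusion process on a 1-D "signal".
# Assumptions: linear beta schedule, T = 1000 steps (common DDPM-style
# defaults); everything here is illustrative, not a real model.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # noise variance added per step
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal retained by step t

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form, returning (x_t, noise)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, eps

x0 = np.ones(8)                          # a trivial "clean" data point
x_early, _ = q_sample(x0, 10)            # early step: mostly signal
x_late, _ = q_sample(x0, T - 1)          # final step: essentially pure noise
print(alphas_bar[10], alphas_bar[T - 1]) # signal fraction shrinks toward 0
```

Training a real model amounts to sampling (x_t, eps) pairs like this and teaching a network to predict eps from x_t and t; generation then runs the learned denoiser in reverse from pure noise.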
Technical Explanation
The forward diffusion process adds Gaussian noise over T timesteps: q(x_t|x_{t-1}) = N(x_t; sqrt(1-beta_t)*x_{t-1}, beta_t*I). The reverse process learns to denoise: p_theta(x_{t-1}|x_t) = N(x_{t-1}; mu_theta(x_t, t), sigma_t^2*I). Training minimizes the simplified objective L = E[||epsilon - epsilon_theta(x_t, t)||^2], i.e., predicting the noise added at each step. Classifier-free guidance improves conditional generation by combining conditional and unconditional predictions: epsilon_guided = epsilon_unconditional + w*(epsilon_conditional - epsilon_unconditional), where w is the guidance scale. DDIM (Denoising Diffusion Implicit Models) enables deterministic sampling with fewer steps. Latent diffusion encodes images to a latent space using a VAE, runs diffusion there, then decodes. Flow matching and consistency models offer alternative training objectives for faster generation.
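The classifier-free guidance formula above is a single vector operation. In the sketch below, eps_uncond and eps_cond stand in for the network's two noise predictions (in practice produced by one batched forward pass with and without the conditioning); the numeric values are made up purely for illustration.

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance:
    epsilon_guided = epsilon_unconditional + w * (epsilon_conditional - epsilon_unconditional).
    w = 0 ignores the condition, w = 1 recovers the conditional prediction,
    and w > 1 extrapolates past it, trading diversity for prompt adherence.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

# Placeholder noise predictions (illustrative values, not real model output).
eps_uncond = np.array([0.10, -0.20, 0.05])
eps_cond = np.array([0.30, -0.10, 0.00])

print(guided_eps(eps_uncond, eps_cond, 1.0))  # w = 1: equals eps_cond
print(guided_eps(eps_uncond, eps_cond, 7.5))  # typical strong guidance scale
```

At each sampling step, the guided epsilon replaces the raw prediction before the denoising update is applied.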
Use Cases
Text-to-image generation (DALL-E 2, Stable Diffusion, Midjourney), artwork in arbitrary styles, 3D object creation, music and speech synthesis, video generation, and molecular structure generation for drug design.
Advantages
Higher sample quality, greater output diversity, and more stable training than GANs; flexible conditioning on text and other inputs via cross-attention; latent-space variants (like Stable Diffusion) dramatically reduce computational requirements.
Disadvantages
Generation requires many iterative denoising steps, making sampling slower than single-pass generators; pixel-space diffusion is computationally expensive at high resolutions; controllability and consistency remain open research problems.
Schema Type
Featured Snippet Candidate
Difficulty Level