Retrieval-Augmented Generation
Short Definition
Retrieval-Augmented Generation (RAG) is a technique that grounds a language model's output in external knowledge by retrieving relevant documents and adding them to the prompt before generation.
Full Definition
Retrieval-Augmented Generation has become one of the most important practical techniques in applied AI, addressing the fundamental limitation that language models can only access knowledge learned during training. RAG combines the generative capabilities of large language models with the precision of information retrieval systems, creating a powerful hybrid approach. The concept was introduced by Lewis et al. at Facebook AI Research in 2020 and has since become the standard approach for building knowledge-intensive AI applications.

The RAG pipeline works in three stages: first, the user's query is converted into an embedding and used to search a knowledge base for relevant documents; second, the retrieved documents are added to the prompt as context; third, the language model generates a response grounded in this retrieved information.

This approach offers several critical advantages over using LLMs alone. It enables access to information that was not in the training data, including private organizational knowledge, real-time data, and domain-specific documents. It reduces hallucinations by grounding responses in verified sources. It allows the knowledge base to be updated without retraining the model. Finally, it enables attribution and citation of sources, improving trustworthiness.

RAG has become the foundation for enterprise AI assistants, customer support systems, research tools, and any application where accuracy and currency of information are critical. Advanced RAG techniques include query rewriting, hypothetical document embeddings, multi-step retrieval, re-ranking, and agentic RAG, where the model decides when and what to retrieve.
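The three-stage pipeline can be sketched end to end in a few lines. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, the list scan stands in for a vector database, and the final LLM call is left as a prompt string rather than an API request.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a term-frequency bag of words.
    Real systems use a learned embedding model instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Stage 1: indexing -- embed each document once and store the vectors.
docs = [
    "RAG combines retrieval with generation",
    "Embeddings map text to vectors",
    "The cat sat on the mat",
]
index = [(d, embed(d)) for d in docs]

# Stage 2: retrieval -- embed the query, keep the k most similar documents.
def retrieve(query, k=2):
    qv = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(qv, pair[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

# Stage 3: generation -- prepend retrieved chunks to the prompt,
# which would then be sent to the LLM.
def build_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("What does RAG combine?")
```

In a production system, `retrieve` would query an approximate-nearest-neighbor index rather than scanning every document, but the shape of the pipeline is the same.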
Technical Explanation
The RAG pipeline:
1) Indexing: documents are chunked, embedded using a model such as text-embedding-3-large, and stored in a vector database.
2) Retrieval: the user query is embedded and its k nearest neighbors are found using cosine similarity, typically via approximate nearest neighbor (ANN) search.
3) Generation: the retrieved chunks are prepended to the prompt as context for the LLM.

Key hyperparameters include chunk size (typically 256-1024 tokens), chunk overlap, the number of retrieved documents k, and the similarity threshold.

Advanced techniques include:
- Hybrid search: combining vector similarity with keyword (BM25) search.
- Re-ranking: using a cross-encoder to score the relevance of retrieved chunks.
- Query expansion: generating multiple query variants.
- HyDE: generating a hypothetical answer to use as the retrieval query.
- Multi-hop retrieval: iteratively retrieving and reasoning.
- Parent document retrieval: embedding small chunks but retrieving larger parent documents for context.
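The chunk-size and chunk-overlap hyperparameters can be illustrated with a fixed-size chunker. A minimal sketch in plain Python, where a "token" is just a whitespace-split word and the sizes are kept tiny for readability (production systems chunk by model tokens, with sizes around 256-1024):

```python
def chunk(tokens, size=8, overlap=2):
    """Split a token sequence into fixed-size chunks that overlap by
    `overlap` tokens, so text cut at a chunk boundary also appears
    intact near the start of the next chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):  # last chunk reached the end
            break
    return chunks

words = ("retrieval augmented generation grounds language models "
         "in external knowledge retrieved at query time").split()
pieces = chunk(words, size=6, overlap=2)  # 13 words -> 3 overlapping chunks
```

Larger chunks carry more context per retrieved item but dilute the embedding; more overlap reduces boundary losses at the cost of index size.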
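Hybrid search needs a way to merge the keyword (BM25) ranking and the vector ranking into one list; reciprocal rank fusion (RRF) is a common choice because it uses only rank positions, not the incomparable raw scores. A minimal sketch, with hypothetical document IDs and the constant k=60 commonly used for RRF:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each ranked list contributes 1/(k + rank)
    per document, so items ranked highly by several retrievers rise to
    the top of the fused list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a hypothetical BM25 ranking with a hypothetical vector-search ranking.
bm25_hits = ["doc_a", "doc_b", "doc_d"]
vector_hits = ["doc_b", "doc_c", "doc_a"]
fused = rrf([bm25_hits, vector_hits])
```

Here doc_b wins because both retrievers rank it highly, even though neither placed it first; a cross-encoder re-ranker could then rescore the fused top results.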
Use Cases
Advantages
Disadvantages
Schema Type
Featured Snippet Candidate
Difficulty Level