Transformer
Short Definition
A Transformer is a neural network architecture that models sequences with self-attention instead of recurrence, allowing all tokens to be processed in parallel and long-range dependencies to be captured directly.
Full Definition
The Transformer architecture revolutionized natural language processing by replacing recurrent neural networks with self-attention mechanisms. Unlike RNNs, which process sequences step by step, Transformers process all tokens simultaneously, enabling massive parallelization during training. The architecture consists of an encoder and a decoder, each made up of multiple layers containing multi-head attention and feed-forward networks. This design allows the model to capture long-range dependencies in text far more effectively than previous architectures. Transformers are the backbone of models like GPT, BERT, T5, and LLaMA, and have been extended beyond NLP to computer vision, audio, and multimodal tasks. The original paper, "Attention Is All You Need" (Vaswani et al., 2017), has become one of the most cited works in AI history, fundamentally changing how sequence data is processed across virtually every domain of artificial intelligence.
Technical Explanation
The Transformer uses scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, where d_k is the dimensionality of the keys. Multi-head attention runs this computation in parallel across multiple representation subspaces, allowing the model to attend to information from different positions simultaneously. Positional encodings are added to the input embeddings, since the architecture has no inherent sense of sequence order. The encoder processes input sequences into contextual representations, while the decoder generates output sequences autoregressively, using masked self-attention so that each position can attend only to earlier positions. Layer normalization and residual connections stabilize the training of deep networks.
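The two components above that reduce to short formulas, scaled dot-product attention and sinusoidal positional encoding, can be written as a minimal NumPy sketch. This is an illustrative single-head version, not the original implementation; shapes and the random toy inputs are chosen here for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

def sinusoidal_positional_encoding(seq_len, d_model):
    # Sine on even channels, cosine on odd channels, with geometrically
    # increasing wavelengths (d_model assumed even in this sketch).
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy example: 3 query tokens attending over 4 key/value tokens, d_k = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# out has one row per query; each row of w sums to 1.
```

Multi-head attention repeats this computation with separately learned linear projections of Q, K, and V per head, then concatenates the per-head outputs and applies a final projection.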
Use Cases
- Machine translation and text summarization
- Question answering and conversational assistants
- Code generation and completion
- Computer vision (Vision Transformers), speech recognition, and multimodal models
Advantages
- Training parallelizes across all tokens, unlike step-by-step RNN processing
- Captures long-range dependencies through direct attention between any pair of positions
- Scales effectively with data and compute, enabling large pretrained models
Disadvantages
- Self-attention costs quadratic time and memory in sequence length
- Requires large datasets and substantial compute to train well
- Has no inherent notion of order and depends on positional encodings
Schema Type
Featured Snippet Candidate
Difficulty Level