Transformer

Short Definition

A Transformer is a deep learning architecture based on self-attention mechanisms, introduced in 2017 by Vaswani et al. It forms the foundation of modern large language models and has revolutionized natural language processing and many other AI domains.

Full Definition

The Transformer architecture replaced recurrent neural networks with self-attention mechanisms for sequence modeling. Unlike RNNs, which process sequences step by step, Transformers process all tokens simultaneously, enabling massive parallelization during training. The original architecture consists of an encoder and a decoder, each made up of multiple layers containing multi-head attention and feed-forward networks. This design allows the model to capture long-range dependencies in text far more effectively than previous architectures. Transformers are the backbone of models like GPT, BERT, T5, and LLaMA (in decoder-only, encoder-only, and encoder-decoder variants), and have been extended beyond NLP to computer vision, audio, and multimodal tasks. The original paper, "Attention Is All You Need", has become one of the most cited works in AI history, fundamentally changing how sequence data is processed across virtually every domain of artificial intelligence.

Technical Explanation

The Transformer uses scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension. Multi-head attention runs this in parallel across multiple representation subspaces, allowing the model to attend to information from different positions simultaneously. Because the architecture has no inherent sense of sequence order, positional encodings are added to the input embeddings. The encoder processes input sequences into contextual representations, while the decoder generates output sequences autoregressively. Residual connections and layer normalization stabilize the training of deep networks.
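The attention formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the batched, multi-head implementation used in real models; the function name and toy shapes are chosen for this example only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (num_queries, num_keys) similarity scores
    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted average of value rows

# Toy example: 3 tokens, key/value dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query token
```

The sqrt(d_k) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.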

Use Cases

Machine translation | Text summarization | Question answering | Code generation | Image recognition | Speech recognition | Protein structure prediction | Music generation

Advantages

Captures long-range dependencies | Highly parallelizable | Scalable to large datasets | State-of-the-art performance across tasks | Flexible architecture adaptable to many domains | Enables transfer learning at scale

Disadvantages

High memory requirements for long sequences | Quadratic complexity with sequence length | Requires large amounts of training data | Computationally expensive to train large models

Schema Type

DefinedTerm

Difficulty Level

Beginner