Transformer

Short Definition

A Transformer is a deep learning architecture based on self-attention mechanisms, introduced in 2017 by Vaswani et al. It forms the foundation of modern large language models and has revolutionized natural language processing and many other AI domains.

Full Definition

The Transformer architecture replaced recurrent neural networks with self-attention mechanisms for sequence modeling. Unlike RNNs, which process sequences step by step, Transformers process all tokens simultaneously, enabling massive parallelization during training. The original architecture consists of an encoder and a decoder, each made up of multiple layers containing multi-head attention and feed-forward networks. This design allows the model to capture long-range dependencies in text far more effectively than previous architectures. Transformers are the backbone of models like GPT, BERT, T5, and LLaMA (in decoder-only, encoder-only, and encoder-decoder variants), and have been extended beyond NLP to computer vision, audio, and multimodal tasks. The original paper, "Attention Is All You Need", has become one of the most cited works in AI history, fundamentally changing how sequence data is processed across virtually every domain of artificial intelligence.

Technical Explanation

The Transformer uses scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension. Multi-head attention runs this in parallel across multiple representation subspaces, allowing the model to attend to information from different positions simultaneously. Because the architecture has no inherent sense of sequence order, positional encodings are added to the input embeddings. The encoder processes input sequences into contextual representations, while the decoder generates output sequences autoregressively. Residual connections and layer normalization stabilize the training of deep networks.
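The attention formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the batched, multi-head implementation used in real models; the function name and toy shapes are chosen for this example only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (num_queries, num_keys) similarity scores
    # Numerically stable softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted average of value rows

# Toy example: 3 tokens, key/value dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query token
```

The sqrt(d_k) scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.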

Use Cases

Machine translation | Text summarization | Question answering | Code generation | Image recognition | Speech recognition | Protein structure prediction | Music generation

Advantages

Captures long-range dependencies | Highly parallelizable | Scalable to large datasets | State-of-the-art performance across tasks | Flexible architecture adaptable to many domains | Enables transfer learning at scale

Disadvantages

High memory requirements for long sequences | Quadratic complexity with sequence length | Requires large amounts of training data | Computationally expensive to train large models

Schema Type

DefinedTerm

Difficulty Level

Beginner