Tokenization
Short Definition
Full Definition
Tokenization is a foundational preprocessing step in natural language processing that directly impacts the performance, efficiency, and capabilities of language models. Before any text can be processed by an AI model, it must be converted into a sequence of discrete tokens that the model can understand. Early NLP systems used word-level tokenization, splitting text at spaces and punctuation. However, this approach struggles with unknown words and morphologically rich languages, and it produces extremely large vocabularies. Modern language models instead use subword tokenization methods that balance vocabulary size against the ability to represent any text. Byte-Pair Encoding (BPE), introduced for neural machine translation, iteratively merges the most frequent character pairs to build a vocabulary of subword units. SentencePiece and WordPiece are variations used by different model families. The tokenizer determines how many tokens are needed to represent a given text, which directly affects both cost (most LLM APIs charge per token) and context window usage. Different models use different tokenizers: GPT models use byte-level BPE (implemented in OpenAI's tiktoken library), while Claude uses its own tokenizer. A word like 'unhappiness' might be tokenized as 'un' + 'happiness' or as 'un' + 'happ' + 'iness', depending on the tokenizer. Tokenization also significantly affects multilingual performance: languages such as Chinese, Japanese, Korean, Arabic, and Hindi often require more tokens per concept than English, leading to higher costs and reduced effective context length. Current research explores tokenizer-free models and improved multilingual tokenization strategies.
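As a toy illustration of how the vocabulary determines the split, here is a greedy longest-match subword tokenizer. This is a simplification: real BPE tokenizers apply learned merge rules rather than longest-match, and the two vocabularies below are hypothetical, chosen only to reproduce the two possible splits of 'unhappiness'.

```python
def greedy_tokenize(text, vocab):
    """Split text into subwords by always taking the longest
    vocabulary entry that matches at the current position."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

# Two hypothetical vocabularies yield different splits of the same word.
vocab_a = {"un", "happiness", "happ", "iness"}
vocab_b = {"un", "happ", "iness"}
split_a = greedy_tokenize("unhappiness", vocab_a)  # ['un', 'happiness']
split_b = greedy_tokenize("unhappiness", vocab_b)  # ['un', 'happ', 'iness']
```

A three-token split consumes more of the context window (and budget) than a two-token split, which is why tokenizer choice matters for cost.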
Technical Explanation
Byte-Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent adjacent pair: starting from ['l', 'o', 'w'], if 'lo' is the most frequent pair, merge to ['lo', 'w']. The process repeats until the vocabulary reaches the desired size (typically 32K-100K tokens). WordPiece (used by BERT) selects merges by likelihood gain rather than raw frequency. SentencePiece operates on raw text without pre-tokenization, treating whitespace as an ordinary symbol so that tokenization is lossless and reversible. The tokenizer maintains a vocabulary mapping tokens to integer IDs: token_ids = tokenizer.encode('Hello world') -> [15496, 995]. Special tokens mark sequence boundaries, e.g. [CLS], [SEP], and [PAD] in BERT-style models, or <s> and </s> in SentencePiece-based models. The embedding layer maps each token ID to a dense vector: E(token_id) -> R^d. Vocabulary size affects model size through the embedding matrix, which has V x d parameters, where V is the vocabulary size and d is the embedding dimension.
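The BPE merge loop described above can be sketched as follows. This is a minimal training sketch on a toy corpus: the words, their frequencies, and the merge count of four are invented for illustration (real tokenizers run tens of thousands of merges over large corpora).

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by
    word frequency, and return the most frequent one."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    # max() breaks ties by insertion order, keeping runs deterministic
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with its merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in corpus.items()}

# Toy corpus: words pre-split into characters, with frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(4):  # in practice, repeat until vocab reaches ~32K-100K
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)
# merges: [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

After four merges, 'low' has become a single symbol and 'newest' is represented as 'n e w est'; the learned merge list is what the tokenizer later replays, in order, to encode unseen text.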
Use Cases
Advantages
Disadvantages
Schema Type
Featured Snippet Candidate
Difficulty Level