Tokenization

Short Definition

Tokenization is the process of breaking down text into smaller units called tokens, which serve as the basic building blocks that language models process. These tokens can be words, subwords, characters, or byte-level units, depending on the tokenization method used.

Full Definition

Tokenization is a foundational preprocessing step in natural language processing that directly impacts the performance, efficiency, and capabilities of language models. Before any text can be processed by an AI model, it must be converted into a sequence of discrete tokens that the model can understand. Early NLP systems used word-level tokenization, splitting text at spaces and punctuation. This approach struggles with unknown words and morphologically rich languages, and it produces extremely large vocabularies.

Modern language models therefore use subword tokenization methods that balance vocabulary size with the ability to represent any text. Byte-Pair Encoding (BPE), introduced to NLP for neural machine translation, iteratively merges the most frequent symbol pairs to build a vocabulary of subword units. WordPiece and SentencePiece are related approaches used by different model families. The tokenizer determines how many tokens are needed to represent a given text, which directly affects both cost (most LLM APIs charge per token) and context window usage. Different models use different tokenizers: GPT models use byte-level BPE encodings (exposed through the tiktoken library), while Claude uses its own tokenizer. A word like ‘unhappiness’ might be tokenized as ‘un’ + ‘happiness’ or ‘un’ + ‘happ’ + ‘iness’ depending on the tokenizer.

Tokenization also significantly affects multilingual performance: languages such as Chinese, Japanese, Korean, Arabic, and Hindi often require more tokens per concept than English, leading to higher costs and a reduced effective context length. Current research explores tokenizer-free models and improved multilingual tokenization strategies.
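
As a concrete illustration, the short sketch below counts tokens with OpenAI's open-source tiktoken library; the encoding name "cl100k_base" is one of its built-in encodings, and the example sentence is made up, so the exact split will differ across tokenizers.

import tiktoken

# Minimal sketch: count tokens and inspect the subword piece behind each ID.
# Assumes the tiktoken package is installed; "cl100k_base" is a built-in encoding.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization handles words like unhappiness."
ids = enc.encode(text)

print(len(ids), "tokens")                      # the token count is what drives API cost
print([enc.decode([i]) for i in ids])          # the subword piece for each token ID
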

Technical Explanation

Byte-Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent adjacent pair: starting with ['l', 'o', 'w'], if 'lo' is the most frequent pair it is merged to ['lo', 'w']. The process repeats until reaching the desired vocabulary size (typically 32K-100K tokens). WordPiece (used by BERT) selects merges by likelihood gain rather than raw frequency. SentencePiece operates on raw text without pre-tokenization, treating whitespace as an ordinary symbol in the character stream; GPT-style tokenizers instead apply BPE at the byte level so any input can be represented. The tokenizer maintains a vocabulary mapping tokens to integer IDs: token_ids = tokenizer.encode("Hello world") -> [15496, 995]. Special tokens mark sequence boundaries and padding, for example [CLS], [SEP], and [PAD] in BERT-style models, or <s> and </s> in other families. The embedding layer maps token IDs to dense vectors: E(token_id) -> R^d. Vocabulary size affects model size through the embedding matrix, which has V x d parameters, where V is the vocabulary size and d is the embedding dimension.
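
To make the merge loop concrete, here is a minimal BPE training sketch in plain Python. The tiny corpus, word frequencies, and number of merges are made-up values for illustration; production tokenizers add pre-tokenization, byte-level handling, and special tokens on top of this core loop.

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the chosen pair with the merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Tiny made-up corpus: words as space-separated symbols, with frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(10):                      # 10 merges here; real vocabularies use tens of thousands
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best, "->", "".join(best))     # each merge adds one subword unit to the vocabulary
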

Use Cases

Preprocessing text for language models | API cost estimation and optimization | Context window management | Multilingual text processing | Code tokenization | Search engine indexing | Text classification preprocessing | Sentiment analysis pipeline
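
For the API cost estimation use case, a rough estimate can be derived directly from token counts. The sketch below assumes the tiktoken library and a placeholder per-1K-token rate; the rate, function name, and prompt text are illustrative only, not real pricing or a provider API.

import tiktoken

# Placeholder rate for illustration only; real per-token prices vary by provider and model.
USD_PER_1K_INPUT_TOKENS = 0.003

def estimate_prompt_cost(text, encoding_name="cl100k_base"):
    # Count tokens with the named encoding, then apply the assumed per-1K-token rate.
    enc = tiktoken.get_encoding(encoding_name)
    n_tokens = len(enc.encode(text))
    return n_tokens / 1000 * USD_PER_1K_INPUT_TOKENS

prompt = "Summarize the following report in three bullet points: ..."
print(f"{estimate_prompt_cost(prompt):.6f} USD (estimated input cost)")
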

Advantages

Handles unknown words through subword decomposition | Balances vocabulary size with coverage | Language-agnostic with byte-level methods | Efficient representation of common words | Enables processing of any text input | Well-established algorithms with proven performance

Disadvantages

Can fragment meaningful words into unclear subwords | Multilingual tokenization often favors English | Token counts vary significantly across languages | Tokenizer choice is irreversible after model training | Can introduce artifacts in model behavior | Byte-level tokenization increases sequence length

Schema Type

DefinedTerm

Difficulty Level

Beginner