What is Tokenization?

Natural Language Processing

Tokenization

Tokenization is the process of splitting text into tokens that a language model can process. Modern tokenizers like BPE and SentencePiece balance vocabulary size with the ability to represent any text sequence.

Understanding Tokenization

Tokenization is the process of breaking raw text into smaller units called tokens that serve as the input vocabulary for language models. Modern tokenization algorithms like Byte Pair Encoding, WordPiece, and SentencePiece operate at the subword level, balancing vocabulary size with the ability to represent any text including rare words and multiple languages. Effective tokenization directly impacts model performance, training efficiency, and multilingual capabilities. A well-designed tokenizer produces consistent, meaningful units that help the model learn linguistic patterns efficiently. Tokenization also determines the effective context length of a model, since context windows are measured in tokens rather than words or characters. The choice of tokenizer is tightly coupled with model architecture and pre-training, meaning that each large language model family typically has its own specific tokenizer.

Is AI recommending your brand?

Find out if ChatGPT, Perplexity, and Gemini mention you when people search your industry.

Check your brand — $9

Related Natural Language Processing Terms

Abstractive Summarization

Abstractive summarization generates new text that captures the key points of a longer document, rather than simply extracting existing sentences. It requires deep language understanding and generation capabilities.

Beam Search

Beam search is a decoding algorithm that explores multiple candidate sequences simultaneously, keeping only the top-k most promising at each step. It balances between greedy decoding and exhaustive search in text generation.

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a language model developed by Google that reads text in both directions simultaneously. BERT revolutionized NLP by enabling deep bidirectional pre-training for language understanding tasks.

Tool Use

Back to full glossary

Tokenization

Understanding Tokenization

Is AI recommending your brand?

Related Natural Language Processing Terms

Abstractive Summarization

Beam Search

BERT

Bigram

Byte Pair Encoding

Corpus

Extractive Summarization

Grounding