BPE (Byte Pair Encoding)

An algorithm that splits text into subword units by repeatedly merging frequently co-occurring byte (or character) pairs. Widely adopted as the tokenizer in large language models such as GPT and LLaMA.

BPE (Byte Pair Encoding) was originally proposed by Philip Gage in 1994 as a data compression algorithm. In 2016, Sennrich et al. adapted it for subword segmentation in machine translation, and it quickly became the standard tokenization method in natural language processing. Most major large language models today use BPE or a close variant: the GPT series and LLaMA use byte-level BPE, while BERT uses the closely related WordPiece algorithm.

The BPE training algorithm is intuitive. First, the training text is split into individual characters (or bytes). Then, the most frequent adjacent pair is identified and merged into a single new token. This merging process repeats until the vocabulary reaches a target size. For example, given the words "low," "lower," and "lowest," the algorithm might merge "l" + "o" into "lo," then "lo" + "w" into "low," progressively building up common subwords.
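To make the procedure concrete, here is a minimal Python sketch of BPE training on the toy corpus above. It is an illustration only, with hypothetical word frequencies; real trainers (such as the original subword-nmt implementation) add pre-tokenization, end-of-word markers, and efficiency optimizations.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Toy BPE trainer: learn merge rules from a word-frequency dict."""
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

merges = train_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```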

A key strength of BPE is that it effectively eliminates the out-of-vocabulary problem. Word-level tokenization cannot handle words absent from the training data. BPE, by contrast, can represent any word as a combination of subwords, falling back to single characters (or, in byte-level BPE, single bytes) when necessary, so true out-of-vocabulary failures do not arise. The word "unhappiness" might be split into "un" + "happiness" or "un" + "happ" + "iness," with each subword present in the vocabulary.
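Segmenting a new word is simply a matter of replaying the learned merge rules in training order. The sketch below illustrates this with a hand-picked, hypothetical merge list chosen so that "unhappiness" splits into recognizable pieces; a real tokenizer applies tens of thousands of learned rules.

```python
def encode(word, merges):
    """Apply learned merge rules, in training order, to segment a word."""
    symbols = list(word)
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list for illustration; real rule lists are learned from data.
merges = [("u", "n"), ("h", "a"), ("ha", "p"), ("hap", "p"), ("happ", "i")]
print(encode("unhappiness", merges))
# ['un', 'happi', 'n', 'e', 's', 's'] — an unseen word still maps to known subwords
```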

Applying BPE to Japanese presents unique challenges. Because Japanese does not use spaces to delimit words, there are two main approaches: first performing morphological analysis to segment words and then applying BPE, or learning BPE directly at the character or byte level. GPT-4's tokenizer (cl100k_base) takes the latter approach, operating on raw UTF-8 bytes: a single kanji may become one token, or a run of hiragana may be merged into one token.
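You can observe this directly with OpenAI's open-source tiktoken library (assuming `pip install tiktoken`), which ships the cl100k_base encoding. Because cl100k_base works on bytes, an individual token does not always align to a character boundary; the sample strings below are arbitrary.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 tokenizer

for text in ["tokenization", "トークン化", "ひらがなだけの文"]:
    token_ids = enc.encode(text)
    # decode_single_token_bytes exposes the raw bytes behind each token;
    # a byte-level token may split a multi-byte character mid-sequence.
    pieces = [enc.decode_single_token_bytes(t) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")
```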

The relationship between token count and character count varies significantly by language. In English, one token averages about 4 characters, while in Japanese one token corresponds to roughly 1 to 2 characters. This means the same content consumes 2 to 3 times as many tokens in Japanese as in English. A 128,000-token context window, as in GPT-4 Turbo, translates to approximately 500,000 English characters but only about 150,000 to 200,000 Japanese characters.
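A quick way to check these ratios on your own text is to divide character count by token count. Here is a short sketch using the same tiktoken encoding; the sample sentences are arbitrary, and actual ratios vary with content.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Byte pair encoding merges frequent symbol pairs into subwords.",
    "Japanese": "バイト対符号化は頻出する文字の組を部分語へ統合する手法です。",
}
for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{lang}: {len(text)} chars / {n_tokens} tokens "
          f"= {len(text) / n_tokens:.2f} chars per token")
```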

From a character-counting perspective, the AI era demands awareness of token count alongside character count. When designing ChatGPT prompts, the binding constraint is tokens, not characters, so writing concisely and avoiding redundant phrasing improves token efficiency. Understanding how BPE works also helps you estimate which expressions will tokenize more efficiently.
