Tokenization
The process of splitting text into tokens (words, subwords, or other processing units).
Tokenization is the process of splitting text into tokens, the basic units of processing. It is an essential preprocessing step for natural language processing (NLP) and large language models (LLMs), where the granularity and method of splitting significantly influence downstream processing accuracy. Token units can be words, subwords, characters, or bytes depending on the purpose, and the same text can yield different results depending on the tokenizer used.
Major tokenization methods include word-level splitting, subword splitting, and character-level splitting. Word-level splitting is intuitive for space-delimited languages like English but cannot handle out-of-vocabulary (OOV) words. Subword splitting solves this problem, with algorithms like BPE (Byte Pair Encoding), WordPiece, SentencePiece, and Unigram being widely used. BPE builds vocabulary by repeatedly merging frequent character pairs and is adopted in the GPT series tokenizer (tiktoken).
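The BPE merge procedure can be sketched in a few lines of Python. This is a minimal training loop for illustration only, not a production tokenizer: each step counts adjacent symbol pairs across the corpus and merges the most frequent one.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words.

    Each word starts as a sequence of characters; at every step, the
    most frequent adjacent symbol pair is merged into a single symbol.
    """
    # Represent each word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

Running this on a tiny corpus such as `["low", "low", "lower", "newest", "newest"]` shows frequent character pairs like `("l", "o")` being merged first, which is the core idea behind the vocabularies used by tiktoken and similar tokenizers.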
Japanese tokenization is considerably more complex than English. Since Japanese lacks spaces between words, morphological analyzers (MeCab, Sudachi, Janome, etc.) are needed to estimate word boundaries. For LLMs, language-agnostic subword tokenizers like SentencePiece are commonly used, splitting Japanese kanji and hiragana into fine-grained subwords. For example, "東京都" (Tokyo-to) might be split into "東京" + "都" or "東" + "京" + "都" depending on the tokenizer's vocabulary size.
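The vocabulary-size effect on "東京都" can be illustrated with a toy greedy longest-match segmenter. This is a deliberate simplification of how real subword tokenizers apply a learned vocabulary, and the vocabularies below are made up for the example.

```python
def segment(text, vocab):
    """Greedy longest-match segmentation (a simplified stand-in for
    subword tokenization): at each position, take the longest piece
    found in the vocabulary, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens

# A larger vocabulary keeps "東京" as one piece; a smaller one
# falls back to individual characters.
print(segment("東京都", {"東京", "都"}))  # ['東京', '都']
print(segment("東京都", set()))          # ['東', '京', '都']
```

The same text thus yields coarser or finer token sequences purely as a function of what the tokenizer's vocabulary happens to contain.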
In practice, tokenization is particularly important for managing LLM usage costs. APIs for ChatGPT, Claude, and similar services charge based on input/output token counts, so prompts with fewer tokens cost less for the same content. Japanese tends to be less token-efficient than English (the same meaning requires more tokens), with a single Japanese character sometimes splitting into 1 to 3 tokens. This makes concise prompt design crucial when working with Japanese text.
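Turning token counts into a cost figure is simple arithmetic. The sketch below uses placeholder per-million-token prices; these are assumptions for illustration, not the actual rates of any provider.

```python
def estimate_cost(input_tokens, output_tokens, in_price, out_price):
    """Estimate API cost in USD.

    in_price / out_price are USD per million tokens (placeholder
    values supplied by the caller, not real provider rates).
    """
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: 1,000 input tokens at $3/M and 500 output tokens at $15/M.
print(estimate_cost(1000, 500, 3.0, 15.0))  # 0.0105
```

Because output tokens are typically priced several times higher than input tokens, trimming verbose outputs often saves more than trimming prompts.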
A common misconception is equating character count with token count. In reality, English words typically map to 1-2 tokens, while Japanese characters can become multiple tokens, so character count and token count do not match. Additionally, tokenizers differ between models, meaning the same text produces different token counts in GPT-4 versus Claude. Accurate token counting requires using each model's dedicated tokenizer.
Regarding character counting, text "length" is measured using three scales in practice: character count, byte count, and token count. Social media post limits use character count, database column sizes use byte count, and LLM input/output limits use token count. Adding token estimation to character counting tools helps LLM users verify prompt length in advance and optimize API usage costs.
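A minimal sketch of such a measuring tool follows; it reports all three scales. The characters-per-token ratio is a rough heuristic (an assumption, not a real tokenizer), since accurate token counts require the target model's own tokenizer.

```python
def text_lengths(text, chars_per_token=2.0):
    """Measure text 'length' on the three practical scales.

    chars_per_token is a crude estimate; real token counts must come
    from the target model's dedicated tokenizer.
    """
    return {
        "chars": len(text),                       # social media limits
        "bytes": len(text.encode("utf-8")),       # database column sizes
        "est_tokens": max(1, round(len(text) / chars_per_token)),  # LLM limits
    }

# Japanese text: 5 characters, but 15 UTF-8 bytes.
print(text_lengths("こんにちは"))
```

The example makes the mismatch between the three scales concrete: five Japanese characters occupy fifteen UTF-8 bytes, while the token count depends entirely on the tokenizer.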