Token

The smallest unit of text processing. LLMs use their own tokenization schemes, which differ from both characters and words.

A token is the smallest unit of text processing. In natural language processing (NLP) and large language models (LLMs), text is split into tokens (tokenized) before being processed. Tokens are a unit distinct from both characters and words, with segmentation methods varying by language and model. With the proliferation of LLMs, the concept of tokens has become widely known not only among programmers but also among general users.

Tokenizers used in LLMs such as ChatGPT (BPE: Byte Pair Encoding) learn frequently occurring string patterns as single tokens. In English, one token corresponds to approximately 4 characters (about 0.75 words): "Hello" is 1 token, while "indistinguishable" splits into 4 tokens. In Japanese, one token corresponds to approximately 1-3 characters: hiragana and katakana are typically 1-2 characters per token, while kanji are often 1 character per token. GPT-4o's context window is 128K tokens, equivalent to approximately 96,000 English words or 64,000-128,000 Japanese characters. Introductory NLP books explain tokenization mechanisms in detail.
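The core idea of BPE can be sketched in a few lines: start from individual characters and repeatedly merge the most frequent adjacent pair into a new symbol. This is a minimal toy illustration, not the actual tokenizer used by any specific model (real BPE trains on a word-frequency corpus and handles bytes, not characters):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def bpe_merge(tokens, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters; after a few merges the frequent substring
# "low" becomes a single token.
tokens = list("low lower lowest")
for _ in range(3):
    pair = most_frequent_pair(tokens)
    tokens = bpe_merge(tokens, pair)
```

Production tokenizers apply the same merge idea at byte level, which is why frequent English words collapse into one token while rarer strings stay split.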

API pricing is calculated based on token count, so token management is directly tied to cost control. OpenAI's API charges separately for input and output tokens, and prompt optimization (removing unnecessary context, writing concise instructions) is key to cost reduction. The tiktoken library enables pre-calculating token counts before API calls, helping prevent context window overflow.
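A pre-flight check along these lines might look as follows. This is a sketch, not an official recipe: the `estimate_tokens` fallback uses the rough 4-characters-per-token rule mentioned above, and the encoding-name fallback to `cl100k_base` is a defensive assumption for older tiktoken versions:

```python
try:
    import tiktoken  # third-party: pip install tiktoken
except ImportError:
    tiktoken = None

def estimate_tokens(text: str) -> int:
    """Rough fallback: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact count via tiktoken when available, heuristic otherwise."""
    if tiktoken is None:
        return estimate_tokens(text)
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Older tiktoken releases may not know the model name.
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

# Check a prompt against the context window before sending it,
# instead of letting the API reject or truncate the request.
CONTEXT_WINDOW = 128_000  # GPT-4o, per the figure quoted above
prompt = "Summarize the following article in three bullet points."
if count_tokens(prompt) >= CONTEXT_WINDOW:
    raise ValueError("prompt exceeds the context window")
```

Counting locally before every call also makes it easy to log projected input-token costs per request.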

Japanese tends to have lower token efficiency than English: the same meaning expressed in Japanese consumes more tokens, because BPE tokenizer training data is predominantly English. For example, the Japanese sentence for "Tokyo is the capital of Japan" uses approximately 8-10 tokens, while the English version uses about 7. This difference becomes a significant cost factor when processing large volumes of text.

The token concept is used beyond LLMs. In programming language compilers, source code is split into tokens through lexical analysis. In this context, tokens are syntactic elements such as keywords, identifiers, operators, and literals. In authentication, data structures containing authentication information are called tokens, as in JWT (JSON Web Token). Since the meaning of "token" varies by domain, clarifying the context is important.
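The compiler sense of "token" can be illustrated with a minimal lexer. The token categories and regular expressions below are illustrative choices, not any particular language's grammar:

```python
import re

# A minimal lexer: (regex, kind) pairs, tried in order at each position.
TOKEN_SPEC = [
    (r"\d+",          "NUMBER"),
    (r"[A-Za-z_]\w*", "IDENT"),
    (r"[+\-*/=]",     "OP"),
    (r"\s+",          None),  # whitespace: matched but not emitted
]

def tokenize(source: str):
    """Split source code into (kind, text) tokens via lexical analysis."""
    tokens, pos = [], 0
    while pos < len(source):
        for pattern, kind in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if kind:
                    tokens.append((kind, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
    return tokens

# tokenize("x = 42 + y") yields:
# [("IDENT", "x"), ("OP", "="), ("NUMBER", "42"), ("OP", "+"), ("IDENT", "y")]
```

Unlike LLM tokenization, this segmentation is deterministic and defined by the language's grammar rather than learned from data.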

From a character counting perspective, token count and character count are different metrics. Character count represents the units humans visually recognize, while token count represents the units the model processes. In practical LLM usage, both character limits (such as social media post limits) and token limits (API context windows) must be considered. Adding token count estimation to character counting tools can significantly enhance user convenience. Practical ChatGPT guides cover token-aware prompt design.