Token
The smallest unit of text processing. LLMs use their own tokenization schemes that differ from characters or words.
A token is the smallest unit of text that a language model processes. In natural language processing (NLP) and large language models (LLMs), text is split into tokens (tokenized) before processing.
In LLMs like ChatGPT, one English token corresponds to roughly 4 characters, or about 0.75 words. Japanese is less compact: a token typically covers only 1-2 characters, so the same content requires more tokens. GPT-4o's 128K-token context window is equivalent to about 96,000 English words. NLP introduction books explain tokenization mechanisms in detail.
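The character-based heuristics above can be sketched as a rough estimator. This is only an approximation; exact counts require the model's actual tokenizer (e.g. OpenAI's tiktoken library), and the characters-per-token ratios here are the rules of thumb from the text, not measured values.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token English heuristic.

    For Japanese, pass a smaller ratio (roughly 1-2 characters per token).
    """
    return math.ceil(len(text) / chars_per_token)

english = "The quick brown fox jumps over the lazy dog."  # 44 characters
print(estimate_tokens(english))                # ceil(44 / 4)   = 11
print(estimate_tokens("こんにちは世界", 1.5))  # ceil(7 / 1.5)  = 5
```

The same 7-character greeting costs 5 estimated tokens in Japanese versus roughly 2 for an English word of similar meaning, illustrating the efficiency gap described above.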
API pricing is based on token counts (input and output tokens are typically billed at different rates), so prompt optimization is directly tied to cost management.
Because Japanese text is less token-efficient than English, the same content consumes more tokens and therefore costs more. ChatGPT prompt engineering books cover token-aware prompt design strategies.
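Since billing is per token, a cost estimate follows directly from the token counts. A minimal sketch, assuming hypothetical per-million-token prices (actual rates vary by model and provider and should be taken from the provider's current price list):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_m: float = 2.50,
                      output_price_per_m: float = 10.00) -> float:
    """Estimate API cost in USD.

    Prices are hypothetical placeholders expressed per 1M tokens;
    input and output tokens are billed at different rates.
    """
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# e.g. a 1,000-token prompt producing a 500-token reply
print(round(estimate_cost_usd(1_000, 500), 4))  # 0.0075
```

Halving prompt length halves the input-side cost, which is why the token-aware prompt design mentioned above pays off directly on the bill.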