Tokenization
The process of splitting text into tokens, the basic units of processing.
Tokens can be words, subwords, or characters, depending on the application; the choice of unit affects vocabulary size and how a model handles words it has never seen.
Large language models (LLMs) use subword tokenizers such as BPE (Byte Pair Encoding) and SentencePiece, which break unknown words into known subword units instead of mapping them to a single out-of-vocabulary symbol. Input and output limits for services like ChatGPT are likewise measured in tokens, not characters.
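The core idea of BPE, repeatedly merging the most frequent adjacent pair of symbols, can be sketched in a few lines of plain Python. The toy corpus and merge count below are illustrative only, not any real model's vocabulary:

```python
import re
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    # Represent each word as space-separated characters plus an end-of-word marker.
    vocab = {" ".join(w) + " </w>": f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Merge that pair wherever it appears as two whole symbols.
        pat = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pat.sub("".join(best), w): f for w, f in vocab.items()}
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(bpe_merges(corpus, 3))
# → [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Production tokenizers add byte-level fallback, learned tie-breaking, and a fast merge lookup, but the training loop is essentially this.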
Japanese tokenization is closely tied to morphological analysis, because Japanese is written without spaces between words; morphological analyzers such as MeCab and Sudachi are the standard tools. English tokenization, by contrast, relies primarily on whitespace and punctuation.
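Space-and-punctuation tokenization for English can be approximated with a single regular expression. This is a deliberate simplification (it splits contractions like "Don't" apart), not a production tokenizer:

```python
import re

def simple_tokenize(text):
    # Runs of word characters become one token; each other
    # non-space character (punctuation) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Don't panic, it's fine."))
# → ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'fine', '.']
```

The same regex applied to unsegmented Japanese would return whole clauses as single "words," which is why morphological analyzers are needed there.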
Character-counting tools that estimate token counts are useful for LLM users who need to keep prompts within a model's context limit.
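Such estimators often use a rule of thumb of roughly four characters per token for English text with GPT-style tokenizers. The sketch below uses that heuristic as an assumption; exact counts require running the model's actual tokenizer:

```python
def estimate_tokens(text, chars_per_token=4):
    # Rough heuristic: ~4 characters per token for English text.
    # The ratio is worse for non-English scripts, so treat this as
    # an estimate, not a guarantee of fitting a context window.
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Tokenization splits text into tokens."))
# → 9
```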