Glossary

Text Measurement

Character Encoding

Unicode

A universal character encoding standard that assigns code points to more than 140,000 characters, covering most of the world's writing systems as well as symbols and emoji.

UTF-8

A variable-length Unicode encoding that uses one to four bytes per code point and is backward compatible with ASCII. The dominant character encoding on the web, used by over 98% of websites.
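
As a rough illustration of the variable length, the sketch below encodes four sample characters and counts the resulting bytes. The sample characters and the names `samples` and `byte_lengths` are chosen here for illustration only.

```python
# Each sample is written as an escape so the exact code point is unambiguous.
samples = {
    "A": "\u0041",                  # Latin capital letter A
    "e-acute": "\u00e9",            # precomposed e with acute accent
    "hiragana a": "\u3042",         # Japanese hiragana letter A
    "grinning face": "\U0001F600",  # an emoji outside the BMP
}

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(byte_lengths)
```

The counts grow from one byte for the ASCII letter to four bytes for the emoji.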

Shift_JIS

A Japanese character encoding widely used in legacy systems. Being gradually replaced by UTF-8.

ASCII

A 7-bit character encoding standard representing 128 characters including English letters, digits, and basic symbols.

UTF-16

A Unicode encoding built on 16-bit code units: characters in the BMP take one code unit, and all other characters take two (a surrogate pair). Used internally by JavaScript, Java, and Windows.

EUC-JP

A Japanese character encoding widely used on UNIX systems. Part of the Extended Unix Code family.

ISO-2022-JP

A Japanese encoding designed for email. Uses escape sequences to switch between character sets.

BOM (Byte Order Mark)

A byte sequence at the start of a file that signals the encoding and, for multi-byte code units, the byte order. EF BB BF for UTF-8, FF FE for little-endian UTF-16, and FE FF for big-endian UTF-16.
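
A minimal sketch of BOM detection, assuming only the UTF-8 and UTF-16 marks need to be recognized. The function name `detect_bom` and the returned labels are illustrative choices. UTF-32 is deliberately not handled, and its little-endian BOM also begins with FF FE.

```python
import codecs

def detect_bom(data: bytes):
    """Return a label for the BOM found at the start of data, or None."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None

print(detect_bom(b"\xef\xbb\xbfhello"))
```

A missing BOM does not mean a file is not UTF-8; many UTF-8 files are written without one.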

Code Point

A unique number assigned to each character in Unicode. Written as U+ followed by hexadecimal digits, e.g., U+0041 (A).

Surrogate Pair

A mechanism in UTF-16 to represent characters outside the BMP using two 16-bit code units: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF).
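
The arithmetic behind a surrogate pair can be sketched as follows. The helper name `to_surrogate_pair` is hypothetical; the example code point U+1F600 is picked only because it lies outside the BMP.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")
```

The result can be cross-checked against Python's own encoder with `chr(0x1F600).encode("utf-16-be")`.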

Combining Character

A Unicode character that combines with the preceding base character for display. Includes diacritical marks and dakuten.

Endianness

The byte order of multi-byte data. Two types exist: big-endian (most significant byte first) and little-endian (least significant byte first).

Character Set

A defined collection of characters and their numbering system. ASCII, ISO 8859, and Unicode are representative examples.

GSM-7 Encoding

A 7-bit character encoding used in SMS. Because a single message carries 140 bytes, it allows up to 160 characters of alphanumeric text and basic symbols per message.

Mojibake (Character Corruption)

A phenomenon where text displays as garbled symbols or incorrect characters due to a mismatch between the encoding used to write the data and the encoding used to read it.
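
A small sketch of how such a mismatch arises, assuming text written as UTF-8 is mistakenly read as Latin-1. The sample word and variable names are illustrative.

```python
original = "caf\u00e9"  # the word "cafe" with an acute accent on the final e

# Write as UTF-8, then read the same bytes back with the wrong encoding.
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the wrong step recovers the text, as long as no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

The two-byte UTF-8 sequence for the accented letter shows up as two unrelated Latin-1 characters, which is the typical look of mojibake.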

Character Encoding

A system of rules that maps characters to sequences of bits. It consists of two layers: a character set (which characters are included) and an encoding scheme (how those characters are converted to byte sequences).

JIS (Japanese Industrial Standards)

Japan's national standards for industrial products. In the character encoding domain, JIS X 0208 and JIS X 0213 form the foundation of Japanese text processing.

Variable-Length Encoding

An encoding scheme where different characters use different numbers of bytes. UTF-8 and Shift_JIS are representative examples, achieving efficiency by representing frequently used characters with shorter byte sequences.

BMP (Basic Multilingual Plane)

The first 65,536 code points (U+0000 to U+FFFF) in Unicode. Most characters used in everyday writing are found here; characters outside this range require surrogate pairs in UTF-16.

Character Types

Full-Width Character

A character that occupies twice the width of a half-width character in fixed-width fonts. Common in CJK text.

Half-Width Character

A character that occupies half the width of a full-width character in fixed-width fonts. ASCII characters are half-width.
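
One common way to estimate width in fixed-width display is the Unicode East Asian Width property. The sketch below counts Wide and Fullwidth characters as two columns and everything else as one. The function name `display_width` is hypothetical, and the approach is a simplification: it ignores combining characters and treats ambiguous-width characters as narrow.

```python
import unicodedata

def display_width(text: str) -> int:
    """Estimate the column width of text in a fixed-width font."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

print(display_width("abc"))           # three half-width letters
print(display_width("\u3042\u3044"))  # two hiragana characters
```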

Hiragana

One of the Japanese phonetic writing systems. Used for native words, particles, and verb endings.

Katakana

One of the Japanese phonetic writing systems. Used for loanwords, onomatopoeia, and scientific terms.

Kanji

Logographic characters originating from China. Japan's jōyō kanji list designates 2,136 of them as the guideline for everyday use.

Grapheme Cluster

The smallest visual unit that a human perceives as a single character. May consist of multiple code points.

Emoji

Pictographic symbols encoded in Unicode. Used to visually express emotions and concepts in text communication.

Romaji

The romanization of Japanese using Latin alphabet characters. Hepburn and Kunrei-shiki are the main systems.

Zero-Width Space

An invisible character with zero display width (U+200B). Used as a line break hint and for text processing control.

Diacritical Mark

Auxiliary symbols added above or below characters. They indicate pronunciation differences; examples include accents and umlauts.

Ideograph

A character that itself carries meaning rather than only representing a sound. Chinese characters (kanji) are the prime example, encoded as CJK Unified Ideographs in Unicode.

ZWJ (Zero Width Joiner)

A zero-width control character (U+200D) in Unicode used to join multiple characters or emoji into a single display unit.
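
As an illustration, the sketch below joins two emoji with a ZWJ and lists the code points of the result. The particular emoji (woman and laptop) are an example chosen here; whether the sequence renders as a single glyph depends on the font and platform.

```python
woman = "\U0001F469"   # WOMAN
zwj = "\u200d"         # ZERO WIDTH JOINER
laptop = "\U0001F4BB"  # PERSONAL COMPUTER

# A ZWJ sequence is simply the parts concatenated with U+200D between them.
technologist = woman + zwj + laptop
code_points = [f"U+{ord(ch):04X}" for ch in technologist]

print(code_points)
print(len(technologist))  # counted in code points, not in visible characters
```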

Variant Glyph

Kanji characters that share the same meaning and reading but differ in visual form. Includes relationships such as standard forms, popular forms, old-style characters, and simplified characters.

Control Character

Special characters that are not displayed on screen but instruct how text should be processed. Examples include line feed, tab, carriage return, and null.

Invisible Character

A collective term for characters that exist within text data but have no visible representation on screen. Includes zero-width spaces, bidirectional control characters, and soft hyphens.

Joyo Kanji (Common-Use Kanji)

A list of 2,136 kanji issued by the Japanese government as a Cabinet notice in 2010, based on a report by the Council for Cultural Affairs, as a guideline for everyday kanji usage.

Character Type

A classification of the characters that make up text, including hiragana, katakana, kanji, Latin letters, digits, and symbols.

Text Processing

Token

The smallest unit of text processing. LLM tokenizers split text into units that often correspond to neither single characters nor whole words.

Truncation

The process of cutting text to a specified length. Used to fit display areas or database column limits.
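
When the limit is expressed in bytes (as with many database columns), cutting at an arbitrary byte offset can split a multi-byte character. The sketch below is one simple way to avoid that for UTF-8, using a hypothetical helper named `truncate_utf8`. It can still split a grapheme cluster such as an emoji ZWJ sequence.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so its UTF-8 encoding fits in max_bytes, dropping any partial character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" discards an incomplete trailing byte sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_utf8("\u3042\u3044\u3046", 7))  # three hiragana, 3 bytes each
```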

Line Break

The process of wrapping text to the next line. Controlled in CSS by word-break and overflow-wrap properties.

Newline Code

Control characters representing line breaks. Three types exist: LF (U+000A; Unix, Linux, and macOS), CR (U+000D; classic Mac OS), and CRLF (CR followed by LF; Windows).
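
Because the three conventions change character counts, text is often normalized to LF before measuring. A minimal sketch, with `normalize_newlines` as an illustrative name:

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so its CR is not converted a second time.
    return text.replace("\r\n", "\n").replace("\r", "\n")

mixed = "a\r\nb\rc\nd"
print(repr(normalize_newlines(mixed)))
```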

Unicode Normalization

The process of unifying different representations of the same character. Four forms exist: NFC, NFD, NFKC, and NFKD.
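
The effect on length and comparison can be seen with Python's standard `unicodedata` module. The sample strings below are illustrative.

```python
import unicodedata

decomposed = "e\u0301"  # letter e followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # the same visible character, different lengths
print(decomposed == composed)          # unequal until both are normalized

# The compatibility forms also fold variants such as full-width letters.
folded = unicodedata.normalize("NFKC", "\uff21")  # FULLWIDTH LATIN CAPITAL LETTER A
print(folded)
```

Normalizing both sides to the same form before comparing or counting avoids surprises from visually identical strings.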

Trim

The process of removing whitespace from the beginning and end of a string. Provided as a standard method in most programming languages.

Escape Sequence

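A character sequence that stands in for a character that is hard to type or has special meaning. In many programming languages, a backslash followed by a character (such as \n or \t) represents newlines, tabs, and other control characters.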

String Concatenation

The process of joining multiple strings into one. Achieved using the + operator, template literals, or dedicated methods.

Substring

The process of extracting a portion of a string. In JavaScript, this is done with methods like slice and substring; substr also exists but is a deprecated legacy method.

String Interpolation

Embedding variable or expression values within a string using template literals or similar syntax.

Padding

Filling a string with specific characters to reach a desired length. In JavaScript, implemented with the padStart and padEnd methods.

Base64

An encoding scheme that converts binary data to ASCII strings using 64 characters: A-Z, a-z, 0-9, +, and /, with = used as padding. Every 3 bytes of input become 4 characters of output.
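
A short round trip using Python's standard `base64` module; the two-byte input is an arbitrary example.

```python
import base64

raw = b"Hi"
encoded = base64.b64encode(raw).decode("ascii")
decoded = base64.b64decode(encoded)

print(encoded)  # two input bytes need one padding character
print(decoded == raw)
```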

Percent-Encoding

An encoding scheme that represents special characters in URLs using %XX hexadecimal format. Also known as URL encoding.
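
A minimal sketch using `urllib.parse` from the Python standard library. Non-ASCII characters are first encoded as UTF-8 and each byte is then written as %XX; the sample string is illustrative.

```python
from urllib.parse import quote, unquote

text = "\u65e5\u672c a&b"  # two kanji, a space, and an ampersand
encoded = quote(text)
restored = unquote(encoded)

print(encoded)
print(restored == text)
```

Each kanji becomes three %XX groups, so the encoded length is far larger than the visible character count.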

Diff

The process of detecting and displaying differences between two texts. Foundation technology for version control and code review.

Text Compression

Technology for reducing text data size. Algorithms like gzip, Brotli, and deflate are commonly used.

Levenshtein Distance

An edit distance between two strings: the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into the other.
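
The definition can be computed with dynamic programming. The sketch below is one common row-by-row formulation, with `levenshtein` as an illustrative name. It compares code points, so strings should be normalized first if combining characters are involved.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein distance between a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```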

Fuzzy Matching

A search technique that finds similar strings rather than exact matches. Handles typos and spelling variations.

Flick Input

A Japanese smartphone input method where characters are selected by flicking keys in four directions on a touchscreen. Faster than toggle input.

Validation

The process of verifying that input data conforms to specified formats, ranges, and constraints. Includes character count limits, character type checks, and format verification.

Placeholder

Temporary text displayed inside an input field that shows users the expected format or provides an example. It disappears when the user begins typing.

Case Conversion

The process of converting alphabetic characters between uppercase and lowercase forms. Conversion rules vary by language, and in some cases the character count changes after conversion.
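
A well-known case where the character count changes is the German sharp s, which Python's `str.upper` expands to two letters. The sample word is illustrative.

```python
word = "stra\u00dfe"  # German word containing LATIN SMALL LETTER SHARP S
upper = word.upper()

print(upper)
print(len(word), len(upper))  # the uppercase form is one character longer

# casefold() is intended for case-insensitive comparison rather than display.
print(word.casefold() == "STRASSE".casefold())
```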

Parsing

The process of analyzing text data according to syntactic rules and converting it into structured data.

Chunk

A smaller unit produced by dividing a large body of data or text into manageable pieces. Used for AI token limit management, streaming delivery, and file transfer.

Proofreading and Editing

The revision process for improving the quality of written text. Editing focuses on refining expression and structure, while proofreading targets typos and factual errors.

OCR (Optical Character Recognition)

A technology that automatically recognizes characters in images or scanned documents and converts them into editable text data.

Predictive Text Input

A feature that predicts the word or phrase a user is about to type based on the characters entered so far, presenting suggestions in a candidate list.

Text Editor

Software designed for creating and editing text files. It works with plain text and typically offers features such as character counting, find-and-replace, and syntax highlighting.

Manuscript

The source text created for printing, publishing, or broadcasting. The 400-character manuscript sheet is the standard unit for Japanese text volume.

Sorting (String Ordering)

The process of arranging strings in a specific order. The correct order varies by language and culture.

Platform Limits

Internationalization

Regular Expressions

Natural Language Processing

Morphological Analysis

The process of segmenting text into minimal meaningful units (morphemes) and assigning grammatical information.

Tokenization

The process of splitting text into tokens (words, subwords, or other processing units).

Stopword

Frequently occurring words excluded from search and text analysis, such as "a," "the," "is," and "in."

N-gram

A method of splitting text into contiguous subsequences of N characters or words, used in search and text similarity.
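
A minimal sketch of the splitting step; the helper name `ngrams` is hypothetical. It works on any sequence, so a string yields character N-grams and a list of words yields word N-grams.

```python
def ngrams(items, n):
    """Return every contiguous run of n elements from items."""
    return [items[i:i + n] for i in range(len(items) - n + 1)]

print(ngrams("text", 2))                    # character bigrams
print(ngrams(["the", "cat", "sat"], 2))     # word bigrams
```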

Sentiment Analysis

The process of determining emotional polarity (positive, negative, neutral) from text.

TF-IDF

Term Frequency-Inverse Document Frequency. A weighting scheme that scores a word higher the more often it appears in a document and the fewer documents in the collection contain it.
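
Several variants of the formula exist. The sketch below assumes one basic form, tf = count / document length and idf = log(N / document frequency), and the function name `tf_idf` and the toy corpus are illustrative. Libraries typically add smoothing, so their numbers will differ.

```python
import math

def tf_idf(term, document, corpus):
    """Score term within document, relative to a corpus of tokenized documents."""
    tf = document.count(term) / len(document)
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # the term does not occur anywhere in the corpus
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
rare = tf_idf("sat", corpus[0], corpus)    # appears in one document
common = tf_idf("cat", corpus[0], corpus)  # appears in two documents
print(rare > common)
```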

Named Entity Recognition (NER)

An NLP technique that automatically identifies and classifies named entities like person names, locations, and organizations from text.

Text Mining

A set of techniques for extracting useful patterns and insights from large volumes of text data using statistical and linguistic methods.

Text Summarization

The process of condensing a long text into a shorter version that preserves the key points. There are two main approaches: extractive and abstractive. An essential technique for communicating information under character limits.

BPE (Byte Pair Encoding)

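An algorithm that builds a subword vocabulary by repeatedly merging the most frequent pair of adjacent symbols (bytes or characters), then splits text using the learned merges. Widely adopted as the tokenizer in large language models such as GPT and RoBERTa; BERT uses the related WordPiece algorithm instead.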
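
A single training step of BPE can be sketched as below. This is a simplified illustration that operates on characters rather than bytes, with a made-up toy corpus and hypothetical helper names; production tokenizers add many details not shown here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Find the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of pair becomes one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word is a tuple of symbols mapped to its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
best = most_frequent_pair(corpus)
after_merge = merge_pair(corpus, best)
print(best, after_merge)
```

Repeating these two steps a fixed number of times yields the merge list that defines the vocabulary.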

Natural Language Processing (NLP)

The broad field of technology concerned with processing, understanding, and generating human language by computer.

Machine Translation

Technology that automatically translates text from one language to another using a computer.

Typography

Line Height

The vertical spacing between lines of text. Controlled by the CSS line-height property, it significantly affects readability.

Font Size

The display size of text characters. Specified in CSS using units like px, em, rem, and vw.

Whitespace

Invisible characters such as spaces, tabs, and newlines. They play important roles in text processing and layout.

Ligature

A typographic technique that combines two or more characters into a single glyph. Common examples include fi, fl, and ff.

Kerning

The technique of adjusting spacing between adjacent characters to achieve visually even spacing based on character combinations.

Speech Bubble

A graphic element enclosing character dialogue in comics and chat UIs. Character count constraints within limited space are closely tied to design.

Ruby Annotation

Small characters placed above a base character (or to its right in vertical text) to indicate its pronunciation. Implemented in HTML using the ruby and rt elements.

Character Width

The horizontal space each character occupies when text is displayed. Encompasses the distinction between full-width and half-width characters, as well as variable widths in proportional fonts.

Punctuation

A collective term for symbols used in writing to clarify sentence structure and meaning, including periods, commas, quotation marks, and brackets. Their types, usage rules, and character widths vary by language and region.

Vertical Writing

A writing direction in which text flows from top to bottom and lines progress from right to left. Traditionally used in Japanese and Chinese, it is implemented on the web using the CSS writing-mode property.

Word Wrap

The automatic process of breaking text onto the next line when it exceeds the width of the display area. Whether the break occurs mid-word or at word boundaries depends on the language and settings.

Quotation Mark

A punctuation mark used in text to indicate quotations, dialogue, emphasis, or titles. The shape varies by language and region.

Hyphens and Dashes

Horizontal line symbols used in text for joining words, indicating ranges, and setting off parenthetical phrases.

Font (Typeface)

A dataset that defines the visual design of characters. Directly affects character display width and readability.

Indentation

A formatting technique that inserts whitespace at the beginning of a line to indicate paragraph starts or hierarchical structure.

Letter Spacing (Tracking)

The distance between characters in text. Controlled by CSS letter-spacing, it affects readability and design impression.

Data Formats

Security

Accessibility