Glossary

Text Measurement

Character Encoding

Unicode

A universal character encoding standard covering more than 150,000 characters (154,998 as of Unicode 16.0) from most of the world's modern and historic writing systems.

UTF-8

A variable-length Unicode encoding that uses one to four bytes per code point and is backward compatible with ASCII. The dominant character encoding on the web, used by over 98% of websites.
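
To make "variable-length" concrete, here is a minimal Python sketch that reports how many UTF-8 bytes each character of a string needs. The helper name `utf8_byte_lengths` is made up for this illustration.

```python
def utf8_byte_lengths(text):
    """Return (character, UTF-8 byte count) pairs for each code point in text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# "A", "é" (U+00E9), "あ" (U+3042), and an emoji (U+1F600)
sample = "A\u00e9\u3042\U0001F600"
for ch, size in utf8_byte_lengths(sample):
    print(f"U+{ord(ch):04X} takes {size} byte(s)")
```

The four sample characters fall into the 1-, 2-, 3-, and 4-byte ranges respectively, which is why byte length and character count diverge for non-ASCII text.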

Shift_JIS

A Japanese character encoding widely used in legacy systems. Being gradually replaced by UTF-8.

ASCII

A 7-bit character encoding standard representing 128 characters: English letters, digits, punctuation, and control characters.

UTF-16

A Unicode encoding that uses one or two 16-bit code units per code point. Used internally by JavaScript, Java, and Windows.

EUC-JP

A Japanese character encoding widely used on UNIX systems. Part of the Extended Unix Code family.

ISO-2022-JP

A Japanese encoding designed for email. Uses escape sequences to switch between character sets.
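
The three legacy Japanese encodings in this glossary (Shift_JIS, EUC-JP, ISO-2022-JP) can be compared side by side. The sketch below assumes Python's standard codecs `shift_jis`, `euc_jp`, and `iso2022_jp`; the helper name `encode_all` is hypothetical.

```python
def encode_all(text):
    """Encode text with three legacy Japanese codecs from Python's standard library."""
    return {name: text.encode(name) for name in ("shift_jis", "euc_jp", "iso2022_jp")}


sample = "A\u3042"  # "A" followed by HIRAGANA LETTER A
encoded = encode_all(sample)
for name, data in encoded.items():
    # bytes.hex(" ") prints the bytes as space-separated hex pairs
    print(f"{name:12} {data.hex(' ')}")
```

In the ISO-2022-JP output, the byte 1B (ESC) marks the escape sequences that switch into the Japanese character set and back to ASCII, while the other two encodings mark Japanese characters with high-bit bytes instead.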

BOM (Byte Order Mark)

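A byte sequence at the start of a file that identifies the encoding and, for UTF-16, the byte order. It is the encoded form of U+FEFF: EF BB BF for UTF-8, FF FE for little-endian UTF-16, and FE FF for big-endian UTF-16.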
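
A BOM check can be sketched with the constants in Python's `codecs` module. The function name `detect_bom` is made up for this example, and it only covers the UTF-8 and UTF-16 marks mentioned in this entry.

```python
import codecs

# (BOM bytes, encoding label) pairs for the marks described in this entry
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def detect_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none.

    Caveat: a UTF-32 little-endian BOM (FF FE 00 00) also starts with FF FE,
    so a fuller detector would test UTF-32 first.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))
print(detect_bom(b"hello"))
```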

Code Point

A unique number assigned to each character in Unicode. Written as U+ followed by hexadecimal digits, e.g., U+0041 (A).
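
As a small illustration of the U+ notation, Python's built-in `ord` and `chr` convert between a character and its code point. The helper name `code_point_label` is hypothetical.

```python
def code_point_label(ch):
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))           # U+0041
print(code_point_label("\U0001F600"))  # U+1F600
print(chr(0x0041))                     # A
```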

Surrogate Pair

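A mechanism in UTF-16 for representing characters outside the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) using two 16-bit code units: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF).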
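
The arithmetic behind a surrogate pair can be sketched as follows: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The function name `to_surrogate_pair` is made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

The result can be cross-checked against Python's own encoder: `"\U0001F600".encode("utf-16-be")` yields the same four bytes.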

Combining Character

A Unicode character that combines with the preceding base character for display. Includes diacritical marks and dakuten.
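
One rough way to spot combining characters is Python's `unicodedata.combining`, which returns a character's canonical combining class (non-zero for most combining marks). The helper `split_marks` is hypothetical, and the check is an approximation: some combining marks have class 0.

```python
import unicodedata


def split_marks(text):
    """Separate base characters from combining marks (combining class != 0)."""
    base = [ch for ch in text if unicodedata.combining(ch) == 0]
    marks = [ch for ch in text if unicodedata.combining(ch) != 0]
    return base, marks


# "e" + COMBINING ACUTE ACCENT, then "か" + COMBINING DAKUTEN (U+3099)
sample = "e\u0301\u304b\u3099"
base, marks = split_marks(sample)
print(len(sample), "code points,", len(base), "base characters")
```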

Endianness

The byte order of multi-byte data. Two types exist: big-endian and little-endian.
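
The difference between the two byte orders is easy to see with `int.to_bytes`, shown here for the 16-bit value 0x1234 and for the letter "A" in UTF-16. This is only an illustrative sketch.

```python
value = 0x1234

big = value.to_bytes(2, byteorder="big")        # most significant byte first
little = value.to_bytes(2, byteorder="little")  # least significant byte first

print(big.hex(" "))     # 12 34
print(little.hex(" "))  # 34 12

# The same distinction applies to UTF-16 code units
print("A".encode("utf-16-be").hex(" "))  # 00 41
print("A".encode("utf-16-le").hex(" "))  # 41 00
```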

Character Set

A defined collection of characters and their numbering system. ASCII, ISO 8859, and Unicode are representative examples.

Character Types

Text Processing

Token

The smallest unit of text that a processing step operates on. LLM tokenizers typically split text into subword units, so token counts differ from both character counts and word counts.

Truncation

The process of cutting text to a specified length. Used to fit display areas or database column limits.
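
The two use cases call for different limits: display truncation usually counts characters, while a database column may count bytes. The sketch below shows one possible approach to each; both function names are made up, and the character version counts code points rather than user-perceived characters.

```python
def truncate(text, max_length, ellipsis="..."):
    """Cut text to at most max_length code points, ending with an ellipsis."""
    if max_length < len(ellipsis):
        raise ValueError("max_length is shorter than the ellipsis")
    if len(text) <= max_length:
        return text
    return text[: max_length - len(ellipsis)] + ellipsis


def truncate_utf8_bytes(text, max_bytes):
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops an incomplete multi-byte sequence left at the end
    return clipped.decode("utf-8", errors="ignore")


print(truncate("Hello, world", 8))                   # Hello...
print(truncate_utf8_bytes("\u3042\u3044\u3046", 4))  # only the first character fits
```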

Line Break

The process of wrapping text to the next line. Controlled in CSS by word-break and overflow-wrap properties.

Newline Code

Control characters representing line breaks. Three types exist: LF (Unix, Linux, and modern macOS), CR (classic Mac OS), and CRLF (Windows).
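
Text that mixes the three conventions is often normalized to LF before further processing. A minimal sketch, with the hypothetical name `normalize_newlines`:

```python
def normalize_newlines(text):
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so its CR is not converted on its own
    return text.replace("\r\n", "\n").replace("\r", "\n")


mixed = "line1\r\nline2\rline3\nline4"
print(normalize_newlines(mixed).split("\n"))
```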

Unicode Normalization

The process of converting equivalent code point sequences for the same character into a single consistent form. Four forms exist: NFC, NFD, NFKC, and NFKD.
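
The four forms can be compared with Python's `unicodedata.normalize`. The helper name `normalization_forms` is made up for this sketch.

```python
import unicodedata


def normalization_forms(text):
    """Return text in each of the four Unicode normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


# Precomposed "é" (U+00E9): NFD splits it into "e" + combining acute accent
for form, value in normalization_forms("\u00e9").items():
    print(form, [f"U+{ord(ch):04X}" for ch in value])

# Halfwidth katakana "ｶ" + halfwidth dakuten: only the K forms change it
for form, value in normalization_forms("\uff76\uff9e").items():
    print(form, [f"U+{ord(ch):04X}" for ch in value])
```

Comparing strings after normalizing both to the same form avoids mismatches between visually identical text.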

Trim

The process of removing whitespace from the beginning and end of a string. Provided as a standard method in most programming languages.

Escape Sequence

A sequence of characters used to represent a special character that is hard to type or display directly. In many programming languages, a backslash followed by a character (e.g., \n for newline, \t for tab) represents a control character.

String Concatenation

The process of joining multiple strings into one. Achieved using the + operator, template literals, or dedicated methods.

Substring

The process of extracting a portion of a string. In JavaScript, achieved using methods like slice or substring; the older substr method is deprecated.

String Interpolation

Embedding variable or expression values within a string using template literals or similar syntax.

Padding

Filling a string with specific characters to reach a desired length. In JavaScript, implemented with the padStart and padEnd methods.

Base64

An encoding scheme that converts binary data to ASCII strings using 64 characters (A-Z, a-z, 0-9, +, and /), with = used as padding at the end.
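
A short round trip with Python's `base64` module shows the alphabet and the trailing padding. The wrapper names `to_base64` and `from_base64` are hypothetical, and UTF-8 is assumed for the text-to-bytes step.

```python
import base64


def to_base64(text):
    """Encode a string as Base64 text, going through UTF-8 bytes."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_base64(encoded):
    """Decode Base64 text back to the original string."""
    return base64.b64decode(encoded).decode("utf-8")


print(to_base64("Man"))  # TWFu  (3 bytes map to 4 characters, no padding)
print(to_base64("Hi"))   # SGk=  (2 bytes need one padding character)
print(from_base64("SGk="))
```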

Percent-Encoding

An encoding scheme that represents special characters in URLs using %XX hexadecimal format. Also known as URL encoding.
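
In Python, `urllib.parse.quote` and `unquote` perform this conversion. The sketch below passes `safe=""` so that every reserved character is encoded, which suits a single query value; the wrapper names are made up.

```python
from urllib.parse import quote, unquote


def encode_component(value):
    """Percent-encode a string for use as one URL component."""
    return quote(value, safe="")


def decode_component(value):
    """Reverse percent-encoding."""
    return unquote(value)


print(encode_component("a b&c"))   # a%20b%26c
print(encode_component("\u3042"))  # %E3%81%82 (the three UTF-8 bytes of "あ")
print(decode_component("a%20b%26c"))
```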

Diff

The process of detecting and displaying differences between two texts. Foundation technology for version control and code review.
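
As one illustration, Python's `difflib` can produce a unified diff of the kind version control tools display. The function name `unified` and the file labels "old" and "new" are placeholders for this sketch.

```python
import difflib


def unified(old, new):
    """Return unified-diff lines describing how old differs from new."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


for line in unified("alpha\nbeta\ngamma", "alpha\nBETA\ngamma"):
    print(line)
```

Lines starting with `-` exist only in the old text, lines starting with `+` only in the new text, and lines starting with a space are unchanged context.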

Text Compression

Technology for reducing text data size. Formats and algorithms such as gzip (which is built on Deflate), Brotli, and Deflate itself are commonly used.
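
A quick sketch of the effect on repetitive text, using Python's standard `zlib` (a Deflate stream in a zlib wrapper) and `gzip` modules. Brotli is omitted because it is not in the standard library. The sample sentence is arbitrary.

```python
import gzip
import zlib

# Repetitive text compresses well; real-world ratios depend on the content
original = ("The quick brown fox jumps over the lazy dog. " * 50).encode("utf-8")

deflated = zlib.compress(original, level=9)
gzipped = gzip.compress(original)

print(len(original), "bytes before compression")
print(len(deflated), "bytes with zlib")
print(len(gzipped), "bytes with gzip")

# Both formats are lossless: decompression restores the exact input
restored = zlib.decompress(deflated)
```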

Levenshtein Distance

The edit distance between two strings: the minimum number of single-character insertions, deletions, and substitutions needed to transform one string into the other.
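
The definition maps directly onto a dynamic-programming table. The sketch below is one common formulation that keeps only two rows of the table in memory; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a, b):
    """Return the Levenshtein distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```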

Fuzzy Matching

A search technique that finds similar strings rather than exact matches. Handles typos and spelling variations.
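
As a small illustration, Python's `difflib.get_close_matches` ranks candidates by a similarity ratio. The word list and the `suggest` wrapper are invented for this sketch, and the 0.6 cutoff is simply the library default.

```python
import difflib

# Hypothetical vocabulary to search
TERMS = ["encoding", "decoding", "endian", "padding"]


def suggest(query, candidates=TERMS, limit=1):
    """Return up to limit candidates that most closely resemble query."""
    return difflib.get_close_matches(query, candidates, n=limit, cutoff=0.6)


print(suggest("encodng"))  # a typo that still finds "encoding"
print(suggest("zzz"))      # nothing is similar enough
```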

Platform Limits

Internationalization

Regular Expressions

Natural Language Processing

Typography

Data Formats

Security

Accessibility