Glossary

Character Count

The total number of characters in a text, including or excluding spaces depending on context.

Byte Count

The size of text data in bytes after encoding. The same character can have different byte sizes depending on the encoding.
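
As a rough illustration of how byte size varies with encoding, the sketch below encodes the same characters in several encodings and counts the resulting bytes. The helper name byte_count is hypothetical and used only for this example.

```python
def byte_count(text: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


# Compare the same characters across three encodings.
for ch in ["A", "\u3042"]:  # "A" and the hiragana letter A
    sizes = {enc: byte_count(ch, enc) for enc in ("utf-8", "utf-16-le", "shift_jis")}
    print(ch, sizes)
```

Under these encodings, the hiragana character takes 3 bytes in UTF-8 and 2 bytes in both UTF-16 and Shift_JIS, while "A" takes 1 byte in UTF-8 and Shift_JIS and 2 bytes in UTF-16.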

Word Count

The number of words in a text. In English, words are typically separated by spaces.

Reading Time

The estimated time required to read a text, calculated from word or character count.
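
A minimal sketch of a word-based estimate follows. The default of 200 words per minute is an assumed reading speed chosen for illustration, not a figure from this glossary, and the function name is hypothetical.

```python
import math


def reading_time_minutes(text: str, words_per_minute: int = 200) -> int:
    """Estimate reading time in whole minutes, rounding up.

    words_per_minute is an assumed average reading speed.
    """
    word_count = len(text.split())  # whitespace-separated words
    return math.ceil(word_count / words_per_minute)
```

For languages written without spaces, such as Japanese, a character-based rate would likely be a better basis than a word-based one.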

Paragraph Count

The number of paragraphs in a text. Used as a metric for text structure and readability.

Sentence Count

The number of sentences in a text. Counted by sentence-ending punctuation like periods, question marks, and exclamation marks.
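
One naive way to count by punctuation can be sketched with a regular expression. This is an approximation: abbreviations such as "e.g." are overcounted, and the function name is hypothetical.

```python
import re

# One or more sentence-ending marks, followed by whitespace or end of text.
_SENTENCE_END = re.compile(r"[.!?]+(?=\s|$)")


def count_sentences(text: str) -> int:
    """Count runs of ., !, or ? that are followed by whitespace or the end."""
    return len(_SENTENCE_END.findall(text))
```

Requiring whitespace or end of text after the punctuation keeps decimal numbers like 3.14 from being counted as sentence breaks.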

Readability Score

A numerical metric quantifying text readability. Flesch Reading Ease and Flesch-Kincaid Grade Level are representative examples.
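
The two scores named above are computed from word, sentence, and syllable counts. The sketch below applies the published formulas to counts supplied by the caller; syllable counting itself is left out, and the function names are hypothetical.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

Both formulas were designed for English, so scores for other languages should be treated with caution.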

Unicode

A universal character encoding standard that defines over 140,000 characters, covering most of the world's modern and historic writing systems.

UTF-8

A variable-length Unicode encoding. The dominant character encoding on the web, used by over 98% of websites.

Shift_JIS

A Japanese character encoding widely used in legacy systems. Being gradually replaced by UTF-8.

ASCII

A 7-bit character encoding standard representing 128 characters including English letters, digits, and basic symbols.

UTF-16

A Unicode encoding that uses 16-bit code units. Used internally by JavaScript, Java, and Windows.

EUC-JP

A Japanese character encoding widely used on UNIX systems. Part of the Extended Unix Code family.

ISO-2022-JP

A Japanese encoding designed for email. Uses escape sequences to switch between character sets.

BOM (Byte Order Mark)

A byte sequence at the start of a file that identifies the encoding and byte order. EF BB BF for UTF-8, FF FE for UTF-16 little-endian, and FE FF for UTF-16 big-endian.
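
A BOM check can be sketched with the constants in Python's standard codecs module. The function name detect_bom is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs
from typing import Optional


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none.

    Only UTF-8 and UTF-16 are checked; UTF-32 is out of scope for this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return None
```

When reading UTF-8 files that may carry a BOM, Python's "utf-8-sig" codec strips it automatically during decoding.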

Code Point

A unique number assigned to each character in Unicode. Written as U+ followed by hexadecimal digits, e.g., U+0041 (A).
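
The U+ notation can be reproduced with Python's built-in ord, as in this small sketch. The helper name is hypothetical.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    # At least four hexadecimal digits, uppercase, as is conventional.
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))  # U+0041
```

The built-in chr performs the reverse mapping, from a code point number to a character.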

Surrogate Pair

A mechanism in UTF-16 to represent characters outside the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) using two 16-bit code units.
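
The arithmetic behind the two code units can be sketched as follows: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The function name is hypothetical.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```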

Combining Character

A Unicode character that combines with the preceding base character for display. Includes diacritical marks and dakuten.

Endianness

The byte order of multi-byte data. Two types exist: big-endian and little-endian.

Character Set

A defined collection of characters and their numbering system. ASCII, ISO 8859, and Unicode are representative examples.

Full-Width Character

A character that occupies twice the width of a half-width character in fixed-width fonts. Common in CJK text.

Half-Width Character

A character that occupies half the width of a full-width character in fixed-width fonts. ASCII characters are half-width.
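
Python's unicodedata module exposes the East Asian Width property, which gives one way to estimate how many fixed-width columns a string occupies. In this sketch, characters classified as Fullwidth (F) or Wide (W) are counted as two columns and everything else as one. That is a simplification: combining characters and ambiguous-width characters are not handled specially. The function name is hypothetical.

```python
import unicodedata


def display_width(text: str) -> int:
    """Estimate column width: 2 for full-width or wide characters, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in text
    )
```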

Hiragana

One of the Japanese phonetic writing systems. Used for native words, particles, and verb endings.

Katakana

One of the Japanese phonetic writing systems. Used for loanwords, onomatopoeia, and scientific terms.

Kanji

Logographic characters originating from China. Japan uses 2,136 jōyō kanji for everyday communication.

Grapheme Cluster

The smallest visual unit that a human perceives as a single character. May consist of multiple code points.

Emoji

Pictographic symbols encoded in Unicode. Used to visually express emotions and concepts in text communication.

Romaji

The romanization of Japanese using Latin alphabet characters. Hepburn and Kunrei-shiki are the main systems.

Zero-Width Space

An invisible character with zero display width (U+200B). Used as a line break hint and for text processing control.

Diacritical Mark

An auxiliary symbol added above, below, or through a character. Indicates a difference in pronunciation, as with accents and umlauts.

Ideograph

A character that represents a meaning rather than a sound. Chinese characters (kanji) are the prime example, encoded as CJK Unified Ideographs in Unicode.

Token

The smallest unit of text processing. LLMs use their own tokenization schemes that differ from characters or words.

Truncation

The process of cutting text to a specified length. Used to fit display areas or database column limits.
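
When the limit is expressed in bytes, as with some database columns, cutting at an arbitrary byte offset can split a multi-byte character. The sketch below assumes UTF-8 and drops any incomplete trailing character. The function name is hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")[:max_bytes]
    # errors="ignore" discards a partial multi-byte sequence left at the end.
    return encoded.decode("utf-8", errors="ignore")
```

This works at the code point level only; a grapheme cluster made of several code points can still be cut in the middle.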

Line Break

The process of wrapping text to the next line. Controlled in CSS by word-break and overflow-wrap properties.

Newline Code

Control characters representing line breaks. Three types exist: LF (Unix, Linux, and modern macOS), CR (classic Mac OS), and CRLF (Windows).
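
Text that mixes these conventions is often normalized to a single one before counting lines. A minimal sketch, with a hypothetical function name:

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    # Replace CRLF first so its CR is not converted separately.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```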

Unicode Normalization

The process of unifying different representations of the same character. Four forms exist: NFC, NFD, NFKC, and NFKD.
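
The effect of the forms can be seen with Python's standard unicodedata.normalize. In this sketch, NFC composes a base letter and combining accent into one code point, NFD splits them apart again, and NFKC additionally folds compatibility characters such as the "fi" ligature.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1

# NFD reverses the composition.
assert unicodedata.normalize("NFD", composed) == decomposed

# NFKC also replaces compatibility characters with their plain equivalents.
ligature = "\ufb01"  # the single-character "fi" ligature
plain = unicodedata.normalize("NFKC", ligature)
print(plain)  # fi
```

Normalizing both sides before comparing strings avoids mismatches between visually identical text.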

Trim

The process of removing whitespace from the beginning and end of a string. Provided as a standard method in most programming languages.

Escape Sequence

A string used to represent special characters. A backslash followed by a character represents newlines, tabs, and other control characters.

String Concatenation

The process of joining multiple strings into one. Achieved using the + operator, template literals, or dedicated methods.

Substring

The process of extracting a portion of a string. Achieved using methods like slice or substring; JavaScript's substr is a legacy method and is best avoided in new code.

String Interpolation

Embedding variable or expression values within a string using template literals or similar syntax.

Padding

Filling a string with specific characters to reach a desired length. Implemented with padStart and padEnd methods.

Base64

An encoding scheme that converts binary data to ASCII strings using 64 characters: A-Z, a-z, 0-9, +, and /. The = character is used as padding at the end.
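
A short sketch using Python's standard base64 module shows the encoding in practice. Every 3 input bytes become 4 output characters, and shorter final groups are padded with "=".

```python
import base64

# Three bytes map exactly onto four Base64 characters.
print(base64.b64encode(b"Man"))  # b'TWFu'

# One or two trailing bytes are padded with "=" to keep groups of four.
print(base64.b64encode(b"Ma"))   # b'TWE='
print(base64.b64encode(b"M"))    # b'TQ=='

# Decoding restores the original bytes.
restored = base64.b64decode(b"TWFu")
```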

Percent-Encoding

An encoding scheme that represents special characters in URLs using %XX hexadecimal format. Also known as URL encoding.
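
Python's urllib.parse module provides quote and unquote for this. The sketch below encodes a string containing reserved characters and a non-ASCII character, which is first converted to UTF-8 bytes.

```python
from urllib.parse import quote, unquote

# safe="" makes quote() encode every reserved character, including "/".
encoded_query = quote("a b&c", safe="")
print(encoded_query)  # a%20b%26c

# Non-ASCII characters are encoded as their UTF-8 bytes.
encoded_kana = quote("\u3042")
print(encoded_kana)  # %E3%81%82

decoded = unquote(encoded_kana)
```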

Diff

The process of detecting and displaying differences between two texts. Foundation technology for version control and code review.

Text Compression

Technology for reducing text data size. Algorithms like gzip, Brotli, and deflate are commonly used.

Levenshtein Distance

The edit distance between two strings. The minimum number of insertions, deletions, and substitutions needed to transform one string into another.
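
The definition above translates into a standard dynamic-programming computation. This sketch keeps only the previous row of the table, so memory use is proportional to the length of the second string. The function name is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```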

Fuzzy Matching

A search technique that finds similar strings rather than exact matches. Handles typos and spelling variations.

Character Limit

The maximum number of characters allowed for text input on a platform or system. Applied in social media, ads, and forms.

Meta Description

The HTML meta description tag. A page summary shown in search results, typically 150-160 characters.

Title Tag

The HTML title element. Displayed in search results and browser tabs, with 50-60 characters recommended.

Alt Text (alt attribute)

Alternative text for images. Important for accessibility and SEO, displayed when images cannot be loaded.

Slug (URL Slug)

A human-readable identifier used in the path portion of a URL. Affects SEO and usability.

Open Graph

A meta tag protocol that controls how links appear when shared on social media. Created by Facebook.

X (Twitter) Character Limit

X (formerly Twitter) posts are limited to 280 characters. CJK characters count as 2 characters each.

Instagram Caption Limit

Instagram captions allow up to 2,200 characters. Up to 30 hashtags can be used per post.

SMS Character Limit

SMS messages are limited to 160 characters (GSM 7-bit) or 70 characters (Unicode/UCS-2). Longer messages are split.
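
The splitting can be sketched as below. Two assumptions go beyond the entry above: concatenated messages are taken to carry 153 characters per segment in GSM 7-bit and 67 in UCS-2, the commonly used sizes once a segment header is added, and GSM 7-bit text is assumed to use only characters that occupy a single unit. The function name and its use_ucs2 flag are hypothetical.

```python
import math


def sms_segments(text: str, use_ucs2: bool) -> int:
    """Estimate how many SMS segments a message needs."""
    if use_ucs2:
        # UCS-2 limits apply to 16-bit units, so count UTF-16 code units.
        units = len(text.encode("utf-16-be")) // 2
        single_limit, segment_size = 70, 67
    else:
        # Assumes every character fits in one GSM 7-bit unit.
        units = len(text)
        single_limit, segment_size = 160, 153
    if units <= single_limit:
        return 1
    return math.ceil(units / segment_size)
```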

Locale

A combination of language, region, and formatting settings, identified by codes such as ja-JP and en-US.

ICU (International Components for Unicode)

A Unicode internationalization library providing string collation, conversion, formatting, and multilingual processing.

Bidirectional Text (BiDi)

Handling of mixed left-to-right (LTR) and right-to-left (RTL) text, needed for Arabic and Hebrew in multilingual content.

CJK (Chinese, Japanese, Korean)

A collective term for the Chinese, Japanese, and Korean languages and their scripts. Their shared Han characters are unified in Unicode as CJK Unified Ideographs.

Input Method (IME)

Software that enables typing characters not directly available on a keyboard, such as Japanese and Chinese characters.

Collation

Rules for comparing and sorting strings. Defines sort order that varies by language and culture.

Transliteration

The process of converting text from one writing system to another while preserving phonetics.

Regular Expression Pattern

A pattern language for searching and replacing text. Combines special and literal characters to define string patterns.

Regex Quantifier

Metacharacters like *, +, ?, {n,m} that specify repetition counts. They control how many times the preceding element appears.

Regex Character Class

Syntax for specifying character sets like [a-z], \d, \w. Defines the range of characters to match.

Regex Group

A part of a pattern enclosed in (). Captures the matched substring so it can be extracted or reused, for example through backreferences.

Regex Lookahead

A regex syntax using (?=...) and (?!...) to match based on what follows without consuming characters.

Regex Backreference

A feature that reuses text matched by a capture group within the same pattern. Referenced using \1, \2, etc.
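
The group, lookahead, and backreference syntax described in the regex entries above can be tried with Python's re module. The sample strings are invented for illustration.

```python
import re

# Capture groups: extract the year and month separately.
date_match = re.search(r"(\d{4})-(\d{2})", "released 2024-05")
year, month = date_match.groups()

# Lookahead: match digits only when " yen" follows, without consuming it.
yen_amounts = re.findall(r"\d+(?= yen)", "100 yen, 200 dollars, 300 yen")

# Backreference: \1 must repeat whatever the first group matched.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")

print(year, month)        # 2024 05
print(yen_amounts)        # ['100', '300']
print(repeated.group(0))  # is is
```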

Morphological Analysis

The process of segmenting text into minimal meaningful units (morphemes) and assigning grammatical information.

Tokenization

The process of splitting text into tokens (words, subwords, or other processing units).

Stopword

Frequently occurring words excluded from search and text analysis, such as "a," "the," "is," and "in."

N-gram

A method of splitting text into contiguous subsequences of N characters or words, used in search and text similarity.
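
A sliding window over a sequence is enough to produce N-grams. The sketch below works on a string for character N-grams and on a list of words for word N-grams. The function name is hypothetical.

```python
def ngrams(items, n: int) -> list:
    """Return every contiguous run of n elements from a string or list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [items[i:i + n] for i in range(len(items) - n + 1)]


print(ngrams("text", 2))  # ['te', 'ex', 'xt']
```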

Sentiment Analysis

The process of determining emotional polarity (positive, negative, neutral) from text.

TF-IDF

Term Frequency-Inverse Document Frequency. A method for quantifying word importance within documents.
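
Several TF-IDF variants exist. The sketch below assumes one common form: term frequency as the term's share of the document, and inverse document frequency as the natural logarithm of the corpus size divided by the number of documents containing the term. The function name and the tiny corpus are hypothetical.

```python
import math
from typing import List


def tf_idf(term: str, document: List[str], corpus: List[List[str]]) -> float:
    """Score a term in one tokenized document against a tokenized corpus."""
    if not document:
        return 0.0
    term_frequency = document.count(term) / len(document)
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0
    inverse_document_frequency = math.log(len(corpus) / containing)
    return term_frequency * inverse_document_frequency


corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
```

In this corpus "the" appears in every document, so its score is zero, while "sat" appears in only one and scores highest there.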

Named Entity Recognition (NER)

An NLP technique that automatically identifies and classifies named entities like person names, locations, and organizations from text.

Line Height

The vertical spacing between lines of text. Controlled by the CSS line-height property, it significantly affects readability.

Whitespace

Invisible characters such as spaces, tabs, and newlines. They play important roles in text processing and layout.

Ligature

A typographic technique that combines two or more characters into a single glyph. Common examples include fi, fl, and ff.

Kerning

The technique of adjusting spacing between adjacent characters to achieve visually even spacing based on character combinations.

JSON

JavaScript Object Notation, a lightweight data interchange format that is easy for both humans and machines to read.

CSV

Comma-Separated Values, a text format that represents data with comma delimiters. Widely used for exchanging tabular data.

XML

Extensible Markup Language, a markup language that describes data structure using tags.

YAML

YAML Ain't Markup Language, an indentation-based human-readable data serialization format.

Markdown

A lightweight markup language that adds formatting to plain text using simple syntax, convertible to HTML.

HTML Entity

Character references for representing special characters in HTML. Starts with & and ends with ;.
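
Python's standard html module converts between special characters and their entities, as in this sketch.

```python
import html

# Replace characters that have special meaning in HTML with entities.
escaped = html.escape("<b>Tom & Jerry</b>")
print(escaped)  # &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;

# Convert entities back to the characters they represent.
unescaped = html.unescape("&lt;p&gt; &amp; &copy;")
```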

MIME Type

A standard classification system for identifying file and data types. Expressed in type/subtype format.

Hash Value

A fixed-length value generated from arbitrary-length data using a hash function. Used for data integrity verification and tamper detection.

Checksum

A value computed for error detection in data. Used to verify data integrity during transfer and storage.
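
CRC-32 is one widely used checksum and is available in Python's zlib module. The sketch below computes it for a byte string and shows that a one-byte change produces a different value.

```python
import zlib

original = b"123456789"
corrupted = b"123456780"

checksum = zlib.crc32(original)
print(f"{checksum:08X}")  # CBF43926

# A mismatch signals that the data changed in transit or storage.
is_intact = zlib.crc32(corrupted) == checksum
```

A checksum like CRC-32 detects accidental errors; it is not designed to resist deliberate tampering.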

Encryption

The process of converting data into an unreadable format. Only those with the decryption key can restore the original data.

Plain Text

Unencrypted text data that is directly readable by humans.

Sanitization

The process of removing or neutralizing harmful code and invalid characters from user input. A fundamental defense against XSS and SQL injection.

Screen Reader

Assistive technology that reads aloud text and UI elements on screen. Supports web access for visually impaired users.

ARIA Label

An attribute defined in the WAI-ARIA specification that provides an accessible name to UI elements. Specifies text read by screen readers.

Contrast Ratio

A numerical ratio of the relative luminance of the lighter color to that of the darker color, comparing foreground and background. WCAG Level AA requires at least 4.5:1 for normal text and 3:1 for large text.
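
The ratio can be computed from sRGB color values using the relative luminance formula published in WCAG 2.x. The sketch below takes colors as (R, G, B) tuples with channels from 0 to 255. The function names are hypothetical.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: RGB) -> float:
    """Relative luminance as defined by WCAG 2.x (0 is black, 1 is white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: RGB, background: RGB) -> float:
    """Contrast ratio between two colors, from 1 (identical) to 21."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)
```

Black on white yields the maximum ratio of 21:1, and identical colors yield 1:1.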

Semantic HTML

Using HTML elements that clearly convey content meaning and structure. Properly using elements like header, nav, main, article, and section.

Focus Indicator

A visual display showing which element currently has keyboard focus. Typically shown as an outline or highlight.

Text-to-Speech (TTS)

Technology that converts text data into speech. Foundation technology for screen readers and voice assistants.

Text Measurement