Glossary
Text Measurement
Character Count
The total number of characters in a text, including or excluding spaces depending on context.
Byte Count
The size of text data in bytes after encoding. The same character can have different byte sizes depending on the encoding.
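As a minimal sketch of how byte count depends on encoding, the helper below (the name byte_count is illustrative, not part of any library) encodes the same string with several codecs and measures the result.

```python
def byte_count(text: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


sample = "aあ"
print(byte_count(sample, "utf-8"))      # 'a' is 1 byte, 'あ' is 3 bytes
print(byte_count(sample, "shift_jis"))  # 'a' is 1 byte, 'あ' is 2 bytes
print(byte_count(sample, "utf-16-le"))  # each BMP character is 2 bytes
```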
Word Count
The number of words in a text. In English, words are typically separated by spaces.
Reading Time
The estimated time required to read a text, calculated from word or character count.
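A rough sketch of a word-based estimate follows. The 200 words-per-minute default is an assumption chosen for illustration, not a value from this glossary, and whitespace splitting only suits space-delimited languages such as English.

```python
import math


def reading_time_minutes(text: str, words_per_minute: int = 200) -> int:
    """Estimate reading time in whole minutes, rounded up.

    `words_per_minute` is an assumed reading speed; tune it per audience.
    """
    word_count = len(text.split())
    return math.ceil(word_count / words_per_minute)
```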
Paragraph Count
The number of paragraphs in a text. Used as a metric for text structure and readability.
Sentence Count
The number of sentences in a text. Counted by sentence-ending punctuation like periods, question marks, and exclamation marks.
Readability Score
A numerical metric quantifying text readability. Flesch Reading Ease and Flesch-Kincaid Grade Level are representative examples.
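The two Flesch formulas can be sketched as below. The functions take precomputed counts, because syllable counting is a separate problem not covered here; the function names are illustrative.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

For example, 100 words in 5 sentences with 150 syllables gives a Reading Ease of about 59.6 and a grade level of about 9.9.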
Character Encoding
Unicode
A universal character encoding standard that covers over 140,000 characters across modern and historic writing systems.
UTF-8
A variable-length Unicode encoding. The dominant character encoding on the web, used by over 98% of websites.
Shift_JIS
A Japanese character encoding widely used in legacy systems. Being gradually replaced by UTF-8.
ASCII
A 7-bit character encoding standard representing 128 characters including English letters, digits, and basic symbols.
UTF-16
A Unicode encoding that uses 16-bit code units. Used internally by JavaScript, Java, and Windows.
EUC-JP
A Japanese character encoding widely used on UNIX systems. Part of the Extended Unix Code family.
ISO-2022-JP
A Japanese encoding designed for email. Uses escape sequences to switch between character sets.
BOM (Byte Order Mark)
A byte sequence at the start of a file that identifies the encoding and byte order. EF BB BF for UTF-8, FF FE (little-endian) or FE FF (big-endian) for UTF-16.
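A minimal sketch of BOM detection using the constants in Python's codecs module. The function name detect_bom is illustrative. It deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs
from typing import Optional


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if absent.

    Simplification: UTF-32 BOMs are not checked, so UTF-32-LE data
    would be misreported as UTF-16-LE.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None
```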
Code Point
A unique number assigned to each character in Unicode. Written as U+ followed by hexadecimal digits, e.g., U+0041 (A).
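As a small illustration, Python's built-in ord and chr convert between a character and its code point. The formatting helper below is a hypothetical name.

```python
def format_code_point(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(format_code_point("A"))   # U+0041
print(format_code_point("😀"))  # U+1F600
print(chr(0x3042))              # あ
```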
Surrogate Pair
A mechanism in UTF-16 to represent characters outside the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) using two 16-bit code units.
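The arithmetic behind a surrogate pair can be sketched as follows. The function name is illustrative; the constants 0xD800 and 0xDC00 are the start of the high and low surrogate ranges.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)  # 😀
print(f"{high:04X} {low:04X}")          # D83D DE00
```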
Combining Character
A Unicode character that combines with the preceding base character for display. Includes diacritical marks and dakuten.
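One way to see combining characters in practice is to decompose a string and drop them. The sketch below uses the standard unicodedata module; the function name strip_combining_marks is illustrative.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Decompose text (NFD) and remove characters with a nonzero combining class."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


print(strip_combining_marks("café"))  # cafe
print(strip_combining_marks("が"))    # か (the combining dakuten is removed)
```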
Endianness
The byte order of multi-byte data. Two types exist: big-endian and little-endian.
Character Set
A defined collection of characters and their numbering system. ASCII, ISO 8859, and Unicode are representative examples.
Character Types
Full-Width Character
A character that occupies twice the width of a half-width character in fixed-width fonts. Common in CJK text.
Half-Width Character
A character that occupies half the width of a full-width character in fixed-width fonts. ASCII characters are half-width.
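A rough sketch of estimating display width in a fixed-width font, based on the East Asian Width property exposed by unicodedata. Treating only the "W" (wide) and "F" (fullwidth) classes as two columns is a simplifying assumption; ambiguous-width characters are counted as one column here.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate column width: 2 for wide/fullwidth characters, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


print(display_width("abc"))     # 3
print(display_width("日本語"))  # 6
```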
Hiragana
One of the Japanese phonetic writing systems. Used for native words, particles, and verb endings.
Katakana
One of the Japanese phonetic writing systems. Used for loanwords, onomatopoeia, and scientific terms.
Kanji
Logographic characters originating from China. Japan uses 2,136 jōyō kanji for everyday communication.
Grapheme Cluster
The smallest visual unit that a human perceives as a single character. May consist of multiple code points.
Emoji
Pictographic symbols encoded in Unicode. Used to visually express emotions and concepts in text communication.
Romaji
The romanization of Japanese using Latin alphabet characters. Hepburn and Kunrei-shiki are the main systems.
Zero-Width Space
An invisible character with zero display width (U+200B). Used as a line break hint and for text processing control.
Diacritical Mark
Auxiliary symbols added above or below characters. Indicates pronunciation differences such as accents and umlauts.
Ideograph
A character that itself carries meaning rather than representing a sound. Chinese characters (kanji) are the prime example, encoded as CJK Unified Ideographs in Unicode.
Text Processing
Token
The smallest unit of text processing. LLMs use their own tokenization schemes that differ from characters or words.
Truncation
The process of cutting text to a specified length. Used to fit display areas or database column limits.
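When the limit is in bytes (as with some database columns), naive slicing can cut a multi-byte character in half. The sketch below, with an illustrative function name, truncates UTF-8 at a byte limit and drops any incomplete trailing character. It does not protect grapheme clusters, which may still be split.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text so its UTF-8 encoding fits within max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" discards a partial multi-byte sequence at the cut point.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("あいう", 4))  # あ (3 bytes; the next character would not fit)
```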
Line Break
The process of wrapping text to the next line. Controlled in CSS by word-break and overflow-wrap properties.
Newline Code
Control characters representing line breaks. Three types exist: LF (Unix), CR (old Mac), and CRLF (Windows).
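A minimal sketch of normalizing all three newline conventions to LF. The order of replacements matters: CRLF must be handled before a lone CR.

```python
def normalize_newlines(text: str) -> str:
    """Convert CRLF and CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```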
Unicode Normalization
The process of unifying different representations of the same character. Four forms exist: NFC, NFD, NFKC, and NFKD.
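The four forms can be compared with the standard unicodedata module. In this sketch the helper name normalize_all is illustrative.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent


def normalize_all(text: str) -> dict:
    """Return the text in each of the four normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(normalize_all("Ａ")["NFKC"])                           # A (compatibility mapping)
```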
Trim
The process of removing whitespace from the beginning and end of a string. Provided as a standard method in most programming languages.
Escape Sequence
A string used to represent special characters. A backslash followed by a character represents newlines, tabs, and other control characters.
String Concatenation
The process of joining multiple strings into one. Achieved using the + operator, template literals, or dedicated methods.
Substring
The process of extracting a portion of a string. Achieved using methods like slice or substring; JavaScript's substr is a legacy method and best avoided in new code.
String Interpolation
Embedding variable or expression values within a string using template literals or similar syntax.
Padding
Filling a string with specific characters to reach a desired length. Implemented with padStart and padEnd methods.
Base64
An encoding scheme that converts binary data to ASCII strings using 64 characters: A-Z, a-z, 0-9, +, and /. The = character is used for padding.
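A short illustration using Python's standard base64 module. The wrapper names are illustrative.

```python
import base64


def to_base64(data: bytes) -> str:
    """Encode bytes as a Base64 ASCII string."""
    return base64.b64encode(data).decode("ascii")


def from_base64(encoded: str) -> bytes:
    """Decode a Base64 string back to bytes."""
    return base64.b64decode(encoded)


print(to_base64(b"Man"))  # TWFu
print(to_base64(b"Hi"))   # SGk= (padded to a multiple of 4 characters)
```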
Percent-Encoding
An encoding scheme that represents special characters in URLs using %XX hexadecimal format. Also known as URL encoding.
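A brief sketch using urllib.parse from the standard library. By default, quote leaves "/" unescaped and encodes non-ASCII text as UTF-8 bytes.

```python
from urllib.parse import quote, unquote

print(quote("a b&c"))                # a%20b%26c
print(quote("日本"))                 # %E6%97%A5%E6%9C%AC
print(unquote("%E6%97%A5%E6%9C%AC")) # 日本
```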
Diff
The process of detecting and displaying differences between two texts. Foundation technology for version control and code review.
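As one possible illustration, Python's difflib can produce a unified diff of two texts. The wrapper name and the "old"/"new" labels are arbitrary choices for this sketch.

```python
import difflib
from typing import List


def unified_diff_lines(old: str, new: str) -> List[str]:
    """Return unified-diff lines describing how `old` becomes `new`."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


for line in unified_diff_lines("a\nb\nc", "a\nB\nc"):
    print(line)
```

Removed lines are prefixed with "-" and added lines with "+"; identical inputs produce an empty list.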
Text Compression
Technology for reducing text data size. Algorithms like gzip, Brotli, and deflate are commonly used.
Levenshtein Distance
The edit distance between two strings. The minimum number of insertions, deletions, and substitutions needed to transform one string into another.
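A compact sketch of the standard dynamic-programming computation, keeping only one previous row in memory. The function name is illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```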
Fuzzy Matching
A search technique that finds similar strings rather than exact matches. Handles typos and spelling variations.
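One simple way to experiment with fuzzy matching is difflib.get_close_matches from the standard library, which ranks candidates by a similarity ratio. This is only one of many approaches; the wrapper name and the 0.6 cutoff (difflib's default) are illustrative choices.

```python
import difflib
from typing import List


def closest_matches(query: str, candidates: List[str], cutoff: float = 0.6) -> List[str]:
    """Return up to three candidates similar to `query`, best match first."""
    return difflib.get_close_matches(query, candidates, n=3, cutoff=cutoff)


matches = closest_matches("appel", ["apple", "ape", "peach"])
print(matches[0])  # apple
```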
Platform Limits
Character Limit
The maximum number of characters allowed for text input on a platform or system. Applied in social media, ads, and forms.
Meta Description
The HTML meta description tag. A page summary shown in search results, typically 150-160 characters.
Title Tag
The HTML title element. Displayed in search results and browser tabs, with 50-60 characters recommended.
Alt Text (alt attribute)
Alternative text for images. Important for accessibility and SEO, displayed when images cannot be loaded.
Slug (URL Slug)
A human-readable identifier used in the path portion of a URL. Affects SEO and usability.
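A minimal sketch of generating an ASCII slug from a title. The function name slugify and its rules (lowercase, hyphen-separated, accents stripped) are assumptions for illustration. This approach discards non-Latin text entirely, so it does not suit Japanese titles.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Convert a title to a lowercase, hyphen-separated ASCII slug."""
    # Decompose accented letters, then drop anything that is not ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    return slug.strip("-")


print(slugify("Hello, World!"))  # hello-world
print(slugify("Café au lait"))   # cafe-au-lait
```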
Open Graph
A meta tag protocol that controls how links appear when shared on social media. Created by Facebook.
X (Twitter) Character Limit
X (formerly Twitter) posts are limited to 280 characters. CJK characters count as 2 characters each.
Instagram Caption Limit
Instagram captions allow up to 2,200 characters. Up to 30 hashtags can be used per post.
SMS Character Limit
SMS messages are limited to 160 characters (GSM 7-bit) or 70 characters (Unicode/UCS-2). Longer messages are split.
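The splitting rule can be sketched as below. It relies on a detail not stated above: in a multi-part message each segment carries a concatenation header, which reduces the payload to 153 characters (GSM 7-bit) or 67 characters (UCS-2). The function takes a precomputed length and an encoding flag, since deciding whether text fits the GSM 7-bit alphabet is a separate problem.

```python
import math


def sms_segments(length: int, ucs2: bool = False) -> int:
    """Return how many SMS segments a message of `length` characters needs."""
    single_limit, multipart_limit = (70, 67) if ucs2 else (160, 153)
    if length <= single_limit:
        return 1
    return math.ceil(length / multipart_limit)


print(sms_segments(160))            # 1
print(sms_segments(161))            # 2
print(sms_segments(71, ucs2=True))  # 2
```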
Internationalization
Locale
A combination of language, region, and formatting settings, identified by codes like ja-JP, en-US.
ICU (International Components for Unicode)
A Unicode internationalization library providing string collation, conversion, formatting, and multilingual processing.
Bidirectional Text (BiDi)
Handling of mixed left-to-right (LTR) and right-to-left (RTL) text, needed for Arabic and Hebrew in multilingual content.
CJK (Chinese, Japanese, Korean)
A collective term for the Chinese, Japanese, and Korean languages and their scripts. Their shared Han characters are unified in Unicode as CJK Unified Ideographs.
Input Method (IME)
Software that enables typing characters not directly available on a keyboard, such as Japanese and Chinese characters.
Collation
Rules for comparing and sorting strings. Defines sort order that varies by language and culture.
Transliteration
The process of converting text from one writing system to another, typically mapping characters so that the original spelling or pronunciation is approximately preserved.
Regular Expressions
Regular Expression Pattern
A pattern language for searching and replacing text. Combines special and literal characters to define string patterns.
Regex Quantifier
Metacharacters like *, +, ?, {n,m} that specify repetition counts. They control how many times the preceding element appears.
Regex Character Class
Syntax for specifying character sets like [a-z], \d, \w. Defines the range of characters to match.
Regex Group
Capture groups using () and backreferences. Groups part of a pattern to capture and reuse matched substrings.
Regex Lookahead
A regex syntax using (?=...) and (?!...) to match based on what follows without consuming characters.
Regex Backreference
A feature that reuses text matched by a capture group within the same pattern. Referenced using \1, \2, etc.
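The regex features in this section can be tried with Python's re module. The sample strings below are made up for illustration.

```python
import re

# Character class \d with the + quantifier: runs of one or more digits.
print(re.findall(r"\d+", "a1 b22 c333"))  # ['1', '22', '333']

# Capture groups: extract the user and domain parts.
match = re.search(r"(\w+)@(\w+)\.com", "mail bob@example.com")
print(match.groups())  # ('bob', 'example')

# Positive lookahead: words followed by "!", without consuming the "!".
print(re.findall(r"\w+(?=!)", "hi! there hey!"))  # ['hi', 'hey']

# Backreference \1: find a word repeated immediately after itself.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")
print(repeated.group(1))  # is
```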
Natural Language Processing
Morphological Analysis
The process of segmenting text into minimal meaningful units (morphemes) and assigning grammatical information.
Tokenization
The process of splitting text into tokens (words, subwords, or other processing units).
Stopword
Frequently occurring words excluded from search and text analysis, such as "a," "the," "is," and "in."
N-gram
A method of splitting text into contiguous subsequences of N characters or words, used in search and text similarity.
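A minimal sketch that works on both strings (character n-grams) and lists of words (word n-grams), since both support slicing. The function name is illustrative.

```python
def ngrams(sequence, n: int) -> list:
    """Return all contiguous subsequences of length n."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(ngrams("text", 2))                 # ['te', 'ex', 'xt']
print(ngrams(["to", "be", "or"], 2))     # [['to', 'be'], ['be', 'or']]
```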
Sentiment Analysis
The process of determining emotional polarity (positive, negative, neutral) from text.
TF-IDF
Term Frequency-Inverse Document Frequency. A method for quantifying word importance within documents.
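TF-IDF has several variants. The sketch below assumes one common unsmoothed form: term frequency as a proportion of the document, multiplied by the natural log of (number of documents / documents containing the term). Documents are represented as lists of tokens, and the function name is illustrative.

```python
import math
from typing import List


def tf_idf(term: str, document: List[str], corpus: List[List[str]]) -> float:
    """Score how characteristic `term` is of `document` within `corpus`."""
    term_frequency = document.count(term) / len(document)
    document_frequency = sum(1 for doc in corpus if term in doc)
    if document_frequency == 0:
        return 0.0
    inverse_document_frequency = math.log(len(corpus) / document_frequency)
    return term_frequency * inverse_document_frequency


corpus = [["the", "cat"], ["the", "dog"]]
print(tf_idf("the", corpus[0], corpus))  # 0.0, because "the" appears in every document
print(tf_idf("cat", corpus[0], corpus))  # positive, because "cat" is distinctive
```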
Named Entity Recognition (NER)
An NLP technique that automatically identifies and classifies named entities like person names, locations, and organizations from text.
Typography
Line Height
The vertical spacing between lines of text. Controlled by the CSS line-height property, it significantly affects readability.
Font Size
The display size of text characters. Specified in CSS using units like px, em, rem, and vw.
Whitespace
Invisible characters such as spaces, tabs, and newlines. They play important roles in text processing and layout.
Ligature
A typographic technique that combines two or more characters into a single glyph. Common examples include fi, fl, and ff.
Kerning
The technique of adjusting spacing between adjacent characters to achieve visually even spacing based on character combinations.
Data Formats
JSON
JavaScript Object Notation, a lightweight data interchange format that is easy for both humans and machines to read.
CSV
Comma-Separated Values, a text format that represents data with comma delimiters. Widely used for exchanging tabular data.
XML
Extensible Markup Language, a markup language that describes data structure using tags.
YAML
YAML Ain't Markup Language, an indentation-based human-readable data serialization format.
Markdown
A lightweight markup language that adds formatting to plain text using simple syntax, convertible to HTML.
HTML Entity
Character references for representing special characters in HTML. Starts with & and ends with ;.
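For illustration, Python's standard html module converts between special characters and their entities.

```python
import html

print(html.escape("<script>"))        # &lt;script&gt;
print(html.escape('<a href="x">'))    # &lt;a href=&quot;x&quot;&gt;
print(html.unescape("Tom &amp; Jerry"))  # Tom & Jerry
```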
MIME Type
A standard classification system for identifying file and data types. Expressed in type/subtype format.
Security
Hash Value
A fixed-length value generated from arbitrary-length data using a hash function. Used for data integrity verification and tamper detection.
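A brief sketch using SHA-256 from Python's hashlib. SHA-256 is chosen here only as a familiar example of a hash function, and the wrapper name is illustrative.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of the UTF-8 encoded text as 64 hex digits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The digest length is fixed regardless of input size,
# and even a one-character change produces a different value.
print(len(sha256_hex("a")), len(sha256_hex("a" * 10_000)))  # 64 64
print(sha256_hex("hello") == sha256_hex("hellp"))           # False
```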
Checksum
A value computed for error detection in data. Used to verify data integrity during transfer and storage.
Encryption
The process of converting data into an unreadable format. Only those with the decryption key can restore the original data.
Plain Text
Unencrypted text data that is directly readable by humans.
Sanitization
The process of removing or neutralizing harmful code and invalid characters from user input. A fundamental defense against XSS and SQL injection.
Accessibility
Screen Reader
Assistive technology that reads aloud text and UI elements on screen. Supports web access for visually impaired users.
ARIA Label
An attribute defined in the WAI-ARIA specification that provides an accessible name to UI elements. Specifies text read by screen readers.
Contrast Ratio
A numerical ratio of the relative luminance of foreground and background colors. WCAG level AA requires at least 4.5:1 for normal text and 3:1 for large text.
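The calculation can be sketched as follows, using the relative luminance formula published in WCAG 2.x for sRGB colors. Colors are given as (R, G, B) tuples of 0 to 255; the function names are illustrative.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(color: RGB) -> float:
    r, g, b = (_linearize(channel) for channel in color)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: RGB, background: RGB) -> float:
    """Return the contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))  # 21.0
```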
Semantic HTML
Using HTML elements that clearly convey content meaning and structure. Properly using elements like header, nav, main, article, and section.
Focus Indicator
A visual display showing which element currently has keyboard focus. Typically shown as an outline or highlight.
Text-to-Speech (TTS)
Technology that converts text data into speech. Foundation technology for screen readers and voice assistants.