Glossary

Character Count

The total number of characters in a text, including or excluding spaces depending on context.

Byte Count

The size of text data in bytes after encoding. The same character can have different byte sizes depending on the encoding.
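
As a rough illustration of how byte size depends on encoding, the following sketch measures the same characters in three encodings. The helper name byte_count is hypothetical and chosen only for this example.

```python
def byte_count(text: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))


for ch in ("A", "あ"):
    sizes = {enc: byte_count(ch, enc) for enc in ("utf-8", "utf-16-le", "shift_jis")}
    print(ch, sizes)
# A {'utf-8': 1, 'utf-16-le': 2, 'shift_jis': 1}
# あ {'utf-8': 3, 'utf-16-le': 2, 'shift_jis': 2}
```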

Word Count

The number of words in a text. In English, words are typically separated by spaces.

Reading Time

The estimated time required to read a text, calculated from word or character count.

Paragraph Count

The number of paragraphs in a text. Used as a metric for text structure and readability.

Sentence Count

The number of sentences in a text. Counted by sentence-ending punctuation like periods, question marks, and exclamation marks.

Readability Score

A numerical metric quantifying text readability. Flesch Reading Ease and Flesch-Kincaid Grade Level are representative examples.
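
The two scores named above can be sketched as plain functions over counts, using the commonly published coefficients. The sketch assumes that word, sentence, and syllable counts have already been obtained; syllable counting in particular is a separate problem not covered here. The function names are illustrative.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate U.S. school grade level needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Example: 100 words, 5 sentences, 150 syllables.
print(round(flesch_reading_ease(100, 5, 150), 3))   # 59.635
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # 9.91
```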

WPM (Words Per Minute)

A unit measuring typing speed as words typed per minute. Standard in English-speaking countries, with one word defined as five keystrokes on average.

Genkō Yōshi (Japanese Manuscript Paper)

Japanese grid paper for writing, with the standard format being 400 squares (20 characters × 20 lines), serving as a fundamental unit for character count management.

Line Count

The number of lines in a text. Two types are distinguished: logical lines, which end at a newline, and visual lines, which result from wrapping on screen or on paper. Used when estimating character counts and page counts.

Unicode

A universal character encoding standard that assigns code points to more than 140,000 characters, covering most of the world's writing systems in current use.

UTF-8

A variable-length Unicode encoding. The dominant character encoding on the web, used by over 98% of websites.

Shift_JIS

A Japanese character encoding widely used in legacy systems. Being gradually replaced by UTF-8.

ASCII

A 7-bit character encoding standard representing 128 characters including English letters, digits, and basic symbols.

UTF-16

A Unicode encoding that uses 16-bit code units. Used internally by JavaScript, Java, and Windows.

EUC-JP

A Japanese character encoding widely used on UNIX systems. Part of the Extended Unix Code family.

ISO-2022-JP

A Japanese encoding designed for email. Uses escape sequences to switch between character sets.

BOM (Byte Order Mark)

A byte sequence at the start of a file that identifies the encoding and, for UTF-16, the byte order. EF BB BF for UTF-8; FF FE (little-endian) or FE FF (big-endian) for UTF-16.
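
A minimal sketch of BOM detection for the byte sequences listed above, using the constants in Python's standard codecs module. The function name detect_bom is hypothetical. UTF-32 is deliberately left out; its little-endian BOM also begins with FF FE, so a fuller detector would need to check for it first.

```python
import codecs

# Only the BOMs named in this entry are covered.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfhello"))  # utf-8
print(detect_bom(b"hello"))              # None
```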

Code Point

A unique number assigned to each character in Unicode. Written as U+ followed by hexadecimal digits, e.g., U+0041 (A).

Surrogate Pair

A mechanism in UTF-16 to represent characters outside the BMP using two 16-bit code units.
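
The arithmetic behind a surrogate pair can be sketched as follows: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The function name to_surrogate_pair is illustrative.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)  # U+1F600 GRINNING FACE
print(f"{high:04X} {low:04X}")  # D83D DE00
```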

Combining Character

A Unicode character that combines with the preceding base character for display. Includes diacritical marks and dakuten.

Endianness

The byte order of multi-byte data. Two types exist: big-endian and little-endian.

Character Set

A defined collection of characters and their numbering system. ASCII, ISO 8859, and Unicode are representative examples.

GSM-7 Encoding

A 7-bit character encoding used in SMS, allowing 160 characters per message for alphanumeric text and basic symbols.

Mojibake (Character Corruption)

A phenomenon where text displays as garbled symbols or incorrect characters due to a mismatch between the encoding used to write the data and the encoding used to read it.
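
The mismatch can be reproduced in a few lines. This sketch writes text as UTF-8 and reads it back as Latin-1, one common pairing chosen here purely as an example, and then reverses the mistake.

```python
original = "café"

# Written as UTF-8, but read as Latin-1: each byte becomes its own character.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Reversing the wrong step recovers the text, provided no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```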

Character Encoding

A system of rules that maps characters to sequences of bits. It consists of two layers: a character set (which characters are included) and an encoding scheme (how those characters are converted to byte sequences).

JIS (Japanese Industrial Standards)

Japan's national standards for industrial products. In the character encoding domain, JIS X 0208 and JIS X 0213 form the foundation of Japanese text processing.

Variable-Length Encoding

An encoding scheme where different characters use different numbers of bytes. UTF-8 and Shift_JIS are representative examples, achieving efficiency by representing frequently used characters with shorter byte sequences.

BMP (Basic Multilingual Plane)

The first 65,536 code points (U+0000 to U+FFFF) in Unicode. Most characters used in everyday writing are found here; characters outside this range require surrogate pairs in UTF-16.

Full-Width Character

A character that occupies twice the width of a half-width character in fixed-width fonts. Common in CJK text.

Half-Width Character

A character that occupies half the width of a full-width character in fixed-width fonts. ASCII characters are half-width.
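
One way to estimate column width in a fixed-width display is the East Asian Width property exposed by Python's unicodedata module. The sketch below treats wide (W) and full-width (F) characters as two columns and everything else as one. That is a simplification: combining characters and ambiguous-width characters are not handled specially. The function name display_width is hypothetical.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of fixed-width columns `text` occupies."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


print(display_width("abc"))     # 3
print(display_width("日本語"))  # 6
print(display_width("ｱ"))       # 1 (half-width katakana)
```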

Hiragana

One of the Japanese phonetic writing systems. Used for native words, particles, and verb endings.

Katakana

One of the Japanese phonetic writing systems. Used for loanwords, onomatopoeia, and scientific terms.

Kanji

Logographic characters originating from China. Japan uses 2,136 jōyō kanji for everyday communication.

Grapheme Cluster

The smallest visual unit that a human perceives as a single character. May consist of multiple code points.

Emoji

Pictographic symbols encoded in Unicode. Used to visually express emotions and concepts in text communication.

Romaji

The romanization of Japanese using Latin alphabet characters. Hepburn and Kunrei-shiki are the main systems.

Zero-Width Space

An invisible character with zero display width (U+200B). Used as a line break hint and for text processing control.

Diacritical Mark

Auxiliary symbols added above or below characters. Indicates pronunciation differences such as accents and umlauts.

Ideograph

A character that itself carries meaning rather than representing only a sound. Chinese characters (kanji) are the prime example, encoded as CJK Unified Ideographs in Unicode.

ZWJ (Zero Width Joiner)

A zero-width control character (U+200D) in Unicode used to join multiple characters or emoji into a single display unit.
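
To illustrate why joined emoji complicate character counting, this sketch builds a family emoji from three emoji and two ZWJ characters and measures it in several units. It is typically displayed as one symbol, yet none of the counts below is 1.

```python
# MAN + ZWJ + WOMAN + ZWJ + GIRL
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

code_points = len(family)                           # Python counts code points
utf16_units = len(family.encode("utf-16-le")) // 2  # what JavaScript's length reports
utf8_bytes = len(family.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)  # 5 8 18
```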

Variant Glyph

Kanji characters that share the same meaning and reading but differ in visual form. Includes relationships such as standard forms, popular forms, old-style characters, and simplified characters.

Control Character

Special characters that are not displayed on screen but instruct how text should be processed. Examples include line feed, tab, carriage return, and null.

Invisible Character

A collective term for characters that exist within text data but have no visible representation on screen. Includes zero-width spaces, bidirectional control characters, and soft hyphens.

Joyo Kanji (Common-Use Kanji)

A list of 2,136 kanji designated by Japan's Council for Cultural Affairs as a guideline for everyday kanji usage.

Character Type

A classification of the characters that make up text, including hiragana, katakana, kanji, Latin letters, digits, and symbols.

Token

The smallest unit of text processing. LLMs use their own tokenization schemes that differ from characters or words.

Truncation

The process of cutting text to a specified length. Used to fit display areas or database column limits.
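
When the limit is expressed in bytes, as with some database columns, a naive cut can split a multi-byte character. The following sketch, with the hypothetical name truncate_utf8, cuts at the byte level and discards any incomplete trailing sequence. It does not try to keep grapheme clusters intact.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut `text` so its UTF-8 form fits in `max_bytes` without a broken character."""
    cut = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a partial multi-byte sequence left at the end.
    return cut.decode("utf-8", errors="ignore")


print(truncate_utf8("あいう", 7))  # あい  (each character is 3 bytes)
print(truncate_utf8("abc", 10))    # abc
```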

Line Break

The process of wrapping text to the next line. Controlled in CSS by word-break and overflow-wrap properties.

Newline Code

Control characters representing line breaks. Three types exist: LF (Unix), CR (old Mac), and CRLF (Windows).

Unicode Normalization

The process of unifying different representations of the same character. Four forms exist: NFC, NFD, NFKC, and NFKD.
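
A short sketch of the four forms using Python's unicodedata.normalize. NFC and NFD switch between composed and decomposed sequences, while the K forms additionally fold compatibility variants such as full-width letters and half-width katakana.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1

# NFD goes the other way: one code point becomes two.
print(len(unicodedata.normalize("NFD", "\u00e9")))  # 2

# NFKC also folds compatibility characters.
print(unicodedata.normalize("NFKC", "\uFF21"))        # A  (from full-width Ａ)
print(unicodedata.normalize("NFKC", "\uFF76\uFF9E"))  # ガ (from half-width ｶﾞ)
```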

Trim

The process of removing whitespace from the beginning and end of a string. Provided as a standard method in most programming languages.

Escape Sequence

A string used to represent special characters. A backslash followed by a character represents newlines, tabs, and other control characters.

String Concatenation

The process of joining multiple strings into one. Achieved using the + operator, template literals, or dedicated methods.

Substring

The process of extracting a portion of a string. Achieved using methods like slice, substring, or substr.

String Interpolation

Embedding variable or expression values within a string using template literals or similar syntax.

Padding

Filling a string with specific characters to reach a desired length. Implemented with padStart and padEnd methods.

Base64

An encoding scheme that converts binary data to ASCII strings using 64 characters: A-Z, a-z, 0-9, +, and /.

Percent-Encoding

An encoding scheme that represents special characters in URLs using %XX hexadecimal format. Also known as URL encoding.
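
As an illustration, Python's urllib.parse.quote encodes non-ASCII text as its UTF-8 bytes, each written as %XX, and unquote reverses the process.

```python
from urllib.parse import quote, unquote

encoded = quote("日本語 text")
print(encoded)           # %E6%97%A5%E6%9C%AC%E8%AA%9E%20text
print(unquote(encoded))  # 日本語 text
```

Each kanji here takes three UTF-8 bytes and therefore nine characters once encoded, which matters when a URL length limit applies.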

Diff

The process of detecting and displaying differences between two texts. Foundation technology for version control and code review.

Text Compression

Technology for reducing text data size. Algorithms like gzip, Brotli, and deflate are commonly used.

Levenshtein Distance

The edit distance between two strings. The minimum number of insertions, deletions, and substitutions needed to transform one string into another.
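
The definition above can be sketched with the usual dynamic-programming approach, keeping only two rows of the table at a time. This version compares code points, so a character made of several code points contributes more than one unit.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```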

Fuzzy Matching

A search technique that finds similar strings rather than exact matches. Handles typos and spelling variations.

Flick Input

A Japanese smartphone input method where characters are selected by flicking keys in four directions on a touchscreen. Faster than toggle input.

Validation

The process of verifying that input data conforms to specified formats, ranges, and constraints. Includes character count limits, character type checks, and format verification.

Placeholder

Temporary text displayed inside an input field that shows users the expected format or provides an example. It disappears when the user begins typing.

Case Conversion

The process of converting alphabetic characters between uppercase and lowercase forms. Conversion rules vary by language, and in some cases the character count changes after conversion.
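
A small example of the character count changing after conversion: in Python, the German sharp s uppercases to two letters.

```python
word = "straße"
upper = word.upper()
print(upper, len(word), len(upper))  # STRASSE 6 7
```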

Parsing

The process of analyzing text data according to syntactic rules and converting it into structured data.

Chunk

A smaller unit produced by dividing a large body of data or text into manageable pieces. Used for AI token limit management, streaming delivery, and file transfer.

Proofreading and Editing

The revision process for improving the quality of written text. Editing focuses on refining expression and structure, while proofreading targets typos and factual errors.

OCR (Optical Character Recognition)

A technology that automatically recognizes characters in images or scanned documents and converts them into editable text data.

Predictive Text Input

A feature that predicts the word or phrase a user is about to type based on the characters entered so far, presenting suggestions in a candidate list.

Text Editor

Software designed for creating and editing text files. It works with plain text and typically offers features such as character counting, find-and-replace, and syntax highlighting.

Manuscript

The source text created for printing, publishing, or broadcasting. The 400-character manuscript sheet is the standard unit for Japanese text volume.

Sorting (String Ordering)

The process of arranging strings in a specific order. The correct order varies by language and culture.

Character Limit

The maximum number of characters allowed for text input on a platform or system. Applied in social media, ads, and forms.

Meta Description

The HTML meta description tag. A page summary shown in search results, typically 150-160 characters.

Title Tag

The HTML title element. Displayed in search results and browser tabs, with 50-60 characters recommended.

Alt Text (alt attribute)

Alternative text for images. Important for accessibility and SEO, displayed when images cannot be loaded.

Slug (URL Slug)

A human-readable identifier used in the path portion of a URL. Affects SEO and usability.

Open Graph

A meta tag protocol that controls how links appear when shared on social media. Created by Facebook.

X (Twitter) Character Limit

X (formerly Twitter) posts are limited to 280 characters. CJK characters count as 2 characters each.

Instagram Caption Limit

Instagram captions allow up to 2,200 characters. Up to 30 hashtags can be used per post.

SMS Character Limit

SMS messages are limited to 160 characters (GSM 7-bit) or 70 characters (Unicode/UCS-2). Longer messages are split.
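
The splitting of longer messages can be sketched as below. The per-segment sizes of 153 (GSM 7-bit) and 67 (UCS-2) are not stated in this entry; they are the commonly cited limits for concatenated SMS, where part of each segment is taken up by a header. The function name sms_segments is hypothetical, and the length passed in is assumed to be already counted in the units of the chosen encoding.

```python
import math


def sms_segments(length: int, unicode_message: bool = False) -> int:
    """Estimate how many SMS segments a message of `length` units needs."""
    # (single-message limit, per-segment limit when concatenated)
    single, multi = (70, 67) if unicode_message else (160, 153)
    if length <= single:
        return 1
    return math.ceil(length / multi)


print(sms_segments(160))       # 1
print(sms_segments(161))       # 2
print(sms_segments(71, True))  # 2
```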

Hashtag

A keyword prefixed with the # symbol, used as metadata on social media posts for categorization and discoverability.

Caption

Descriptive text accompanying images or videos. On social media, it refers to the post body text, with character limits varying by platform.

Locale

A combination of language, region, and formatting settings, identified by codes like ja-JP, en-US.

ICU (International Components for Unicode)

A Unicode internationalization library providing string collation, conversion, formatting, and multilingual processing.

Bidirectional Text (BiDi)

Handling of mixed left-to-right (LTR) and right-to-left (RTL) text, needed for Arabic and Hebrew in multilingual content.

CJK (Chinese-Japanese-Korean Unified Ideographs)

A collective term for the Chinese, Japanese, and Korean languages and their scripts. Their shared Han characters are unified in Unicode as CJK Unified Ideographs.

Input Method (IME)

Software that enables typing characters not directly available on a keyboard, such as Japanese and Chinese characters.

Collation

Rules for comparing and sorting strings. Defines sort order that varies by language and culture.

Transliteration

The process of converting text from one writing system to another while preserving phonetics.

Regular Expression Pattern

A pattern language for searching and replacing text. Combines special and literal characters to define string patterns.

Regex Quantifier

Metacharacters like *, +, ?, {n,m} that specify repetition counts. They control how many times the preceding element appears.

Regex Character Class

Syntax for specifying character sets like [a-z], \d, \w. Defines the range of characters to match.

Regex Group

Capture groups using () and backreferences. Groups part of a pattern to capture and reuse matched substrings.

Regex Lookahead

A regex syntax using (?=...) and (?!...) to match based on what follows without consuming characters.

Regex Backreference

A feature that reuses text matched by a capture group within the same pattern. Referenced using \1, \2, etc.
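
The group, backreference, and lookahead entries above can be tried out with Python's re module. The two patterns below are illustrative only: the first finds an immediately repeated word, and the second accepts strings of at least eight characters that contain a digit and no whitespace.

```python
import re

# (\w+) captures a word; \1 requires the same text again after a space.
repeated_word = re.compile(r"\b(\w+) \1\b")
match = repeated_word.search("this is is a test")
print(match.group(0), "|", match.group(1))  # is is | is

# (?=.*\d) requires a digit somewhere ahead; (?!.*\s) forbids whitespace.
# Neither lookahead consumes characters, so .{8,} still sees the whole string.
rule = re.compile(r"(?=.*\d)(?!.*\s).{8,}")
print(bool(rule.fullmatch("passw0rd")))   # True
print(bool(rule.fullmatch("password")))   # False
print(bool(rule.fullmatch("pass w0rd")))  # False
```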

Morphological Analysis

The process of segmenting text into minimal meaningful units (morphemes) and assigning grammatical information.

Tokenization

The process of splitting text into tokens (words, subwords, or other processing units).

Stopword

Frequently occurring words excluded from search and text analysis, such as "a," "the," "is," and "in."

N-gram

A method of splitting text into contiguous subsequences of N characters or words, used in search and text similarity.
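
A minimal sketch of N-gram extraction. Because it relies only on slicing, the same hypothetical ngrams function works on a string (character N-grams) or a list of words (word N-grams).

```python
def ngrams(items, n: int):
    """Return every contiguous run of n elements from a string or list."""
    return [items[i:i + n] for i in range(len(items) - n + 1)]


print(ngrams("text", 2))                  # ['te', 'ex', 'xt']
print(ngrams("to be or not".split(), 2))  # [['to', 'be'], ['be', 'or'], ['or', 'not']]
```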

Sentiment Analysis

The process of determining emotional polarity (positive, negative, neutral) from text.

TF-IDF

Term Frequency-Inverse Document Frequency. A method for quantifying word importance within documents.

Named Entity Recognition (NER)

An NLP technique that automatically identifies and classifies named entities like person names, locations, and organizations from text.

Text Mining

A set of techniques for extracting useful patterns and insights from large volumes of text data using statistical and linguistic methods.

Text Summarization

The process of condensing a long text into a shorter version that preserves the key points. There are two main approaches: extractive and abstractive. An essential technique for communicating information under character limits.

BPE (Byte Pair Encoding)

An algorithm that builds a subword vocabulary by repeatedly merging the most frequent adjacent pair of bytes or characters, then uses that vocabulary to split text into subword units. Widely adopted as the tokenizer in large language models such as GPT.

Natural Language Processing (NLP)

The broad field of technology concerned with processing, understanding, and generating human language by computer.

Machine Translation

Technology that automatically translates text from one language to another using a computer.

Line Height

The vertical spacing between lines of text. Controlled by the CSS line-height property, it significantly affects readability.

Whitespace

Invisible characters such as spaces, tabs, and newlines. They play important roles in text processing and layout.

Ligature

A typographic technique that combines two or more characters into a single glyph. Common examples include fi, fl, and ff.

Kerning

The technique of adjusting spacing between adjacent characters to achieve visually even spacing based on character combinations.

Speech Bubble

A graphic element enclosing character dialogue in comics and chat UIs. Character count constraints within limited space are closely tied to design.

Ruby Annotation

Small characters placed above a base character to indicate its pronunciation. Implemented in HTML using the ruby element.

Character Width

The horizontal space each character occupies when text is displayed. Encompasses the distinction between full-width and half-width characters, as well as variable widths in proportional fonts.

Punctuation

A collective term for symbols used in writing to clarify sentence structure and meaning, including periods, commas, quotation marks, and brackets. Their types, usage rules, and character widths vary by language and region.

Vertical Writing

A writing direction in which text flows from top to bottom and lines progress from right to left. Traditionally used in Japanese and Chinese, it is implemented on the web using the CSS writing-mode property.

Word Wrap

The automatic process of breaking text onto the next line when it exceeds the width of the display area. Whether the break occurs mid-word or at word boundaries depends on the language and settings.

Quotation Mark

A punctuation mark used in text to indicate quotations, dialogue, emphasis, or titles. The shape varies by language and region.

Hyphens and Dashes

Horizontal line symbols used in text for joining words, indicating ranges, and setting off parenthetical phrases.

Font (Typeface)

A dataset that defines the visual design of characters. Directly affects character display width and readability.

Indentation

A formatting technique that inserts whitespace at the beginning of a line to indicate paragraph starts or hierarchical structure.

Letter Spacing (Tracking)

The distance between characters in text. Controlled by CSS letter-spacing, it affects readability and design impression.

JSON

JavaScript Object Notation, a lightweight data interchange format that is easy for both humans and machines to read.

CSV

Comma-Separated Values, a text format that represents data with comma delimiters. Widely used for exchanging tabular data.

XML

Extensible Markup Language, a markup language that describes data structure using tags.

YAML

YAML Ain't Markup Language, an indentation-based human-readable data serialization format.

Markdown

A lightweight markup language that adds formatting to plain text using simple syntax, convertible to HTML.

HTML Entity

Character references for representing special characters in HTML. Starts with & and ends with ;.

MIME Type

A standard classification system for identifying file and data types. Expressed in type/subtype format.

QR Code

A type of 2D barcode capable of storing up to 7,089 digits or approximately 1,800 kanji characters, with built-in error correction.

SSID

A Wi-Fi network identifier with a maximum of 32 bytes, configured on routers to distinguish between access points.

Compression Ratio

In data compression, the ratio of the compressed size to the original size. Text data is highly redundant and can typically be reduced in size by 60% to 80%, which corresponds to a ratio of roughly 20% to 40%.
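
A quick way to observe this is to compress text with Python's zlib module (a deflate implementation) and divide the sizes. The function name compression_ratio is illustrative. The sample input below is deliberately repetitive, so it compresses far better than ordinary prose would.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size, measured in UTF-8 bytes."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a ratio for empty input")
    return len(zlib.compress(raw)) / len(raw)


sample = "the quick brown fox jumps over the lazy dog. " * 200
print(f"{compression_ratio(sample):.1%}")
```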

Entropy (Information Content)

A measure of uncertainty in information theory. Higher entropy in text means it is harder to predict and more difficult to compress; lower entropy indicates redundancy.
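
As a sketch, per-character Shannon entropy can be computed from character frequencies alone. This ignores context between characters, so it is only a first-order estimate. The function name shannon_entropy is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Entropy in bits per character, based on single-character frequencies."""
    total = len(text)
    counts = Counter(text)
    # p * log2(1/p), summed over each distinct character
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(shannon_entropy("aaaa"))  # 0.0 (fully predictable)
print(shannon_entropy("abab"))  # 1.0
print(shannon_entropy("abcd"))  # 2.0
```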

Hash Value

A fixed-length value generated from arbitrary-length data using a hash function. Used for data integrity verification and tamper detection.

Checksum

A value computed for error detection in data. Used to verify data integrity during transfer and storage.

Encryption

The process of converting data into an unreadable format. Only those with the decryption key can restore the original data.

Plain Text

Unencrypted text data that is directly readable by humans.

Sanitization

The process of removing or neutralizing harmful code and invalid characters from user input. A fundamental defense against XSS and SQL injection.

Digital Signature

A mechanism that uses cryptographic techniques to prove the identity of a data creator and verify data integrity.

Screen Reader

Assistive technology that reads aloud text and UI elements on screen. Supports web access for visually impaired users.

ARIA Label

An attribute defined in the WAI-ARIA specification that provides an accessible name to UI elements. Specifies text read by screen readers.

Contrast Ratio

The ratio between the relative luminance of the lighter and the darker of the foreground and background colors, ranging from 1:1 to 21:1. WCAG level AA requires at least 4.5:1 for normal-size text.
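
The calculation can be sketched from the WCAG 2.x definition of relative luminance for sRGB colors. Colors are assumed to be given as (R, G, B) tuples of 0 to 255 integers, and the function names are illustrative.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb) -> float:
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg) -> float:
    """(lighter + 0.05) / (darker + 0.05), so the result is always >= 1."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))        # 21.0
print(round(contrast_ratio((118, 118, 118), (255, 255, 255)), 2))  # 4.54
```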

Semantic HTML

Using HTML elements that clearly convey content meaning and structure. Properly using elements like header, nav, main, article, and section.

Focus Indicator

A visual display showing which element currently has keyboard focus. Typically shown as an outline or highlight.

Text-to-Speech (TTS)

Technology that converts text data into speech. Foundation technology for screen readers and voice assistants.

Text Measurement