Unicode
A universal character encoding standard that covers over 150,000 characters from all writing systems worldwide.
Unicode is an international character encoding standard for representing all the world's characters in a single system. Managed by the Unicode Consortium, Unicode 16.0 (released in 2024) contains over 150,000 characters. It includes currently used writing systems like Latin, Chinese characters, Arabic, and Devanagari, as well as emoji, cuneiform, hieroglyphs, and other ancient scripts.
Before Unicode, different character encodings proliferated by language and region. Japanese used Shift_JIS, EUC-JP, and ISO-2022-JP; Chinese used GB2312 and Big5; Korean used EUC-KR. Exchanging text between different encodings caused garbled characters, and handling multilingual text in a single file was practically impossible. Unicode fundamentally solved this problem by enabling all characters to be represented in one code system.
Unicode defines three encoding forms: UTF-8, UTF-16, and UTF-32. UTF-8 is variable-length (1-4 bytes per character), ASCII-compatible, and the de facto web standard. UTF-16 uses 2 or 4 bytes per character and serves as the internal string representation in JavaScript and Java. UTF-32 is fixed-length at 4 bytes, simple to process but rarely used in practice due to poor memory efficiency.
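UTF-8's variable-length behavior is easy to observe with the standard `TextEncoder` API (which always encodes to UTF-8). A minimal sketch, runnable in Node or a browser:

```javascript
// TextEncoder always produces UTF-8 bytes, so the byte length
// varies with the character, even though each is one code point.
const enc = new TextEncoder();

console.log(enc.encode("A").length);  // 1 byte  (ASCII, U+0041)
console.log(enc.encode("é").length);  // 2 bytes (U+00E9)
console.log(enc.encode("あ").length); // 3 bytes (U+3042)
console.log(enc.encode("😀").length); // 4 bytes (U+1F600, outside the BMP)
```

The 1-4 byte spread is exactly why UTF-8 stays compact for ASCII-heavy text while still covering the full code point range.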
Code points are a crucial concept for understanding Unicode's structure. Each character is assigned a code point ranging from U+0000 to U+10FFFF. The Basic Multilingual Plane (BMP, U+0000-U+FFFF) contains most everyday characters, while supplementary planes (U+10000 and above) house emoji, ancient scripts, and rare CJK characters. Characters outside the BMP are represented as surrogate pairs (two code units) in UTF-16, affecting character counting in programming.
A common misconception is equating "Unicode = UTF-8." Unicode is the standard defining character-to-code-point mappings, while UTF-8 is one of its implementations. Additionally, "1 character ≠ 1 code point" in many cases. Combining characters (e.g., base character + diacritical mark) and emoji ZWJ sequences (e.g., family emoji) compose a single displayed character from multiple code points.
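Both cases of "1 character ≠ 1 code point" can be demonstrated in a few lines. A minimal sketch using the precomposed vs. combining forms of "é" and a family emoji built from ZWJ (U+200D, ZERO WIDTH JOINER):

```javascript
// Case 1: combining characters — two spellings of the same visible "é".
const precomposed = "\u00E9";   // é as a single code point
const combined    = "e\u0301";  // e + U+0301 COMBINING ACUTE ACCENT

console.log(precomposed === combined);                  // false — different code points
console.log(precomposed === combined.normalize("NFC")); // true  — equal after normalization

// Case 2: an emoji ZWJ sequence — several code points, one displayed glyph.
const family = "👨\u200D👩\u200D👧"; // man + ZWJ + woman + ZWJ + girl

console.log([...family].length); // 5 code points, rendered as a single family emoji
```

This is why string comparison in Unicode-aware code should normalize first (NFC or NFD), and why naive per-code-point processing can split a visible character apart.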
Regarding character counting, Unicode's complexity makes the definition of "character count" ambiguous. JavaScript's String.length returns the UTF-16 code unit count, so emoji outside the BMP have length 2. Accurate counting requires grapheme cluster-based counting, available in JavaScript via the Intl.Segmenter API. Character counting tools should clearly define whether "character count" means code points, UTF-16 code units, or grapheme clusters.
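The three competing definitions of "length" can all be measured on one string. A sketch using the family emoji again (requires Intl.Segmenter, available in modern browsers and Node 16+):

```javascript
// One visible character, three different "lengths".
const text = "👨\u200D👩\u200D👧"; // family emoji: man + ZWJ + woman + ZWJ + girl

// 1. UTF-16 code units — what String.length reports.
console.log(text.length); // 8 (each person emoji is a surrogate pair)

// 2. Code points — what iteration / the spread operator counts.
console.log([...text].length); // 5

// 3. Grapheme clusters — what a user perceives as "one character".
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...segmenter.segment(text)].length); // 1
```

A character counter that silently picks one of these three numbers will disagree with tools that picked another, so stating the unit explicitly is the practical fix.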