What is Unicode - Character Counter

Unicode

A universal character encoding standard covering more than 150,000 characters from the world's writing systems.

Unicode is an international character encoding standard for representing the characters of the world's writing systems in a single system. Maintained by the Unicode Consortium, Unicode 16.0 (released in 2024) contains more than 150,000 characters. It covers writing systems in current use, such as Latin, Chinese characters, Arabic, and Devanagari, as well as emoji and ancient scripts such as cuneiform and Egyptian hieroglyphs.

Before Unicode, different character encodings were in use for each language and region. Japanese used Shift_JIS, EUC-JP, and ISO-2022-JP; Chinese used GB2312 and Big5; Korean used EUC-KR. Exchanging text between systems that assumed different encodings produced garbled characters, and mixing several languages in a single file was impractical. Unicode addressed this problem by giving every character a place in one code system.

Unicode defines three encoding forms: UTF-8, UTF-16, and UTF-32. UTF-8 is variable-length (1 to 4 bytes per code point), ASCII-compatible, and the de facto standard on the web. UTF-16 uses 2 or 4 bytes per code point and is the internal string representation in JavaScript and Java. UTF-32 uses a fixed 4 bytes per code point, which makes it simple to process, but it is rarely used for storage or interchange because of its poor memory efficiency.
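The size differences between the three encoding forms can be checked directly. The following is a minimal sketch in JavaScript, assuming a runtime that provides `TextEncoder` (modern browsers and Node.js); the helper names `utf8Bytes`, `utf16Bytes`, and `utf32Bytes` are made up for this illustration.

```javascript
// Hypothetical helpers: byte size of a string in each Unicode encoding form.

function utf8Bytes(text) {
  // TextEncoder always encodes to UTF-8.
  return new TextEncoder().encode(text).length;
}

function utf16Bytes(text) {
  // A JavaScript string's length is counted in UTF-16 code units (2 bytes each).
  return text.length * 2;
}

function utf32Bytes(text) {
  // Spreading a string iterates by code point; UTF-32 uses 4 bytes per code point.
  return [...text].length * 4;
}

// "A" (U+0041), "é" (U+00E9), "字" (U+5B57), "😀" (U+1F600)
for (const sample of ["A", "\u00E9", "\u5B57", "\u{1F600}"]) {
  console.log(sample, utf8Bytes(sample), utf16Bytes(sample), utf32Bytes(sample));
}
```

For these four samples the UTF-8 sizes are 1, 2, 3, and 4 bytes, the UTF-16 sizes are 2, 2, 2, and 4 bytes, and UTF-32 is always 4 bytes.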

The code point is a key concept for understanding how Unicode is structured. Each character is assigned a code point in the range U+0000 to U+10FFFF. The Basic Multilingual Plane (BMP, U+0000-U+FFFF) contains most everyday characters, while the supplementary planes (U+10000 and above) hold most emoji, ancient scripts, and rare CJK characters. In UTF-16, characters outside the BMP are represented as surrogate pairs (two 16-bit code units), which affects character counting in programs.
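The surrogate pair mechanism comes down to a little arithmetic: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The sketch below illustrates this in JavaScript; the function name `toSurrogatePair` is hypothetical and not part of any library.

```javascript
// Hypothetical helper: compute the UTF-16 surrogate pair for a
// supplementary-plane code point (U+10000 to U+10FFFF).
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not in a supplementary plane");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const grinningFace = "\u{1F600}";
const pair = toSurrogatePair(grinningFace.codePointAt(0));

// Prints "d83d de00"
console.log(pair.map((unit) => unit.toString(16)).join(" "));

// The computed pair matches the code units JavaScript actually stores.
console.log(grinningFace.charCodeAt(0) === pair[0]); // true
console.log(grinningFace.charCodeAt(1) === pair[1]); // true
```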

A common misconception is to equate Unicode with UTF-8. Unicode is the standard that defines the mapping from characters to code points, while UTF-8 is one of its encoding forms. In addition, one displayed character does not always correspond to one code point. Combining character sequences (a base character followed by a diacritical mark) and emoji ZWJ sequences (such as the family emoji) build a single displayed character from multiple code points.
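To make the "one displayed character, several code points" point concrete, the following JavaScript sketch lists the code points behind a combining sequence and an emoji ZWJ sequence. The helper `listCodePoints` is a name invented for this example, and the family emoji shown (man, woman, girl) is just one possible ZWJ sequence.

```javascript
// Hypothetical helper: list the code points of a string in U+XXXX notation.
function listCodePoints(text) {
  // Spreading a string iterates by code point, not by UTF-16 code unit.
  return [...text].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const precomposed = "\u00E9"; // "é" as a single code point
const combining = "e\u0301";  // "e" followed by COMBINING ACUTE ACCENT

console.log(listCodePoints(precomposed)); // [ 'U+00E9' ]
console.log(listCodePoints(combining));   // [ 'U+0065', 'U+0301' ]

// The two forms render the same but compare unequal until normalized.
console.log(precomposed === combining);                  // false
console.log(precomposed === combining.normalize("NFC")); // true

// Family emoji: man + ZWJ + woman + ZWJ + girl, displayed as one glyph.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(listCodePoints(family).length); // 5
```

Normalizing to NFC before comparing or counting is one common way to handle the precomposed and combining forms consistently; it does not collapse ZWJ sequences, which remain multiple code points.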

For character counting, the complexity of Unicode makes the definition of "character count" ambiguous. In JavaScript, the length property of a string returns the number of UTF-16 code units, so an emoji outside the BMP has a length of 2. Counting user-perceived characters requires counting grapheme clusters, which JavaScript supports through the Intl.Segmenter API. Character counting tools should state clearly whether "character count" means code points, UTF-16 code units, or grapheme clusters.
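The three definitions of "character count" can be compared side by side. This is a minimal sketch, assuming a JavaScript runtime with `Intl.Segmenter` support (recent browsers and Node.js 16 or later); the function name `countCharacters` and the choice of the `"en"` locale are assumptions made for illustration.

```javascript
// Hypothetical helper: count a string three different ways.
function countCharacters(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    utf16CodeUnits: text.length,                           // what .length reports
    codePoints: [...text].length,                          // string iteration is by code point
    graphemeClusters: [...segmenter.segment(text)].length, // user-perceived characters
  };
}

console.log(countCharacters("abc"));
// { utf16CodeUnits: 3, codePoints: 3, graphemeClusters: 3 }

console.log(countCharacters("\u{1F600}")); // grinning face emoji
// { utf16CodeUnits: 2, codePoints: 1, graphemeClusters: 1 }

console.log(countCharacters("e\u0301")); // e + combining acute accent
// { utf16CodeUnits: 2, codePoints: 2, graphemeClusters: 1 }
```

A ZWJ sequence such as the man-woman-girl family emoji (`"\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"`) widens the gap further: 8 UTF-16 code units, 5 code points, and 1 grapheme cluster.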
