What Is UTF-16? - Character Counter

UTF-16

A Unicode encoding that uses 16-bit code units. Used internally by JavaScript, Java, and Windows.

UTF-16 is a Unicode encoding that represents characters in 16-bit (2-byte) code units. Characters in the Basic Multilingual Plane (BMP, U+0000-U+FFFF) take one code unit (2 bytes), while characters outside the BMP are encoded as surrogate pairs, which take two code units (4 bytes). UTF-16 is the internal string representation in many programming environments, including JavaScript, Java, C#, and the Windows APIs.

The defining feature of UTF-16 is its surrogate pair mechanism. Characters that do not fit in the BMP (U+10000 and above) are represented by combining a high surrogate (U+D800-U+DBFF) and a low surrogate (U+DC00-U+DFFF), giving two 16-bit code units. Most emoji, some CJK characters (CJK Unified Ideographs Extension B and later), and ancient scripts are represented using surrogate pairs. Books on JavaScript string processing cover this mechanism in detail.
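
The arithmetic behind a surrogate pair can be sketched as follows. This is a minimal illustration, not a library API: toSurrogatePair is a hypothetical helper name, and the code assumes the input is already a valid supplementary code point.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper used only for this illustration.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point is not in a supplementary plane");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits select the high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits select the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600); // U+1F600 is the emoji "😀"
console.log(high.toString(16), low.toString(16)); // d83d de00
```

In practice, String.fromCodePoint(0x1f600) performs the same split internally, so a helper like this is mainly useful for understanding or debugging.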

Because JavaScript strings are UTF-16 internally, the length property of a string returns the number of UTF-16 code units. For example, the emoji "😀" (U+1F600) is represented as a surrogate pair, so its length is 2. charAt() and charCodeAt() also operate at the code unit level and can return only half of a surrogate pair. Since ES2015, codePointAt() and for...of loops enable code point-level processing.
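
The difference between code unit and code point access can be seen in a short sketch. The sample string is chosen only for illustration.

```javascript
// "a" is in the BMP; "\u{1F600}" (the emoji "😀") needs a surrogate pair.
const s = "a\u{1F600}";

console.log(s.length);                      // 3 (1 + 2 code units)
console.log(s.charCodeAt(1).toString(16));  // d83d (high surrogate only)
console.log(s.codePointAt(1).toString(16)); // 1f600 (full code point)

// for...of iterates by code point, so the surrogate pair stays together.
const codePoints = [];
for (const ch of s) {
  codePoints.push(ch.codePointAt(0));
}
console.log(codePoints.length); // 2
```

Note that codePointAt() still takes a code unit index, so calling it at index 2 here would return the lone low surrogate.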

UTF-16 requires attention to byte order (endianness). It comes in two serializations, UTF-16BE (big-endian) and UTF-16LE (little-endian), and a BOM (Byte Order Mark, U+FEFF) at the start of a file identifies which one is in use: the bytes FE FF indicate big-endian and FF FE indicate little-endian. The "Unicode" save option in Windows Notepad (labeled "UTF-16 LE" in recent versions) writes UTF-16LE with a BOM. While the web's standardization on UTF-8 has made UTF-16 byte order issues less common, attention is still needed when interfacing with Windows environments or legacy systems.
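
BOM-based detection can be sketched as below. detectUtf16ByteOrder is a hypothetical helper, and the sketch assumes the input is known to be UTF-16 (the bytes FF FE followed by 00 00 could otherwise be a UTF-32LE BOM).

```javascript
// Inspect the first two bytes for a UTF-16 BOM (U+FEFF).
// detectUtf16ByteOrder is a hypothetical helper used only for this illustration.
function detectUtf16ByteOrder(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xfe && bytes[1] === 0xff) return "UTF-16BE";
    if (bytes[0] === 0xff && bytes[1] === 0xfe) return "UTF-16LE";
  }
  return null; // no BOM: byte order must come from metadata or a default
}

// "A" (U+0041) preceded by a BOM, in each byte order.
const littleEndian = new Uint8Array([0xff, 0xfe, 0x41, 0x00]);
const bigEndian = new Uint8Array([0xfe, 0xff, 0x00, 0x41]);

console.log(detectUtf16ByteOrder(littleEndian)); // UTF-16LE
console.log(detectUtf16ByteOrder(bigEndian));    // UTF-16BE

// TextDecoder strips the BOM by default when decoding.
console.log(new TextDecoder("utf-16le").decode(littleEndian)); // A
```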

Compared with UTF-8, ASCII characters take 1 byte in UTF-8 versus 2 bytes in UTF-16, making UTF-8 more size-efficient for English text. Conversely, most CJK characters (those in the BMP) take 3 bytes in UTF-8 versus 2 bytes in UTF-16, making UTF-16 more compact for Japanese and Chinese text. Characters outside the BMP take 4 bytes in both encodings. However, with web standards unified on UTF-8, the common practice is to use UTF-8 for file storage and communication while using UTF-16 for internal string processing. Books on programming and character encodings explain when to use UTF-16 and when to use UTF-8.
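
The size trade-off can be checked with a small sketch. utf8ByteLength and utf16ByteLength are hypothetical helper names, and the UTF-16 figure assumes no BOM is written.

```javascript
// Byte length of a string when encoded as UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Byte length when encoded as UTF-16 without a BOM:
// each code unit counted by the length property takes 2 bytes.
function utf16ByteLength(text) {
  return text.length * 2;
}

for (const sample of ["hello", "日本語", "\u{1F600}"]) {
  console.log(sample, utf8ByteLength(sample), utf16ByteLength(sample));
}
// hello   5 bytes in UTF-8, 10 bytes in UTF-16
// 日本語  9 bytes in UTF-8,  6 bytes in UTF-16
// 😀      4 bytes in UTF-8,  4 bytes in UTF-16
```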

For character counting, handling surrogate pairs is the biggest challenge in UTF-16-based languages (JavaScript, Java). Using the length property directly overcounts emoji and some CJK characters, because each surrogate pair is counted as 2. Accurate counting requires [...str].length (spread syntax) or Array.from(str).length for a code point count, or Intl.Segmenter for a grapheme cluster count.
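
The three counting approaches can be compared side by side in a minimal sketch. countCharacters is a hypothetical helper, and it assumes a runtime that provides Intl.Segmenter (recent browsers and Node.js versions).

```javascript
// Count a string three ways: UTF-16 code units, code points, grapheme clusters.
// countCharacters is a hypothetical helper used only for this illustration.
function countCharacters(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: text.length,                           // what the length property reports
    codePoints: [...text].length,                     // surrogate pairs count once
    graphemes: [...segmenter.segment(text)].length,   // user-perceived characters
  };
}

console.log(countCharacters("\u{1F600}"));
// { codeUnits: 2, codePoints: 1, graphemes: 1 }

// "e" followed by a combining acute accent renders as one character.
console.log(countCharacters("e\u0301"));
// { codeUnits: 2, codePoints: 2, graphemes: 1 }

// Family emoji: three emoji joined by zero-width joiners (U+200D).
console.log(countCharacters("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"));
// { codeUnits: 8, codePoints: 5, graphemes: 1 }
```

Which count is right depends on the purpose: storage limits usually care about code units or bytes, while a counter shown to users is usually expected to match grapheme clusters.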

