UTF-16

A Unicode encoding that uses 16-bit code units. Used internally by JavaScript, Java, and Windows.

UTF-16 is a Unicode encoding that represents each character as one or two 16-bit (2-byte) code units. Characters in the Basic Multilingual Plane (BMP, U+0000-U+FFFF) use a single unit (2 bytes), while characters outside the BMP use a surrogate pair of two units (4 bytes). It serves as the internal string representation in many programming environments, including JavaScript, Java, C#, and the Windows APIs.

UTF-16's defining feature is the surrogate pair mechanism. Characters that don't fit in the BMP (U+10000 and above) are represented by a high surrogate (U+D800-U+DBFF) followed by a low surrogate (U+DC00-U+DFFF), giving two 16-bit code units. Most emoji, some CJK characters (CJK Unified Ideographs Extension B and later), and many ancient scripts are represented using surrogate pairs.
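The arithmetic behind a surrogate pair can be sketched as follows. This is a minimal illustration rather than a library API: the function names toSurrogatePair and fromSurrogatePair are made up for this example. The code point has 0x10000 subtracted from it, and the resulting 20-bit value is split into two 10-bit halves.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
// toSurrogatePair / fromSurrogatePair are hypothetical helper names.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;           // 20-bit value
  const highUnit = 0xD800 + (offset >> 10);     // top 10 bits
  const lowUnit = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
  return [highUnit, lowUnit];
}

// Reverse the calculation: combine a surrogate pair into one code point.
function fromSurrogatePair(highUnit, lowUnit) {
  return ((highUnit - 0xD800) << 10) + (lowUnit - 0xDC00) + 0x10000;
}

const [hi, lo] = toSurrogatePair(0x1F600);      // U+1F600 is the emoji 😀
console.log(hi.toString(16), lo.toString(16));  // d83d de00
console.log(fromSurrogatePair(hi, lo).toString(16)); // 1f600
```

In everyday code the built-in String.fromCodePoint() and codePointAt() do this work; the sketch only shows what happens underneath.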

Since JavaScript strings are internally UTF-16, the length property of a string returns the number of UTF-16 code units, not the number of characters. For example, the emoji "😀" (U+1F600) is represented as a surrogate pair, so its length is 2. charAt() and charCodeAt() also operate at the code unit level and can return only half of a surrogate pair. Since ES2015, codePointAt() and for...of loops enable processing at the code point level.
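A short sketch of the difference between code unit and code point access, using only standard string methods. The sample string and variable names are chosen for illustration.

```javascript
// "a" followed by the emoji U+1F600, written as an escape for clarity.
const sample = "a\u{1F600}";

// length counts UTF-16 code units: 1 for "a" plus 2 for the surrogate pair.
console.log(sample.length);                       // 3

// charCodeAt() reads a single code unit, here only the high surrogate.
console.log(sample.charCodeAt(1).toString(16));   // d83d

// codePointAt() combines the pair starting at that index.
console.log(sample.codePointAt(1).toString(16));  // 1f600

// for...of iterates by code point, so the pair stays together.
const seenCodePoints = [];
for (const ch of sample) {
  seenCodePoints.push(ch.codePointAt(0).toString(16));
}
console.log(seenCodePoints);                      // [ '61', '1f600' ]
```

Note that codePointAt(2) on this string would land in the middle of the pair and return only the low surrogate (0xDE00), so index arithmetic still needs care.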

UTF-16 also raises the question of byte order (endianness). Each 16-bit code unit can be stored big-endian (UTF-16BE) or little-endian (UTF-16LE), and a BOM (Byte Order Mark, U+FEFF) at the beginning of a file identifies which order is in use: it appears as the bytes FE FF in big-endian data and FF FE in little-endian data. Windows Notepad saves "Unicode" as UTF-16LE with BOM. While UTF-8 standardization on the web has made UTF-16 byte order issues less common, attention is still needed when interfacing with Windows environments or legacy systems.
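BOM-based detection can be sketched as below. This is an illustrative helper, not a standard API: the name decodeUtf16WithBom is hypothetical, and the sketch assumes the input really is UTF-16 with a BOM (it does not try to tell UTF-16LE apart from UTF-32LE, whose BOM begins with the same two bytes).

```javascript
// Decode UTF-16 bytes by reading the BOM to choose the byte order.
// decodeUtf16WithBom is a hypothetical helper name for this example.
function decodeUtf16WithBom(bytes) {
  let littleEndian;
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) {
    littleEndian = true;    // FF FE: U+FEFF stored little-endian
  } else if (bytes[0] === 0xFE && bytes[1] === 0xFF) {
    littleEndian = false;   // FE FF: U+FEFF stored big-endian
  } else {
    throw new Error("no UTF-16 BOM found");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let text = "";
  // Start at offset 2 to skip the BOM, then read one code unit at a time.
  for (let i = 2; i + 1 < bytes.length; i += 2) {
    text += String.fromCharCode(view.getUint16(i, littleEndian));
  }
  return text;
}

// "Hi" encoded in both byte orders, each with its BOM.
const littleEndianBytes = new Uint8Array([0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00]);
const bigEndianBytes = new Uint8Array([0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69]);
console.log(decodeUtf16WithBom(littleEndianBytes)); // Hi
console.log(decodeUtf16WithBom(bigEndianBytes));    // Hi
```

Where TextDecoder is available with the needed encoding labels, it is usually the better choice; the manual loop is used here to make the byte order handling visible.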

ASCII characters use 1 byte in UTF-8 versus 2 bytes in UTF-16, making UTF-8 more size-efficient for English text. Conversely, CJK characters in the BMP use 3 bytes in UTF-8 versus 2 bytes in UTF-16, making UTF-16 more compact for Japanese and Chinese text. Characters outside the BMP take 4 bytes in both encodings. However, with web standards unified on UTF-8, the common practice is to use UTF-8 for file storage and communication while using UTF-16 for internal string processing.
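The size difference can be checked with a small sketch. It assumes a runtime that provides the standard TextEncoder (modern browsers and Node.js), and the helper name byteSizes is made up for this example. The UTF-16 size is computed as two bytes per code unit, without a BOM.

```javascript
// Compare encoded sizes of a string in UTF-8 and UTF-16 (no BOM).
// byteSizes is a hypothetical helper name.
function byteSizes(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // TextEncoder always emits UTF-8
    utf16: text.length * 2,                      // 2 bytes per UTF-16 code unit
  };
}

console.log(byteSizes("hello"));      // { utf8: 5, utf16: 10 }
console.log(byteSizes("日本語"));      // { utf8: 9, utf16: 6 }
console.log(byteSizes("\u{1F600}"));  // { utf8: 4, utf16: 4 }
```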

For character counting, handling surrogate pairs is the biggest challenge in UTF-16-based languages (JavaScript, Java). Using the string length directly (the length property in JavaScript, length() in Java) overcounts emoji and some CJK characters, because each surrogate pair counts as 2. In JavaScript, accurate counting requires [...str].length (spread syntax) or Array.from(str).length for the code point count, or Intl.Segmenter for the grapheme cluster count.
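The three counting approaches can be compared side by side. This is a minimal sketch: the function names are made up for this example, and the grapheme count assumes a runtime where Intl.Segmenter is available (recent browsers and Node.js versions).

```javascript
// Three ways to "count characters" in a JavaScript string.
// The function names are hypothetical helpers for this example.
function countCodeUnits(str) {
  return str.length;              // UTF-16 code units
}

function countCodePoints(str) {
  return [...str].length;         // spread iterates by code point
}

function countGraphemes(str) {
  // Intl.Segmenter groups code points into user-perceived characters.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

// The emoji U+1F600, then "e" followed by a combining acute accent (U+0301).
const counted = "\u{1F600}e\u0301";
console.log(countCodeUnits(counted));   // 4
console.log(countCodePoints(counted));  // 3
console.log(countGraphemes(counted));   // 2
```

Which count is right depends on the purpose: storage limits in UTF-16 follow code units, while a character counter shown to users usually matches grapheme clusters most closely.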
