Surrogate Pair

A mechanism in UTF-16 to represent characters outside the BMP using two 16-bit code units.

A surrogate pair is a mechanism in UTF-16 encoding for representing characters outside the Basic Multilingual Plane (BMP: U+0000 to U+FFFF). It combines a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF), two 16-bit code units that together represent a single code point. The 1,048,576 code points in the range U+10000 to U+10FFFF (roughly one million) are represented using this scheme.
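The mapping between a supplementary code point and its two surrogates is simple bit arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The following is a minimal sketch of that arithmetic; the function names toSurrogatePair and fromSurrogatePair are illustrative only and not part of any standard API.

```typescript
// Illustrative helpers (hypothetical names) for the UTF-16 surrogate arithmetic.

interface SurrogatePair {
  high: number; // U+D800 to U+DBFF
  low: number;  // U+DC00 to U+DFFF
}

// Split a supplementary code point (U+10000 to U+10FFFF) into its surrogates.
function toSurrogatePair(codePoint: number): SurrogatePair {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point must be in U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;       // 20-bit value
  return {
    high: 0xd800 + (offset >> 10),          // top 10 bits
    low: 0xdc00 + (offset & 0x3ff),         // bottom 10 bits
  };
}

// Recombine a high and low surrogate into the original code point.
function fromSurrogatePair(high: number, low: number): number {
  if (high < 0xd800 || high > 0xdbff || low < 0xdc00 || low > 0xdfff) {
    throw new RangeError("expected a high surrogate followed by a low surrogate");
  }
  return ((high - 0xd800) << 10) + (low - 0xdc00) + 0x10000;
}

// U+1F600 splits into high surrogate 0xD83D and low surrogate 0xDE00.
const grinningFace = toSurrogatePair(0x1f600);
console.log(grinningFace.high.toString(16), grinningFace.low.toString(16));
```

The range checks are a design choice of this sketch: code points inside the BMP need no surrogates, so passing one in is treated as an error rather than silently returning nonsense.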

The need for surrogate pairs arose from Unicode's expansion history. Originally, Unicode planned to accommodate all world scripts within 16 bits (65,536 characters), but the addition of CJK variant characters, historical scripts, and emoji revealed that 16 bits were insufficient. UTF-16 therefore reserved an unused BMP range (U+D800 to U+DFFF) for surrogates, introducing the two-code-unit representation.

Most emoji are located outside the BMP and are represented as surrogate pairs. For example, "😀" (U+1F600) is represented by high surrogate U+D83D and low surrogate U+DE00. In JavaScript, the length property of a string returns the number of UTF-16 code units, so the length of "😀" is 2, not 1. Similarly, charAt() and charCodeAt() operate at the code unit level: each call sees only one half of a surrogate pair and cannot return the full character.
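The behavior described above can be observed directly. This is a small sketch using only standard string members; the variable names are arbitrary.

```typescript
// "\u{1F600}" is the grinning face emoji, written as an escape so the
// example does not depend on the source file's encoding.
const grin = "\u{1F600}";

const unitCount = grin.length;          // counts UTF-16 code units
const firstUnit = grin.charCodeAt(0);   // high surrogate
const secondUnit = grin.charCodeAt(1);  // low surrogate
const firstHalf = grin.charAt(0);       // a lone high surrogate, not the emoji

console.log(unitCount);                 // 2
console.log(firstUnit.toString(16));    // d83d
console.log(secondUnit.toString(16));   // de00
console.log(firstHalf === grin);        // false
```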

To count code points rather than code units, use [...str].length or Array.from(str).length. Both rely on the string iterator, which walks the string by code point and treats each surrogate pair as a single element. Since ES2015, for...of loops iterate by code point as well, and codePointAt() returns the full code point when given the index of a high surrogate (its argument is still a code unit index). However, accurately counting grapheme clusters (user-perceived characters built from combining characters or ZWJ sequences) requires the Intl.Segmenter API.
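The gap between code points and grapheme clusters is easiest to see side by side. The sketch below assumes a runtime that provides Intl.Segmenter (recent browsers and Node.js versions) and TypeScript configured with a lib that includes its types, such as ES2022; the function names countCodePoints and countGraphemes are made up for this example.

```typescript
// Count code points using the string iterator (surrogate pairs count once).
function countCodePoints(text: string): number {
  return Array.from(text).length;
}

// Count grapheme clusters with Intl.Segmenter.
// The default locale "en" is an assumption of this sketch.
function countGraphemes(text: string, locale: string = "en"): number {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}

// A family emoji built from three emoji joined by ZWJ (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
// "e" followed by a combining acute accent (U+0301).
const accented = "e\u0301";

console.log(family.length, countCodePoints(family), countGraphemes(family));
// 8 5 1
console.log(accented.length, countCodePoints(accented), countGraphemes(accented));
// 2 2 1
```

Counting by code point fixes the surrogate pair problem but still reports 5 for the family emoji, which is why grapheme segmentation is a separate step.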

Surrogate pairs are specific to UTF-16 and do not exist in UTF-8 or UTF-32. UTF-8 directly encodes code points using variable-length sequences (1 to 4 bytes), while UTF-32 uses a fixed 4 bytes for all code points. Characters that need a surrogate pair in UTF-16 are exactly those that need 4 bytes in UTF-8, which is why they can cause storage problems in databases: MySQL's utf8 character set stores at most 3 bytes per character and cannot hold them, whereas utf8mb4 can.
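The size differences between the three encodings follow directly from the code point's value. The following sketch computes them by hand instead of calling an encoder, so the thresholds stay visible; the helper names, including needsFourByteUtf8, are hypothetical and not part of any database driver or standard library.

```typescript
// Bytes needed to encode one code point in UTF-8.
function utf8ByteLength(codePoint: number): number {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4; // supplementary planes, U+10000 to U+10FFFF
}

// Code units needed in UTF-16: two means a surrogate pair.
function utf16UnitCount(codePoint: number): number {
  return codePoint < 0x10000 ? 1 : 2;
}

// UTF-32 always uses one 4-byte unit per code point.
const UTF32_BYTES_PER_CODE_POINT = 4;

// True if the text contains any character outside the BMP, i.e. one that
// a 3-byte-per-character column such as MySQL's utf8 cannot store.
function needsFourByteUtf8(text: string): boolean {
  return Array.from(text).some((ch) => (ch.codePointAt(0) ?? 0) > 0xffff);
}

for (const cp of [0x41, 0x20ac, 0x1f600]) {
  console.log(
    `U+${cp.toString(16).toUpperCase()}`,
    utf8ByteLength(cp),          // UTF-8 bytes
    utf16UnitCount(cp) * 2,      // UTF-16 bytes
    UTF32_BYTES_PER_CODE_POINT,  // UTF-32 bytes
  );
}
```

A check like needsFourByteUtf8 could be run before writing to a legacy column, though converting the column to utf8mb4 is usually the more durable fix.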

From a character counting perspective, surrogate pairs raise the fundamental question of what constitutes "one character." Results differ depending on whether you count UTF-16 code units, code points, or grapheme clusters. When designing character counting tools, it is crucial to clearly define the counting unit and ensure it matches user expectations.
