BMP (Basic Multilingual Plane)

The first 65,536 code points (U+0000 to U+FFFF) in Unicode. Most characters used in everyday writing are found here; characters outside this range require surrogate pairs in UTF-16.

The BMP (Basic Multilingual Plane) is the first of Unicode's 17 "planes" (Plane 0). It spans code points U+0000 through U+FFFF, covering 65,536 positions. This single plane contains ASCII, Latin, Greek, Cyrillic, Arabic, Hiragana, Katakana, and the core block of CJK Unified Ideographs (roughly 21,000 characters), making it home to the vast majority of the world's commonly used writing systems.

Beyond the BMP lie 16 supplementary planes. Plane 1 (SMP, Supplementary Multilingual Plane) holds most emoji, many historic scripts, and musical symbols. Plane 2 (SIP, Supplementary Ideographic Plane) contains CJK Unified Ideographs Extension B and several later extensions. Characters in the supplementary planes have code points from U+10000 up to U+10FFFF.
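Because each plane holds exactly 65,536 (0x10000) code points, the plane number can be read off the bits above the low 16. The following is a minimal sketch of that arithmetic; the function names planeOf and isBmp are hypothetical helpers chosen for illustration, not part of any standard API.

```javascript
// Return the Unicode plane number (0 to 16) for a code point.
// Each plane covers 0x10000 code points, so the plane is the value
// of the bits above the low 16.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  return codePoint >> 16;
}

// A code point is in the BMP exactly when its plane number is 0.
function isBmp(codePoint) {
  return planeOf(codePoint) === 0;
}

console.log(planeOf(0x0041));  // 0 (BMP)
console.log(planeOf(0x1F600)); // 1 (SMP)
console.log(planeOf(0x20BB7)); // 2 (SIP)
```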

The distinction between the BMP and supplementary planes matters most in environments that use UTF-16 encoding. In UTF-16, a BMP character fits in a single 2-byte code unit, but a supplementary character requires 4 bytes (two code units forming a surrogate pair). JavaScript's String.length returns the number of code units, so strings containing emoji or rare kanji from supplementary planes report a length larger than the visible character count.
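To make the surrogate-pair mechanism concrete, here is a sketch of the standard UTF-16 calculation: subtract 0x10000 from the code point, then split the remaining 20 bits into a high half (added to 0xD800) and a low half (added to 0xDC00). The function name toSurrogatePair is a hypothetical helper for illustration.

```javascript
// Split a supplementary code point (U+10000 to U+10FFFF) into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point: " + codePoint);
  }
  const offset = codePoint - 0x10000;       // 20-bit value
  const high = 0xD800 + (offset >> 10);     // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16)); // d83d de00
```

The result can be cross-checked against the engine's own encoding: String.fromCodePoint(0x1F600).charCodeAt(0) and charCodeAt(1) return the same two values.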

Consider some concrete examples. The letter "A" (U+0041, BMP) has a length of 1. The rare kanji "𠮷" (U+20BB7, CJK Extension B) lies outside the BMP, so its length is 2. The emoji "😀" (U+1F600) is also outside the BMP, giving a length of 2. The family emoji "👨‍👩‍👧‍👦" consists of 7 code points (4 of which are outside the BMP), resulting in a length of 11.
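These lengths can be checked by listing the UTF-16 code units of each string. The sketch below uses a hypothetical helper named codeUnits, and writes the sample strings with escape sequences so the source file stays ASCII-only.

```javascript
// List the UTF-16 code units of a string as 4-digit hex values.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

const letterA = "A";            // U+0041, in the BMP
const rareKanji = "\u{20BB7}";  // CJK Extension B, outside the BMP
const grinning = "\u{1F600}";   // emoji, outside the BMP
// Four emoji joined by three ZERO WIDTH JOINERs (U+200D, in the BMP).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(letterA.length, codeUnits(letterA));     // length 1
console.log(rareKanji.length, codeUnits(rareKanji)); // length 2
console.log(grinning.length, codeUnits(grinning));   // length 2
console.log(family.length, codeUnits(family));       // length 11
```

The family emoji's length of 11 is 4 supplementary code points at 2 code units each, plus 3 joiners at 1 code unit each.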

Accurate character counting requires a method that treats BMP and non-BMP characters alike. In JavaScript, Array.from(str).length or [...str].length counts code points rather than code units. Code points are still not the same as what a reader perceives as characters, however. For that, Intl.Segmenter counts grapheme clusters, treating a base letter plus its combining marks, or an emoji sequence, as a single character.
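The three ways of counting can be compared side by side. This is a sketch that assumes a runtime where Intl.Segmenter is available (recent browsers and Node.js releases); the function names countCodePoints and countGraphemes are hypothetical, and the "en" default locale is an arbitrary choice for illustration.

```javascript
// Count code points: the string iterator yields one item per code
// point, so surrogate pairs are not split.
function countCodePoints(str) {
  return Array.from(str).length;
}

// Count grapheme clusters (user-perceived characters) with Intl.Segmenter.
function countGraphemes(str, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

const accented = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT
const smiley = "\u{1F600}"; // one supplementary code point

console.log(accented.length, countCodePoints(accented), countGraphemes(accented)); // 2 2 1
console.log(smiley.length, countCodePoints(smiley), countGraphemes(smiley));       // 2 1 1
```

Which count is appropriate depends on the purpose: code units for storage limits in UTF-16 based APIs, code points for Unicode-level processing, and grapheme clusters for anything shown to users, such as a character counter.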

The BMP itself contains several special regions. U+D800 through U+DFFF are reserved for surrogates, which do not represent characters on their own. U+E000 through U+F8FF is the Private Use Area, where font vendors and organizations can assign custom characters. U+FDD0 through U+FDEF and the last two code points of every plane (U+FFFE and U+FFFF in the BMP) are designated as "noncharacters," permanently excluded from character assignment.
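The ranges above translate directly into a small classifier. The following is a sketch only: the function name classifySpecial and its return labels are hypothetical, and it covers just the three regions described here (the supplementary private-use planes, for example, are reported as "other").

```javascript
// Classify a code point against the special regions described above.
function classifySpecial(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
    return "surrogate";
  }
  if (codePoint >= 0xE000 && codePoint <= 0xF8FF) {
    return "private-use";
  }
  // Noncharacters: U+FDD0..U+FDEF, plus the last two code points of
  // every plane (low 16 bits equal to FFFE or FFFF).
  if ((codePoint >= 0xFDD0 && codePoint <= 0xFDEF) ||
      (codePoint & 0xFFFE) === 0xFFFE) {
    return "noncharacter";
  }
  return "other";
}

console.log(classifySpecial(0xD83D));  // surrogate
console.log(classifySpecial(0xE000));  // private-use
console.log(classifySpecial(0xFFFE));  // noncharacter
console.log(classifySpecial(0x1FFFF)); // noncharacter (last code point of Plane 1)
console.log(classifySpecial(0x0041));  // other
```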
