Code Point

A unique number assigned to each character in Unicode. Written as U+ followed by hexadecimal digits, e.g., U+0041 (A).

A code point is a unique number assigned to each character in the Unicode standard. It is written as U+ followed by hexadecimal digits: U+0041 (Latin capital A), U+3042 (hiragana a), U+1F600 (grinning face emoji). It is the most fundamental unit for identifying characters in Unicode, and encodings, normalization, and character counting are all defined in terms of it.

Unicode defines 1,114,112 possible code points (U+0000 to U+10FFFF), with about 150,000 characters assigned as of 2024. This space is divided into 17 "planes" of 65,536 code points each. The first plane (BMP: Basic Multilingual Plane, U+0000 to U+FFFF) contains most commonly used characters. Characters outside the BMP (most emoji, ancient scripts, etc.) are placed in supplementary planes.
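The plane layout can be checked with a little arithmetic. The following is a minimal Python sketch; the helper name plane_of is made up for illustration and is not part of any library.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 0x10000 (65,536) code points, so the plane number
    # is whatever remains after dropping the low 16 bits.
    return code_point >> 16


# 17 planes of 65,536 code points each
TOTAL_CODE_POINTS = 0x10FFFF + 1

for cp in (0x0041, 0x3042, 0x1F600):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```

Plane 0 is the BMP, so any code point for which plane_of returns 1 or higher lies in a supplementary plane.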

In JavaScript, the string method codePointAt() retrieves the code point at a given position, and the static method String.fromCodePoint() creates a string from one or more code points. Python provides ord() and chr() for the same purpose. In JavaScript regular expressions, \u{1F600} notation with curly braces can specify code points outside the BMP, provided the u flag is set.
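As a small illustration of the Python side, the sketch below converts between characters and code points with ord() and chr(). The helper to_u_notation is a hypothetical name introduced here. Note that Python string literals write supplementary code points as \UXXXXXXXX (eight hex digits) rather than the JavaScript \u{...} form.

```python
import re


def to_u_notation(ch: str) -> str:
    """Format a single character as U+XXXX, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


print(to_u_notation("A"))           # U+0041
print(to_u_notation("\u3042"))      # U+3042
print(to_u_notation("\U0001F600"))  # U+1F600

# chr() is the inverse of ord()
grinning_face = chr(0x1F600)
print(grinning_face == "\U0001F600")  # True

# A character class spanning the Emoticons block (U+1F600 to U+1F64F)
EMOTICONS = re.compile("[\U0001F600-\U0001F64F]")
print(bool(EMOTICONS.fullmatch(grinning_face)))  # True
```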

One code point does not always correspond to one visible character. Combining characters (such as accent marks) combine with base characters to form a single grapheme cluster, and emoji sequences can chain several code points into a single displayed character: a four-person family emoji, for example, consists of 7 code points (four people joined by three zero-width joiners). Conversely, invisible format characters (such as U+200B zero-width space) are counted as code points despite not being visible on screen.
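The mismatch between code points and visible characters is easy to observe in Python, where len() on a string counts code points. This is a minimal sketch using only the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible character.
decomposed = "e\u0301"
# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1

# Family emoji: man, woman, girl, boy joined by U+200D ZERO WIDTH JOINER.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family))  # 7

# U+200B ZERO WIDTH SPACE is invisible but still counts as a code point.
padded = "ab\u200Bcd"
print(len(padded))  # 5

# General category "Cf" marks a format character; a nonzero combining class
# marks a combining mark such as the acute accent.
print(unicodedata.category("\u200B"))   # Cf
print(unicodedata.combining("\u0301"))  # 230
```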

The relationship between code points and encodings is also important. The same code point U+3042 (あ) is represented as 3 bytes (E3 81 82) in UTF-8, 2 bytes (30 42) in UTF-16, and 4 bytes (00 00 30 42) in UTF-32, with the UTF-16 and UTF-32 bytes shown here in big-endian order. Code points outside the BMP are represented as surrogate pairs (two 16-bit code units, 4 bytes in total) in UTF-16.
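These byte sequences can be reproduced with Python's built-in codecs. The sketch below uses the big-endian codec names so that no byte order mark is added; hex_bytes and surrogate_pair are hypothetical helpers written for this example.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


hiragana_a = "\u3042"
print(hex_bytes(hiragana_a, "utf-8"))      # E3 81 82
print(hex_bytes(hiragana_a, "utf-16-be"))  # 30 42
print(hex_bytes(hiragana_a, "utf-32-be"))  # 00 00 30 42

grinning_face = "\U0001F600"
print(hex_bytes(grinning_face, "utf-8"))      # F0 9F 98 80
print(hex_bytes(grinning_face, "utf-16-be"))  # D8 3D DE 00
print([f"{u:04X}" for u in surrogate_pair(0x1F600)])  # ['D83D', 'DE00']
```

With the little-endian codecs (utf-16-le, utf-32-le) the bytes within each code unit appear in reverse order, which is why the byte order has to be stated when quoting UTF-16 or UTF-32 bytes.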

For character counting, it is important to note that code point count often does not match "visual character count." JavaScript's str.length counts UTF-16 code units and [...str].length returns the code point count, but counting by grapheme clusters requires Intl.Segmenter. Accurate character counting requires understanding the concept of code points and choosing the appropriate counting level.
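To make the counting levels concrete, here is a rough Python sketch that measures one string three ways. The function count_levels is a made-up name. Python's len() counts code points (comparable to [...str].length in JavaScript), while the UTF-16 code unit count corresponds to JavaScript's str.length. Grapheme cluster counting is left out because Python's standard library has no built-in segmenter; a third-party library would be needed for that level.

```python
def count_levels(text: str) -> dict:
    """Count the same string at three different levels."""
    return {
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes; the "-le" codec adds no byte order mark.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


# "A", a grinning face emoji, and "e" with a combining acute accent:
# three visible characters.
sample = "A\U0001F600e\u0301"
counts = count_levels(sample)
print(counts)
# {'code_points': 4, 'utf16_code_units': 5, 'utf8_bytes': 8}
```

None of the three numbers equals the three characters a reader sees, which is the reason the counting level has to be chosen deliberately for limits such as database column sizes or message length caps.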
