Unicode

A universal character encoding standard that covers over 140,000 characters from all writing systems worldwide.

Unicode is an international character encoding standard designed to represent every character from every writing system in the world. Unicode 16.0, released in September 2024, defines 154,998 characters, covering modern and historic scripts, symbols, and emoji.

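Before Unicode, different regions used mutually incompatible encodings, such as Shift_JIS for Japanese, GB2312 for Simplified Chinese, and KS X 1001 for Korean. The same byte sequence could mean different characters depending on which encoding the reader assumed, so text exchanged between systems was often garbled. Unicode replaced these with a single system that assigns every character a unique number, called a code point.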

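Unicode defines three encoding forms. UTF-8 is variable-length, using one to four bytes per code point, and is the dominant encoding on the web. UTF-16 uses one or two 16-bit code units per code point and is the form used for strings in JavaScript and Java. UTF-32 is fixed-length, using four bytes for every code point.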
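The size difference between the encoding forms can be checked with a short Python sketch. The helper name `encoded_sizes` is made up for this illustration; the codec names are Python's standard ones, and the little-endian variants are chosen only so that no byte order mark is added to the count.

```python
# Byte length of a single code point in each Unicode encoding form.
# The "-le" codecs are used so Python does not prepend a byte order mark.
def encoded_sizes(ch: str) -> dict[str, int]:
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# Sample code points, written as escapes so the source file's own
# encoding cannot alter them: U+0041, U+00E9, U+65E5, U+1F600.
for ch in ["A", "\u00e9", "\u65e5", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

For these samples, UTF-8 needs 1, 2, 3, and 4 bytes respectively, UTF-16 needs 2 bytes for the first three and 4 bytes for U+1F600, and UTF-32 always needs 4 bytes.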

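Character counting with Unicode requires care, because what a reader sees as one character does not always correspond to one unit of storage. In UTF-16, code points above U+FFFF are stored as surrogate pairs, which count as two code units. A combining character, such as an accent, attaches to the preceding base character and adds a code point without adding a visible character. Emoji sequences join several code points into a single displayed symbol. As a result, the same string can have different lengths depending on whether bytes, code units, code points, or user-perceived characters are counted.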
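The three cases can be illustrated with a minimal Python sketch using only the standard library. The helper names `code_points` and `utf16_units` and the sample strings are assumptions chosen for this example. Python's `len` counts code points; counting user-perceived characters (grapheme clusters) needs the segmentation rules from the Unicode standard, which the standard library does not implement, so that count is not shown here.

```python
import unicodedata

def code_points(s: str) -> int:
    # Python strings are sequences of code points.
    return len(s)

def utf16_units(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes; "-le" avoids a byte order mark.
    return len(s.encode("utf-16-le")) // 2

samples = {
    # One code point above U+FFFF: a surrogate pair in UTF-16.
    "surrogate pair": "\U0001F600",
    # Letter "e" followed by COMBINING ACUTE ACCENT.
    "combining character": "e\u0301",
    # Three emoji joined by ZERO WIDTH JOINER into one displayed symbol.
    "emoji sequence": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

for label, text in samples.items():
    print(label, code_points(text), utf16_units(text))

# NFC normalization composes "e" + accent into the single code point U+00E9.
composed = unicodedata.normalize("NFC", samples["combining character"])
print(code_points(composed))
```

Each sample displays as one character, yet the counts are 1 and 2, 2 and 2, and 5 and 8 (code points and UTF-16 code units). Normalization reduces the combining example to one code point, but it does not help with the emoji sequence, which has no precomposed form.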