Unicode Explained: A Beginner's Guide to Character Encoding
Unicode is the universal standard for representing text in computers. It assigns a unique number (code point) to every character in every writing system, from Latin letters to Chinese ideographs to emoji. Understanding Unicode is essential for anyone working with text across languages and platforms.
What Is Unicode?
Before Unicode, different regions and vendors used different encoding systems (ASCII for English, Shift_JIS for Japanese, GB2312 for Chinese). Text written under one encoding often turned into unreadable characters when opened under another, which caused constant compatibility issues when sharing text across systems. Unicode solved this by defining a single character set that covers all writing systems: over 154,000 characters across 168 scripts as of Unicode 16.0.
UTF-8, UTF-16, and UTF-32 Compared
| Encoding | Bytes per Code Point | Best For | Notes |
|---|---|---|---|
| UTF-8 | 1–4 bytes | Web, storage | Most common; backward-compatible with ASCII |
| UTF-16 | 2 or 4 bytes | Windows, Java, JavaScript | Used internally by many programming languages |
| UTF-32 | 4 bytes (fixed) | Internal processing | Simple but memory-intensive |
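The sizes in the table can be checked directly. The following is a minimal Python sketch, not a benchmark: the helper name `encoded_sizes` and the sample characters are chosen here for illustration, and the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
# Compare how many bytes one code point needs in each encoding.
# The "-le" codecs are used so Python does not prepend a byte order mark.

def encoded_sizes(ch: str) -> dict:
    """Return the encoded size in bytes of `ch` for UTF-8, UTF-16, and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# Sample characters: Latin "A", accented "é", CJK "中", and the emoji U+1F600.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

For these samples, UTF-8 grows from 1 to 4 bytes, UTF-16 uses 2 bytes until the emoji (which needs 4), and UTF-32 always uses 4.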
Code Points and Characters
A Unicode code point is written as U+ followed by four to six hexadecimal digits (e.g., U+0041 for "A"). The Basic Multilingual Plane (BMP) covers code points U+0000 to U+FFFF and includes most commonly used characters. The supplementary planes (U+10000 to U+10FFFF) contain most emoji, historic scripts, and rare characters.
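As a small illustration of the notation, Python's built-in `ord` and `chr` convert between a character and its code point. The helper `describe` below is a hypothetical name used only for this sketch; it relies on the fact that each plane spans 0x10000 code points, so shifting right by 16 bits gives the plane number.

```python
def describe(ch: str) -> str:
    """Format a single character as U+XXXX and say which plane it lives in."""
    cp = ord(ch)          # character -> code point (an integer)
    plane = cp >> 16      # each plane holds 0x10000 code points; plane 0 is the BMP
    where = "BMP" if plane == 0 else f"supplementary plane {plane}"
    return f"U+{cp:04X} ({where})"

print(describe("A"))            # U+0041 (BMP)
print(describe("\U0001F600"))   # U+1F600 (supplementary plane 1)
print(chr(0x41))                # code point -> character: A
```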
Why Character Count Differs from Byte Count
In UTF-8, an ASCII character (such as A–Z) uses 1 byte, a European accented character typically uses 2 bytes, most CJK characters use 3 bytes, and most emoji use 4 bytes. This means a string of 10 code points could be anywhere from 10 to 40 bytes depending on the characters used.
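The gap between the two counts is easy to see in code. This sketch assumes Python, where `len` on a string counts code points; the function name `char_and_byte_counts` and the sample strings are illustrative only.

```python
def char_and_byte_counts(text: str) -> tuple:
    """Return (number of code points, number of UTF-8 bytes) for `text`."""
    return len(text), len(text.encode("utf-8"))

print(char_and_byte_counts("A" * 10))                  # (10, 10)
print(char_and_byte_counts("\U0001F600" * 10))         # (10, 40)
print(char_and_byte_counts("na\u00efve \u4e2d\u6587")) # (8, 13)
```

The last string mixes ASCII, one accented letter, and two CJK characters, so its byte count falls between the two extremes.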
Emoji and Unicode
Emoji are Unicode characters, but they are more complex than they appear:
- Basic emoji (😀) use a single code point
- Skin tone variants use two code points (base + modifier)
- Family emoji (👨‍👩‍👧‍👦) use up to 7 code points: four person emoji joined by three Zero Width Joiners (U+200D)
- Country flag emoji use two Regional Indicator Symbol code points, one for each letter of the country code
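The sequences in the list above can be built from their parts to make the code point counts visible. This is a rough Python sketch: the names `code_points` and `country_flag` are hypothetical helpers, the emoji are written as escapes so the source survives copy and paste, and whether a sequence renders as a single glyph depends on the font and platform.

```python
ZWJ = "\u200d"                 # Zero Width Joiner
REGIONAL_INDICATOR_A = 0x1F1E6 # Regional Indicator Symbol Letter A

def code_points(text: str) -> list:
    """List every code point in `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

def country_flag(country_code: str) -> str:
    """Map a two-letter code such as 'JP' to its pair of regional indicators."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )

basic = "\U0001F600"                         # grinning face
thumbs_up_medium = "\U0001F44D\U0001F3FD"    # thumbs up + medium skin tone modifier
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])
flag = country_flag("JP")

for label, emoji in [("basic", basic), ("skin tone", thumbs_up_medium),
                     ("family", family), ("flag", flag)]:
    print(label, len(emoji), code_points(emoji))
```

The printed lengths are 1, 2, 7, and 2 code points, even though each sequence is meant to display as one emoji.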
Practical Implications for Developers
- String length varies by method: JavaScript's `.length` counts UTF-16 code units, not characters. An emoji may report as length 2.
- Database storage: Use UTF-8 (utf8mb4 in MySQL) to support all Unicode characters including emoji
- URL encoding: Non-ASCII characters in URLs must be percent-encoded
- Sorting: Unicode collation is locale-dependent and complex
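Three of the points above can be sketched in a few lines of Python. The helper `utf16_length` is a hypothetical name that mimics what a UTF-16-based length would report; `urllib.parse.quote` is the standard library's percent-encoder; and the sort uses plain code point order, which is an assumption made here to show why locale-aware collation is needed rather than a recommended approach.

```python
from urllib.parse import quote

def utf16_length(text: str) -> int:
    """Count UTF-16 code units, which is what a UTF-16-based length reports."""
    return len(text.encode("utf-16-le")) // 2

emoji = "\U0001F600"
print(len(emoji))           # 1 code point
print(utf16_length(emoji))  # 2 UTF-16 code units (a surrogate pair)

# Percent-encoding: non-ASCII characters become their UTF-8 bytes as %XX.
print(quote("caf\u00e9"))   # caf%C3%A9

# Plain sorting compares code points, so "é" (U+00E9) lands after "z" (U+007A).
words = ["zebra", "\u00e9clair", "apple"]
print(sorted(words))        # ['apple', 'zebra', 'éclair']
```

A dictionary in most languages would place "éclair" near "e", which is why real applications usually rely on a collation library or the database's collation settings instead of raw code point order.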
Conclusion
Unicode is the foundation of modern text processing. Understanding the difference between characters, code points, and bytes helps you build robust multilingual applications. Use Character Counter to see how your text measures in both characters and bytes.