Unicode Explained: A Beginner's Guide to Character Encoding

Unicode is the universal standard for representing text in computers. It assigns a unique number (code point) to every character in every writing system, from Latin letters to Chinese ideographs to emoji. Understanding Unicode is essential for anyone working with text across languages and platforms.

What Is Unicode?

Before Unicode, different regions used different encoding systems (ASCII for English, Shift_JIS for Japanese, GB2312 for Chinese). This caused constant compatibility issues when sharing text across systems. Unicode solved this by defining a single character set that covers all writing systems: as of Unicode 16.0, over 154,000 characters across 168 scripts. UTF-8, UTF-16, and UTF-32 are different ways of storing those same code points as bytes.

UTF-8, UTF-16, and UTF-32 Compared

Encoding  Bytes per Character  Best For                   Notes
UTF-8     1–4 bytes            Web, storage               Most common; backward-compatible with ASCII
UTF-16    2 or 4 bytes         Windows, Java, JavaScript  Used internally by many programming languages
UTF-32    4 bytes (fixed)      Internal processing        Simple but memory-intensive
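
To make the comparison concrete, here is a minimal Python sketch that encodes one character at a time and reports its size in each encoding. The helper name encoded_sizes is an assumption made for illustration; the "-le" codec variants are used only so that Python does not prepend a byte order mark, which would inflate the counts.

```python
# Compare how many bytes one character occupies in each encoding form.
# encoded_sizes is a hypothetical helper name, not part of any library.

def encoded_sizes(ch: str) -> dict:
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The little-endian variants skip the byte order mark (BOM).
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# Escapes are used so the source file itself stays pure ASCII.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, e-acute, a CJK ideograph, an emoji

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

In this sketch, only the emoji needs 4 bytes in UTF-16, because it lies outside the Basic Multilingual Plane and is stored as a surrogate pair.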

Code Points and Characters

A Unicode code point is written as U+XXXX (e.g., U+0041 for "A"), where XXXX is the number in hexadecimal. The Basic Multilingual Plane (BMP) covers code points U+0000 to U+FFFF and includes most commonly used characters. Supplementary planes (U+10000 to U+10FFFF) contain most emoji, historic scripts, and rare characters.
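
The U+XXXX notation and the plane boundaries can be checked with a short Python sketch. The function name describe is a hypothetical helper; ord() is the built-in that returns a character's code point.

```python
# Print a character's code point in U+XXXX form and say which plane it is in.
# describe is a hypothetical helper name used only for this illustration.

def describe(ch: str) -> str:
    cp = ord(ch)       # code point as an integer
    plane = cp >> 16   # each plane holds 0x10000 code points; plane 0 is the BMP
    region = "BMP" if plane == 0 else f"supplementary plane {plane}"
    return f"U+{cp:04X} ({region})"

print(describe("A"))           # U+0041 (BMP)
print(describe("\U0001F600"))  # U+1F600 (supplementary plane 1)
```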

Why Character Count Differs from Byte Count

In UTF-8, an ASCII character (such as A–Z) uses 1 byte, a European accented character typically uses 2 bytes, most CJK characters use 3 bytes, and most emoji use 4 bytes. This means a 10-character string could be anywhere from 10 to 40 bytes depending on the characters used.
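
A small Python sketch shows the 10-to-40-byte range directly. It assumes "character" means code point, which is what Python's len() counts on a str; char_and_byte_count is a hypothetical helper name.

```python
# Compare the number of code points in a string with its UTF-8 byte length.
# char_and_byte_count is a hypothetical helper name.

def char_and_byte_count(text: str) -> tuple:
    # len() on a Python str counts code points, not bytes.
    return len(text), len(text.encode("utf-8"))

print(char_and_byte_count("A" * 10))           # (10, 10): ten ASCII letters
print(char_and_byte_count("\U0001F600" * 10))  # (10, 40): ten 4-byte emoji
print(char_and_byte_count("A\u00e9\u4e2d\U0001F600"))  # (4, 10): 1 + 2 + 3 + 4 bytes
```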

Emoji and Unicode

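Emoji are Unicode characters, but they are more complex than they appear: a single visible emoji can be built from several code points, for example a base emoji followed by a skin-tone modifier, or several emoji joined by the zero-width joiner (U+200D).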
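
As an illustration of that complexity, the following Python sketch breaks two emoji sequences into their code points. The sequences are common examples chosen for this sketch (a thumbs-up with a skin-tone modifier, and a family joined with U+200D); whether each one renders as a single glyph depends on the font and platform. The helper name code_points is hypothetical.

```python
import unicodedata

# List the code points behind what may display as a single emoji.
# code_points is a hypothetical helper name.

def code_points(text: str) -> list:
    return [f"U+{ord(ch):04X}" for ch in text]

# Thumbs up followed by a skin-tone modifier.
thumbs_up = "\U0001F44D\U0001F3FD"
# Three person emoji joined by zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

for label, seq in [("thumbs up", thumbs_up), ("family", family)]:
    print(label, len(seq), "code points,", len(seq.encode("utf-8")), "UTF-8 bytes")
    for ch in seq:
        # The second argument is a fallback for code points the local
        # Unicode database does not name.
        print("  ", f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

Counting what a reader perceives as one character (a grapheme cluster) requires segmentation rules that Python's standard library does not provide, so this sketch stops at code points.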

Practical Implications for Developers

  1. String length varies by method: JavaScript's .length counts UTF-16 code units, not characters. An emoji may report as length 2.
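  2. Database storage: Use UTF-8 (utf8mb4 in MySQL, where the older utf8 charset stores at most 3 bytes per character) to support all Unicode characters, including emoji.
  3. URL encoding: Non-ASCII characters in URLs must be percent-encoded, normally as their UTF-8 bytes.
  4. Sorting: Unicode collation is locale-dependent and complex, so sorting by code point rarely matches what users expect.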
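
The first three points above can be sketched in Python. The helper names utf16_code_units and needs_utf8mb4 are assumptions made for this illustration; quote comes from the standard urllib.parse module. The first helper approximates what JavaScript's .length reports by counting 2-byte UTF-16 code units.

```python
from urllib.parse import quote

# Hypothetical helpers illustrating the list above.

def utf16_code_units(text: str) -> int:
    # Each UTF-16 code unit is 2 bytes; "-le" avoids a byte order mark.
    return len(text.encode("utf-16-le")) // 2

def needs_utf8mb4(text: str) -> bool:
    # Code points above U+FFFF take 4 bytes in UTF-8, which exceeds the
    # 3-byte limit of MySQL's older utf8 charset.
    return any(ord(ch) > 0xFFFF for ch in text)

emoji = "\U0001F600"
print(len(emoji))               # 1 code point
print(utf16_code_units(emoji))  # 2 UTF-16 code units (a surrogate pair)
print(needs_utf8mb4(emoji))     # True
print(needs_utf8mb4("caf\u00e9"))  # False

# quote() percent-encodes the UTF-8 bytes of non-ASCII characters.
print(quote("caf\u00e9"))       # caf%C3%A9
```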

Conclusion

Unicode is the foundation of modern text processing. Understanding the difference between characters, code points, and bytes helps you build robust multilingual applications. Use Character Counter to see how your text measures in both characters and bytes.