Characters vs. Bytes: Understanding UTF-8 and Encoding Differences
In programming and database design, understanding the difference between "character count" and "byte count" is essential. Text in languages such as Japanese and Chinese is stored as multibyte characters in encodings like UTF-8, so one visible character can occupy several bytes. Misunderstanding this distinction leads to data truncation, encoding errors, and corrupted text.
Byte Counts by Encoding
| Encoding | ASCII (A-Z, 0-9) | CJK Characters | Emoji |
|---|---|---|---|
| UTF-8 | 1 byte | 3 bytes | 4 bytes |
| UTF-16 | 2 bytes | 2 bytes | 4 bytes |
| ASCII | 1 byte | Not supported | Not supported |
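For example, the word "Hello" is 5 bytes in UTF-8, while a phrase of 5 common Chinese characters is 15 bytes in UTF-8 but only 10 bytes in UTF-16. The figures in the table are the typical case: rarer CJK characters outside the Basic Multilingual Plane take 4 bytes in both UTF-8 and UTF-16, and an emoji built from several code points takes more than 4 bytes.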
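The figures above can be checked with a few lines of Python. This is a minimal sketch: the sample strings and the helper name `byte_len` are illustrative choices, not part of any library.

```python
# Sample strings are arbitrary; escapes are used so the code points are unambiguous.
samples = {
    "ascii": "Hello",
    "cjk": "\u4f60\u597d\u4e16\u754c\u554a",  # five common (BMP) Chinese characters
    "emoji": "\U0001F600",                    # one emoji outside the BMP
}

def byte_len(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

for name, text in samples.items():
    utf8 = byte_len(text, "utf-8")
    # "utf-16-le" avoids the 2-byte byte-order mark that plain "utf-16" prepends.
    utf16 = byte_len(text, "utf-16-le")
    print(f"{name}: {len(text)} chars, {utf8} bytes UTF-8, {utf16} bytes UTF-16")
```

Encoding with `"utf-16"` instead of `"utf-16-le"` would report 2 extra bytes per string because Python adds a byte-order mark.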
Common Pitfalls
- Database truncation: MySQL's legacy utf8 character set (an alias for utf8mb3) stores at most 3 bytes per character, so 4-byte characters such as emoji are rejected or truncated. Use utf8mb4 instead.
- API payload limits: A "1,000 character" text field in CJK languages can be up to 3,000 bytes in UTF-8, potentially exceeding API body size limits.
- JavaScript string length: `String.length` returns UTF-16 code units, not characters, so a single emoji may count as 2. Use `[...str].length` to count code points.
- URL encoding expansion: Non-ASCII characters expand when percent-encoded. Each 3-byte CJK character becomes 9 ASCII characters (%XX%XX%XX).
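Two of these pitfalls, byte limits and URL expansion, can be illustrated in Python. The sketch below assumes a limit expressed in UTF-8 bytes; `truncate_utf8` is a hypothetical helper written for this example, while `urllib.parse.quote` is the standard-library percent-encoder.

```python
from urllib.parse import quote

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut `text` to at most `max_bytes` of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing may leave an incomplete multibyte sequence at the end;
    # errors="ignore" drops those dangling bytes instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

text = "\u4f60\u597d\u4e16\u754c"   # four CJK characters, 12 bytes in UTF-8
short = truncate_utf8(text, 7)      # room for two whole characters (6 bytes)
print(len(short), len(short.encode("utf-8")))

encoded_url_part = quote("\u4e2d")  # percent-encodes the UTF-8 bytes
print(encoded_url_part, len(encoded_url_part))
```

This keeps code points intact but can still split a multi-code-point emoji or a base letter from its combining mark, so text shown to users may need grapheme-aware truncation.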
Language-Specific String Length Behavior
- Python 3: `len()` returns Unicode code points. Use `len(s.encode('utf-8'))` for byte count.
- JavaScript: `String.length` returns UTF-16 code units. Use `[...str].length` for code points.
- Go: `len()` returns bytes. Use `utf8.RuneCountInString()` for character count.
- Rust: `String::len()` returns bytes. Use `.chars().count()` for characters.
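To see how the three notions of length diverge for one string, the sketch below computes each of them in Python. The function name `length_report` and the sample string are illustrative, and the comments map each figure to the language calls listed above.

```python
def length_report(text: str) -> dict[str, int]:
    """Return the three lengths that different languages report for `text`."""
    return {
        # Python 3 len(), Go utf8.RuneCountInString(), Rust .chars().count()
        "code_points": len(text),
        # Go len(), Rust String::len()
        "utf8_bytes": len(text.encode("utf-8")),
        # JavaScript String.length (each UTF-16 code unit is 2 bytes)
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }

# "h", precomposed e-acute (U+00E9), "llo", then one emoji (U+1F600)
report = length_report("h\u00e9llo\U0001F600")
print(report)
```

For this sample the three values differ: the accented letter adds one extra UTF-8 byte, and the emoji takes 4 UTF-8 bytes and 2 UTF-16 code units while remaining a single code point.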
Conclusion
Understanding the character-vs-byte distinction is fundamental to building robust software. Use UTF-8 throughout (utf8mb4 in MySQL), truncate on character boundaries rather than at raw byte offsets, and test with multibyte input. Use Character Counter to check both character and byte counts for your text.