Characters vs. Bytes: Understanding UTF-8 and Encoding Differences

In programming and database design, understanding the difference between "character count" and "byte count" is essential. Languages like Japanese and Chinese use multibyte characters, where one visible character can occupy multiple bytes. Misunderstanding this distinction leads to data truncation, encoding errors, and corrupted text.

Byte Counts by Encoding

Encoding | ASCII (A-Z, 0-9) | CJK Characters | Emoji
UTF-8    | 1 byte           | 3 bytes        | 4 bytes
UTF-16   | 2 bytes          | 2 bytes        | 4 bytes
ASCII    | 1 byte           | Not supported  | Not supported

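For example, the word "Hello" is 5 bytes in UTF-8 and 10 bytes in UTF-16, while a 5-character Chinese phrase is 15 bytes in UTF-8 but only 10 bytes in UTF-16. These CJK figures assume every character is in the Basic Multilingual Plane; rarer CJK characters outside it take 4 bytes in both encodings.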
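The figures above can be checked with a short Python sketch. The helper name count_sizes is hypothetical, chosen for illustration; it relies only on the built-in str.encode method. Note that Python's len() counts Unicode code points, which may differ from what a reader perceives as one visible character (for example, emoji built from several code points).

```python
def count_sizes(text: str) -> dict[str, int]:
    """Return the code point count and encoded byte lengths of text."""
    return {
        "characters": len(text),  # Unicode code points, not bytes
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" omits the 2-byte byte order mark that "utf-16" prepends
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


for sample in ("Hello", "你好世界啊", "😀"):
    print(sample, count_sizes(sample))
# Hello {'characters': 5, 'utf8_bytes': 5, 'utf16_bytes': 10}
# 你好世界啊 {'characters': 5, 'utf8_bytes': 15, 'utf16_bytes': 10}
# 😀 {'characters': 1, 'utf8_bytes': 4, 'utf16_bytes': 4}
```

Encoding the Chinese sample with "ascii" instead raises UnicodeEncodeError, which matches the "Not supported" cells in the table.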

Common Pitfalls

Language-Specific String Length Behavior

Conclusion

Understanding the character-vs-byte distinction is fundamental to building robust software. Always use UTF-8 (specifically utf8mb4 in MySQL), truncate by character count rather than byte count, and test with multibyte input. Use Character Counter to check both character and byte counts for your text.
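
When a limit is imposed in bytes rather than characters (a fixed-size field, for instance), slicing the encoded bytes directly can cut a multibyte character in half. The following is a minimal Python sketch of one way to avoid that; the function name truncate_utf8 is hypothetical, and it assumes the text is stored as UTF-8.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting valid UTF-8 leaves at most one incomplete sequence at the end;
    # errors="ignore" drops those trailing bytes instead of raising an error.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("日本語", 4))   # 日   (3 bytes; the partial second character is dropped)
print(truncate_utf8("abc😀", 5))    # abc (the 4-byte emoji does not fit in the remaining 2 bytes)
```

By contrast, a naive "日本語".encode("utf-8")[:4].decode("utf-8") raises UnicodeDecodeError, because the fourth byte is only the first third of the second character.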