Character Count vs. Byte Count: Understanding UTF-8 and Encoding Differences

8 minute read

In programming and database design, understanding the difference between "character count" and "byte count" is essential. Languages like Japanese and Chinese use multibyte characters, where one visible character can occupy multiple bytes. Misunderstanding this distinction leads to data truncation, encoding errors, and corrupted text.

Byte Counts by Encoding

| Encoding | ASCII (A-Z, 0-9) | CJK Characters | Emoji |
| --- | --- | --- | --- |
| UTF-8 | 1 byte | 3 bytes | 4 bytes |
| UTF-16 | 2 bytes | 2 bytes | 4 bytes |
| ASCII | 1 byte | Not supported | Not supported |

For example, the word "Hello" is 5 bytes in UTF-8, while a 5-character Chinese phrase would be 15 bytes in UTF-8 but only 10 bytes in UTF-16.
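These numbers are easy to verify directly. A minimal Python check (using "utf-16-le" here so the 2-byte byte-order mark is not counted; the sample phrase is arbitrary):

```python
# Character count vs. byte count under different encodings.
for text in ("Hello", "你好世界啊"):  # a 5-letter word and a 5-character CJK phrase
    print(text, len(text),
          len(text.encode("utf-8")),      # UTF-8 bytes
          len(text.encode("utf-16-le")))  # UTF-16 bytes, without the BOM
```
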

Common Pitfalls

Language-Specific String Length Behavior

| Language | Length Method | Returns | Length of "🎉" | Length of "𠮷" | Accurate Character Count |
| --- | --- | --- | --- | --- | --- |
| JavaScript | .length | UTF-16 code units | 2 | 2 | [...str].length |
| Python 3 | len() | Code points | 1 | 1 | len(s.encode('utf-8')) for bytes |
| Java | .length() | UTF-16 code units | 2 | 2 | .codePointCount(0, s.length()) |
| Go | len() | Bytes | 4 | 4 | utf8.RuneCountInString() |
| Rust | .len() | Bytes | 4 | 4 | .chars().count() |
| Swift | .count | Grapheme clusters | 1 | 1 | .utf8.count for bytes |

Swift's design is particularly noteworthy. Its .count property returns the number of grapheme clusters, so ZWJ-joined emoji like 👨‍👩‍👧‍👦 correctly count as 1 — the closest approximation to "what the user sees." The trade-off is that Swift strings cannot be indexed in O(1) time; traversal from the beginning is required.

Rust takes the most rigorous approach to string handling. The String type is internally stored as a UTF-8 byte sequence, and index access like s[0] is a compile-time error. This forces developers to explicitly choose between byte-level and character-level access — a design philosophy that prevents encoding bugs at the language level.

How UTF-8 Works

UTF-8 is the current web standard encoding, capable of representing every character in the Unicode standard. It uses a variable-length scheme, allocating 1 to 4 bytes per character depending on the code point range. The core of this design is that the leading bit pattern of each byte uniquely identifies whether it is a start byte or a continuation byte, and how many bytes the character occupies. For a thorough treatment of encoding systems, books on character encoding provide valuable reference material.
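As a quick illustration, these bit patterns can be inspected by iterating over the encoded bytes of a 3-byte character:

```python
# The leading bits of each UTF-8 byte identify its role:
# 1110xxxx starts a 3-byte sequence; 10xxxxxx marks a continuation byte.
for b in "語".encode("utf-8"):
    print(f"{b:08b}")
```

The first byte prints as 11101000, announcing a 3-byte sequence; the remaining two both begin with 10.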

A particularly elegant aspect of this design is its "self-synchronization" property. You can start reading from any arbitrary position in a byte stream and immediately identify character boundaries by examining the leading bit patterns. Continuation bytes always start with 10, making them distinguishable from start bytes. This enables partial recovery from corrupted data and efficient random access within a file. UTF-16 lacks this self-synchronization property — starting to read mid-stream risks misinterpreting the first and second halves of a surrogate pair.
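A resynchronizing scan can be sketched in a few lines of Python (next_boundary is an illustrative helper name, not a library function):

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Scan forward from an arbitrary offset to the next UTF-8
    character boundary by skipping continuation bytes (10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

data = "日本語".encode("utf-8")  # 9 bytes; character boundaries at 0, 3, 6
# Landing mid-character at offset 1 resynchronizes at offset 3.
print(next_boundary(data, 1))
```
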

UTF-8's key advantage is backward compatibility with ASCII. English text remains exactly 1 byte per character, which is why existing ASCII-based systems work seamlessly with UTF-8. Additionally, UTF-8 byte sequences sort in the same order as Unicode code points. This means simple byte-level comparison produces correct lexicographic ordering, which is advantageous for database indexes and file system sorting.
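A small check of this ordering property, comparing code-point order (Python's default string comparison) against raw byte order:

```python
# Byte-wise comparison of UTF-8 strings matches code-point order.
words = ["apple", "Zebra", "日本", "café", "𠮷"]
by_codepoints = sorted(words)                              # compares code points
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))  # compares raw bytes
print(by_codepoints == by_bytes)  # the two orderings agree
```
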

Legacy Encodings: Shift_JIS and EUC-JP

Shift_JIS and EUC-JP were developed specifically for Japanese text and were widely used for decades. While UTF-8 migration is well underway, these encodings still appear in legacy systems, email transmission, and CSV file handling.

Shift_JIS gets its name from the way it "shifts" byte values into unused regions of JIS X 0201 (half-width katakana) to coexist with ASCII. However, this design introduced the notorious "0x5C problem": certain double-byte characters have 0x5C (the ASCII backslash "\") as their second byte, which collides with C escape sequences and Windows file path separators. 表 (0x955C), 能 (0x945C), and ソ (0x835C) are well-known problem characters that can cause bugs when used in file names or paths, an issue that is still reported today.

EUC-JP was widely used in UNIX environments. Because its second byte always falls within the 0xA1–0xFE range, it avoids the 0x5C problem entirely. This safety characteristic was one of the reasons EUC-JP was preferred in UNIX environments over Shift_JIS.
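Both behaviors can be observed with Python's built-in shift_jis and euc_jp codecs, as a quick sketch:

```python
# The 0x5C problem: in Shift_JIS, the second byte of 表 is 0x5C,
# the same byte as the ASCII backslash "\".
sjis = "表".encode("shift_jis")
print(sjis)            # b'\x95\\'
print(0x5C in sjis)    # True

# In EUC-JP, every byte of a JIS X 0208 kanji falls in 0xA1-0xFE,
# so it can never collide with an ASCII byte.
eucjp = "表".encode("euc-jp")
print(all(0xA1 <= b <= 0xFE for b in eucjp))  # True
```
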

| Feature | Shift_JIS | EUC-JP |
| --- | --- | --- |
| Primary use | Windows, legacy systems | UNIX/Linux environments |
| Bytes per CJK character | 2 bytes | 2 bytes |
| Emoji support | No | No |
| Multilingual support | Japanese only | Japanese only |
| Current recommendation | Deprecated (legacy only) | Deprecated (legacy only) |

Practical Considerations for Developers

The character-vs-byte distinction causes real problems in everyday development. Here are the most common scenarios to watch for.

| Text Example | Characters | UTF-8 Bytes | UTF-16 Bytes | Byte/Char Ratio (UTF-8) |
| --- | --- | --- | --- | --- |
| Hello | 5 | 5 | 10 | 1.0 |
| café | 4 | 5 | 8 | 1.25 |
| 日本語 | 3 | 9 | 6 | 3.0 |
| 𠮷野家 | 3 | 10 | 8 | 3.3 |
| 🎉🎊🎈 | 3 | 12 | 12 | 4.0 |
| 👨‍👩‍👧‍👦 (family emoji) | Visually 1 | 25 | 22 | — |

The "family emoji" in the last row is a composite character formed by joining four individual emoji with ZWJ (Zero Width Joiner, U+200D). It appears as a single character but internally consists of 7 code points (4 emoji + 3 ZWJ characters). JavaScript's String.length returns 11, and [...str].length returns 7. This is a prime example of how "visual character count" and "internal character count" can diverge dramatically, and it represents the most challenging case for character counting implementations.
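The numbers in that row can be reproduced directly (the string is spelled out with escapes for clarity):

```python
# 👨‍👩‍👧‍👦 = MAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + BOY
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family))                      # 7 code points (4 emoji + 3 ZWJ)
print(len(family.encode("utf-8")))      # 25 bytes (4 x 4 + 3 x 3)
print(len(family.encode("utf-16-le")))  # 22 bytes; 11 UTF-16 code units,
                                        # which is what JavaScript's .length reports
```
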

Techniques Used by Experienced Engineers

Engineers who deal with encoding issues daily rely on a set of practical techniques to avoid common traps.
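One widely used technique is byte-limited truncation that never splits a character. The sketch below (truncate_utf8 is an illustrative name) backs up past continuation bytes to the nearest start byte; note that it works at the code-point level, so a ZWJ emoji sequence can still be cut apart visually:

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text so its UTF-8 encoding fits in max_bytes,
    without splitting a character mid-sequence."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max_bytes
    # Back up past any continuation bytes (10xxxxxx) to a start byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")

print(truncate_utf8("日本語", 7))  # keeps 2 whole characters: 日本
```
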

Real-World Encoding Failures

Encoding issues can cause large-scale system failures. Here are patterns that have caused real production incidents.
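One classic pattern is sketched below: UTF-8 bytes decoded with the wrong codec (here Latin-1) turn into mojibake, and as long as no lossy conversion has happened since, reversing the exact mis-step recovers the original text:

```python
original = "日本語"
# Failure: UTF-8 bytes are mistakenly decoded as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(len(garbled))   # 9 junk characters instead of 3 kanji
# Recovery: undo the mis-decode before any lossy step occurs.
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered == original)  # True
```
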

Surprising Facts About Character Encoding

UTF-8 was designed in 1992 by Rob Pike and Ken Thompson. According to a well-known anecdote, the initial scheme was sketched on the back of a placemat at a diner in New Jersey. At the time, multiple Unicode encoding proposals were competing, and the decisive factor in UTF-8's adoption was its compatibility with C's null-terminated strings. In UTF-8, the null byte (0x00) never appears except as the ASCII NUL character, so C functions like strlen() and strcpy() work without modification. Without this property, the vast existing codebase of C/UNIX software would have required rewriting, and adoption would have been significantly delayed.
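The null-byte property is easy to confirm:

```python
# UTF-8 never emits a 0x00 byte except for U+0000 itself, so C-style
# null-terminated string handling works unchanged.
sample = "Hello, 日本語 🎉"
print(0x00 in sample.encode("utf-8"))      # False
# UTF-16, by contrast, is full of null bytes for ASCII characters:
print(sample.encode("utf-16-le")[:8])      # b'H\x00e\x00l\x00l\x00'
```
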

Another lesser-known fact: while most Japanese kanji occupy 3 bytes in UTF-8, certain rare kanji in the CJK Unified Ideographs Extension B block and beyond require 4 bytes. The character "𠮷" (an alternate form of 吉 used in the restaurant chain Yoshinoya's official name) is one such 4-byte character. Technically, kanji within the BMP (Basic Multilingual Plane, U+0000–U+FFFF) are 3 bytes in UTF-8, while those placed in supplementary planes (U+10000 and above) require 4 bytes. Extension B alone contains approximately 42,000 characters, and many variant characters used in personal and place names fall within this range.
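A quick confirmation of the plane boundary:

```python
# 𠮷 (U+20BB7) lies in a supplementary plane, so it takes 4 UTF-8 bytes;
# 吉 (U+5409) sits in the BMP and takes only 3.
for ch in ("吉", "𠮷"):
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} bytes")
```
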

Conclusion

Understanding the character-vs-byte distinction is fundamental to building robust software. For new projects, standardize on UTF-8 (specifically utf8mb4 in MySQL), always truncate by character count rather than byte count, and prepare test cases that include emoji and variant characters. These practices are the most effective way to prevent encoding-related issues before they reach production. Use Character Counter to check both character and byte counts for your text.