Character Count vs. Byte Count: Understanding UTF-8 and Encoding Differences
In programming and database design, understanding the difference between "character count" and "byte count" is essential. Languages like Japanese and Chinese use multibyte characters, where one visible character can occupy multiple bytes. Misunderstanding this distinction leads to data truncation, encoding errors, and corrupted text.
Byte Counts by Encoding
| Encoding | ASCII (A-Z, 0-9) | CJK Characters | Emoji |
|---|---|---|---|
| UTF-8 | 1 byte | 3 bytes | 4 bytes |
| UTF-16 | 2 bytes | 2 bytes | 4 bytes |
| ASCII | 1 byte | Not supported | Not supported |
For example, the word "Hello" is 5 bytes in UTF-8, while a 5-character Chinese phrase would be 15 bytes in UTF-8 but only 10 bytes in UTF-16.
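These counts are easy to verify in Python (a quick sketch; the 5-character Chinese phrase here is an arbitrary example):

```python
# Counting characters vs. bytes for the same strings.
hello = "Hello"
cjk = "你好世界啊"  # an arbitrary 5-character Chinese phrase

print(len(hello), len(hello.encode("utf-8")))   # 5 characters, 5 bytes
print(len(cjk), len(cjk.encode("utf-8")))       # 5 characters, 15 bytes
print(len(cjk.encode("utf-16-le")))             # 10 bytes in UTF-16
```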
Common Pitfalls
- Database truncation: MySQL's old `utf8` encoding only supports up to 3 bytes per character, causing emoji (4 bytes) to fail. Always use `utf8mb4`. Note that MySQL 8.0+ defaults to utf8mb4, but systems upgraded from 5.7 or earlier retain the old setting
- API payload limits: A "1,000 character" text field in CJK languages can be up to 3,000 bytes in UTF-8, potentially exceeding API body size limits. If Base64 encoding is involved, data size increases by approximately 33%, further reducing the effective limit
- JavaScript string length: `String.length` returns UTF-16 code units, not characters. Emoji may count as 2. Use `[...str].length` for accurate character counts
- URL encoding expansion: Non-ASCII characters in URLs expand dramatically — each CJK character becomes 9 characters (`%XX%XX%XX`) in URL encoding. With a practical URL limit of approximately 2,000 characters, a URL path containing only CJK characters reaches this limit at around 220 characters
- CSV encoding mismatch: UTF-8 CSV files opened in Excel on Windows display garbled text because Excel falls back to the locale's legacy encoding (Shift_JIS on Japanese Windows) when no BOM is present. Adding a UTF-8 BOM (3 bytes: `EF BB BF`) resolves the issue. Note that macOS Excel correctly recognizes BOM-less UTF-8, so this is primarily a Windows issue
- GraphQL query size limits: Many GraphQL servers enforce query string size limits in bytes. Queries containing CJK variable values consume roughly 3× the bytes compared to English-only queries, hitting those limits unexpectedly early
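A short Python sketch of the payload-expansion pitfalls above, showing how URL encoding and Base64 each inflate CJK data:

```python
import base64
from urllib.parse import quote

text = "日本語"                  # 3 CJK characters, 9 bytes in UTF-8
raw = text.encode("utf-8")

# URL encoding: each byte becomes %XX, i.e. 3 output characters per byte.
print(quote(text))              # %E6%97%A5%E6%9C%AC%E8%AA%9E
print(len(quote(text)))         # 27 characters for a 3-character string

# Base64: 4 output bytes per 3 input bytes, roughly +33%.
print(len(base64.b64encode(raw)))   # 12
```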
Language-Specific String Length Behavior
| Language | Length Method | Returns | Length of "🎉" | Length of "𠮷" | Accurate Character Count |
|---|---|---|---|---|---|
| JavaScript | .length | UTF-16 code units | 2 | 2 | [...str].length |
| Python 3 | len() | Code points | 1 | 1 | len(s.encode('utf-8')) for bytes |
| Java | .length() | UTF-16 code units | 2 | 2 | .codePointCount(0, s.length()) |
| Go | len() | Bytes | 4 | 4 | utf8.RuneCountInString() |
| Rust | .len() | Bytes | 4 | 4 | .chars().count() |
| Swift | .count | Grapheme clusters | 1 | 1 | .utf8.count for bytes |
Swift's design is particularly noteworthy. Its .count property returns the number of grapheme clusters, so ZWJ-joined emoji like 👨👩👧👦 correctly count as 1 — the closest approximation to "what the user sees." The trade-off is that Swift strings cannot be indexed in O(1) time; traversal from the beginning is required.
Rust takes the most rigorous approach to string handling. The String type is internally stored as a UTF-8 byte sequence, and index access like s[0] is a compile-time error. This forces developers to explicitly choose between byte-level and character-level access — a design philosophy that prevents encoding bugs at the language level.
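To see the table's Python row in practice: `len()` counts code points, so astral-plane characters count as 1, while encoding reveals the byte cost in each representation:

```python
# Python's len() counts code points, unlike JavaScript's String.length,
# which counts UTF-16 code units.
for ch in ("🎉", "𠮷"):
    print(len(ch),                      # 1 code point
          len(ch.encode("utf-8")),      # 4 bytes in UTF-8
          len(ch.encode("utf-16-le")))  # 4 bytes in UTF-16 (a surrogate pair)
```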
How UTF-8 Works
UTF-8 is the current web standard encoding, capable of representing every character in the Unicode standard. It uses a variable-length scheme, allocating 1 to 4 bytes per character depending on the code point range. The core of this design is that the leading bit pattern of each byte uniquely identifies whether it is a start byte or a continuation byte, and how many bytes the character occupies. For a thorough treatment of encoding systems, books on character encoding provide valuable reference material.
- ASCII characters (letters, digits, basic symbols): 1 byte — leading bit pattern `0xxxxxxx`. 7 effective bits, representing 128 characters
- Extended Latin, Greek, Cyrillic, and similar scripts: 2 bytes — leading bits `110xxxxx 10xxxxxx`. 11 effective bits, covering U+0080–U+07FF (1,920 characters)
- CJK characters (Chinese, Japanese, Korean): 3 bytes — leading bits `1110xxxx 10xxxxxx 10xxxxxx`. 16 effective bits, covering U+0800–U+FFFF (approximately 63,000 characters)
- Emoji and supplementary characters: 4 bytes — leading bits `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`. 21 effective bits, covering U+10000–U+10FFFF (approximately 1 million characters)
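The leading-bit rules above can be expressed as a small helper. This is an illustrative sketch (the function name is ours), assuming a valid UTF-8 start byte:

```python
def utf8_seq_len(first_byte: int) -> int:
    """Byte length of a UTF-8 sequence, read off the start byte's leading bits."""
    if first_byte >> 7 == 0b0:         # 0xxxxxxx: ASCII, 1 byte
        return 1
    if first_byte >> 5 == 0b110:       # 110xxxxx: 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:      # 1110xxxx: 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:     # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("not a start byte (continuation byte or invalid)")

for ch in ("A", "é", "語", "🎉"):
    print(ch, utf8_seq_len(ch.encode("utf-8")[0]))   # 1, 2, 3, 4
```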
A particularly elegant aspect of this design is its "self-synchronization" property. You can start reading from any arbitrary position in a byte stream and immediately identify character boundaries by examining the leading bit patterns. Continuation bytes always start with 10, making them distinguishable from start bytes. This enables partial recovery from corrupted data and efficient random access within a file. UTF-16 lacks this property at the byte level: nothing in the byte values reveals code-unit alignment, so a reader that starts mid-stream at the wrong offset misinterprets what follows, for example by pairing the second half of one surrogate pair with the first half of the next.
UTF-8's key advantage is backward compatibility with ASCII. English text remains exactly 1 byte per character, which is why existing ASCII-based systems work seamlessly with UTF-8. Additionally, UTF-8 byte sequences sort in the same order as Unicode code points. This means simple byte-level comparison produces correct lexicographic ordering, which is advantageous for database indexes and file system sorting.
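The ordering claim is easy to check in Python: sorting strings by their raw UTF-8 bytes gives the same result as sorting by code points.

```python
# Byte-level comparison of UTF-8 matches Unicode code point order.
words = ["z", "é", "日", "A", "🎉"]
assert sorted(words) == sorted(words, key=lambda w: w.encode("utf-8"))
print(sorted(words))   # ['A', 'z', 'é', '日', '🎉']
```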
Legacy Encodings: Shift_JIS and EUC-JP
Shift_JIS and EUC-JP were developed specifically for Japanese text and were widely used for decades. While UTF-8 migration is well underway, these encodings still appear in legacy systems, email transmission, and CSV file handling.
Shift_JIS gets its name from the way it "shifts" byte values into unused regions of JIS X 0201 (half-width katakana) to coexist with ASCII. However, this design introduced the notorious "0x5C problem": certain kanji have 0x5C (the ASCII backslash "\") as their second byte, which collides with C escape sequences and file path separators. Characters like 表 (0x955C), 能 (0x945C), and ソ (0x835C) are well-known problem characters that can cause bugs when used in file names or paths — an issue that is still reported today.
EUC-JP was widely used in UNIX environments. Because its second byte always falls within the 0xA1–0xFE range, it avoids the 0x5C problem entirely. This safety characteristic was one of the reasons EUC-JP was preferred in UNIX environments over Shift_JIS.
| Feature | Shift_JIS | EUC-JP |
|---|---|---|
| Primary use | Windows, legacy systems | UNIX/Linux environments |
| Bytes per CJK character | 2 bytes | 2 bytes |
| Emoji support | No | No |
| Multilingual support | Japanese only | Japanese only |
| Current recommendation | Deprecated (legacy only) | Deprecated (legacy only) |
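The 0x5C problem is easy to reproduce with Python's built-in codecs:

```python
# 表 encodes to 0x95 0x5C in Shift_JIS; the second byte is the ASCII backslash.
sjis = "表".encode("shift_jis")
print(sjis.hex())                  # 955c
print(sjis[1] == ord("\\"))        # True

# In EUC-JP both bytes stay in the 0xA1–0xFE range, so no ASCII collision.
eucjp = "表".encode("euc_jp")
print(all(0xA1 <= b <= 0xFE for b in eucjp))   # True
```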
Practical Considerations for Developers
The character-vs-byte distinction causes real problems in everyday development. Here are the most common scenarios to watch for.
- Database column definitions: Whether `VARCHAR(255)` means "255 characters" or "255 bytes" depends on the DBMS. In MySQL with utf8mb4, `VARCHAR(255)` means 255 characters and can require up to 1,020 bytes. In Oracle Database's default configuration, `VARCHAR2(255)` means 255 bytes, which can hold only about 85 CJK characters. PostgreSQL always uses character-based semantics, guaranteeing 255 characters for `VARCHAR(255)`. Consult database design guides for detailed DBMS-specific recommendations
- API request size limits: Most APIs enforce limits in bytes, not characters. CJK text consumes roughly 3× the bytes of English text for the same character count. JSON key names and metadata overhead are also counted, further reducing the effective character capacity
- SMS character limits: A single SMS supports 160 ASCII characters but only 70 characters when using Unicode (required for CJK, emoji, and most non-Latin scripts). This is because SMS uses GSM 7-bit encoding (7 bits × 160 = 1,120 bits) and UCS-2 encoding (16 bits × 70 = 1,120 bits) interchangeably — both fit within the same 140-byte physical payload
- File size estimation: Text file size is determined by byte count, not character count. A 10,000-character Japanese document is roughly 30 KB in UTF-8. CRLF line endings (Windows) add extra bytes per line compared to LF (Unix)
- String truncation: Truncating by byte count can split a multibyte character mid-sequence, producing corrupted output or mojibake. In UTF-8, invalid truncation can be detected by checking leading bit patterns — if the final byte is a continuation byte (10xxxxxx), the truncation point is mid-character
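The continuation-byte check described above can be sketched as a boundary-safe truncation helper (the function name is ours; it assumes the input is valid UTF-8):

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut to at most `limit` bytes without splitting a UTF-8 sequence."""
    if limit >= len(data):
        return data
    end = limit
    # If the byte at the cut point is a continuation byte (10xxxxxx),
    # the cut would land mid-character: back up to the sequence's start byte.
    while end > 0 and data[end] & 0b11000000 == 0b10000000:
        end -= 1
    return data[:end]

data = "日本語".encode("utf-8")          # 9 bytes, 3 per character
print(data[:4])                          # naive cut: ends mid-character
print(truncate_utf8(data, 4).decode())   # 日 (safely cut back to 3 bytes)
```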
| Text Example | Characters | UTF-8 Bytes | UTF-16 Bytes | Byte/Char Ratio (UTF-8) |
|---|---|---|---|---|
| Hello | 5 | 5 | 10 | 1.0 |
| café | 4 | 5 | 8 | 1.25 |
| 日本語 | 3 | 9 | 6 | 3.0 |
| 𠮷野家 | 3 | 10 | 8 | 3.3 |
| 🎉🎊🎈 | 3 | 12 | 12 | 4.0 |
| 👨👩👧👦 (family emoji) | Visually 1 | 25 | 22 | — |
The "family emoji" in the last row is a composite character formed by joining four individual emoji with ZWJ (Zero Width Joiner, U+200D). It appears as a single character but internally consists of 7 code points (4 emoji + 3 ZWJ characters). JavaScript's `String.length` returns 11, and `[...str].length` returns 7. This is a prime example of how "visual character count" and "internal character count" can diverge dramatically, and it represents the most challenging case for character counting implementations.
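Python makes the internal structure of the family emoji visible. The ZWJs are written explicitly here, since many editors render the joined sequence as a single glyph:

```python
family = "👨\u200d👩\u200d👧\u200d👦"   # 4 emoji joined by 3 ZWJs

print(len(family))                       # 7 code points
print(len(family.encode("utf-8")))       # 25 bytes: 4 × 4 + 3 × 3
print(len(family.encode("utf-16-le")))   # 22 bytes: 4 × 4 + 3 × 2
```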
Techniques Used by Experienced Engineers
Engineers who deal with encoding issues daily rely on a set of practical techniques to avoid common traps.
- "Bytes ÷ 3" rule of thumb: To quickly estimate the character count of CJK text in UTF-8, divide the byte count by 3. For example, 3,000 bytes of Japanese text is roughly 1,000 characters. Mixed-language text will have a slightly higher character count. In typical Japanese technical documents where 20–30% of content is ASCII, the actual byte-to-character ratio is around 2.4–2.7
- Always use utf8mb4 in MySQL: The older `utf8` (utf8mb3) encoding only supports up to 3 bytes per character, which means emoji cannot be stored. Always specify `utf8mb4` for new projects. When migrating, note that utf8mb4 index keys consume up to 4 bytes per character — a `VARCHAR(255)` column uses 1,020 bytes of the InnoDB maximum index key length of 3,072 bytes. Verify that composite indexes don't exceed this limit
- Truncate by character count, not bytes: Byte-based truncation can split multibyte characters. In Python, use `text[:100]` (character-based). In JavaScript, use `[...text].slice(0, 100).join('')` (code point-based). However, even code point-based truncation can split ZWJ-joined emoji mid-sequence. For complete grapheme cluster-aware truncation, use JavaScript's `Intl.Segmenter` API
- Be aware of BOM: The UTF-8 BOM (byte order mark, `EF BB BF`) is 3 bytes at the start of a file. Some JSON parsers reject files with a BOM. Use BOM-less UTF-8 for programmatic files, but BOM-prefixed UTF-8 for CSV files opened in Excel. Shell scripts with a BOM at the start will fail to execute because the shebang (`#!/bin/bash`) is not recognized
- Watch for surrogate pairs: JavaScript's `String.length` returns UTF-16 code units, so emoji and some CJK characters report a length of 2 instead of 1. Use `[...str].length` (spread syntax) for accurate character counts. Regular expressions also require the `u` flag — without it, `/^.$/` won't match emoji because they are treated as two separate code units
- Encoding detection pitfalls: Automatic encoding detection is not foolproof. Short texts have higher misdetection rates, and Shift_JIS and UTF-8 have overlapping byte patterns that make accurate detection difficult for texts of only a few dozen bytes. The reliable approach is to explicitly specify the encoding at the data source and propagate it as metadata
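For the BOM advice above, Python's standard `utf-8-sig` codec handles both directions:

```python
# "utf-8-sig" writes the EF BB BF BOM on encode and strips it on decode,
# which is convenient for producing Excel-friendly CSV files.
csv_bytes = "名前,年齢\n".encode("utf-8-sig")
print(csv_bytes[:3].hex())               # efbbbf

# Reading side: the BOM disappears transparently.
print(repr(csv_bytes.decode("utf-8-sig")))   # '名前,年齢\n'
```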
Real-World Encoding Failures
Encoding issues can cause large-scale system failures. Here are patterns that have caused real production incidents.
- Emoji data corruption: A service using MySQL's `utf8` (3-byte limit) experienced data corruption when users posted emoji. Migrating to `utf8mb4` required a large-scale database migration and extended maintenance downtime. Key technical challenges during migration include table lock duration, index rebuilding, and replication lag management
- Government name registration: Administrative systems that only support a limited character set (e.g., JIS Level 1 and 2 kanji) cannot register names containing rare or variant characters. This has been a long-standing issue in Japanese government IT systems. Japan's Digital Agency published a "Character Environment Implementation Guide" in 2023, recommending Unicode adoption as the standard for government systems
- Search engine indexing failures: When a web page's Content-Type header declares one encoding but the actual content uses another, search engines cannot index the page correctly. For example, declaring UTF-8 in the header while serving Shift_JIS content results in garbled search snippets and failure to rank for the intended keywords
- Timezone name logging issues: Regions with non-ASCII timezone names (e.g., "日本標準時" in Japanese environments) can cause log parsing failures when the log collection system only expects ASCII. Standardizing log encoding to UTF-8 and ensuring parsers handle multibyte text is essential
Surprising Facts About Character Encoding
UTF-8 was designed in 1992 by Rob Pike and Ken Thompson. According to a well-known anecdote, the initial scheme was sketched on the back of a placemat at a diner in New Jersey. At the time, multiple Unicode encoding proposals were competing, and the decisive factor in UTF-8's adoption was its compatibility with C's null-terminated strings. In UTF-8, the null byte (0x00) never appears except as the ASCII NUL character, so C functions like strlen() and strcpy() work without modification. Without this property, the vast existing codebase of C/UNIX software would have required rewriting, and adoption would have been significantly delayed.
Another lesser-known fact: while most Japanese kanji occupy 3 bytes in UTF-8, certain rare kanji in the CJK Unified Ideographs Extension B block and beyond require 4 bytes. The character "𠮷" (an alternate form of 吉 used in the restaurant chain Yoshinoya's official name) is one such 4-byte character. Technically, kanji within the BMP (Basic Multilingual Plane, U+0000–U+FFFF) are 3 bytes in UTF-8, while those placed in supplementary planes (U+10000 and above) require 4 bytes. Extension B alone contains approximately 42,000 characters, and many variant characters used in personal and place names fall within this range.
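A quick Python check of the BMP boundary:

```python
# BMP kanji take 3 UTF-8 bytes; supplementary-plane kanji take 4.
for ch in ("吉", "𠮷"):
    cp = ord(ch)
    plane = "BMP" if cp <= 0xFFFF else "supplementary"
    print(f"U+{cp:04X}", plane, len(ch.encode("utf-8")), "bytes")
# U+5409 BMP 3 bytes
# U+20BB7 supplementary 4 bytes
```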
Conclusion
Understanding the character-vs-byte distinction is fundamental to building robust software. For new projects, standardize on UTF-8 (specifically utf8mb4 in MySQL), always truncate by character count rather than byte count, and prepare test cases that include emoji and variant characters. These practices are the most effective way to prevent encoding-related issues before they reach production. Use Character Counter to check both character and byte counts for your text.