Character Encoding

A system of rules that maps characters to sequences of bits. It consists of two layers: a character set (which characters are included) and an encoding scheme (how those characters are converted to byte sequences).

Character encoding is the foundational technology that allows computers to handle text. It defines the rules for converting human-readable characters like "A" or "あ" into numeric values that machines can process. Without these conversion rules, storing, transmitting, and displaying text would be impossible.

Character encoding is best understood as two distinct layers. The first layer is the character set, which defines which characters are available and assigns each one a number (its code point). ASCII covers 128 characters, JIS X 0208 covers 6,879 characters, and Unicode covers over 150,000 characters. The second layer is the encoding scheme, which determines how each character in the set is represented as a byte sequence. This is why multiple encoding schemes (UTF-8, UTF-16, UTF-32) can exist for the same Unicode character set.
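The two layers can be seen directly in a short Python sketch. It uses only built-in codecs; the variable names are illustrative. The big-endian codec variants are chosen so that no byte order mark is added to the output.

```python
# One character, one code point, several byte representations.
ch = "あ"
code_point = ord(ch)  # layer 1: position in the Unicode character set (0x3042)

# Layer 2: the encoding scheme decides the bytes.
# The "-be" variants are used so that no byte order mark is prepended.
encoded = {
    "utf-8": ch.encode("utf-8"),
    "utf-16-be": ch.encode("utf-16-be"),
    "utf-32-be": ch.encode("utf-32-be"),
}

for name, data in encoded.items():
    print(f"U+{code_point:04X} in {name}: {data.hex(' ')} ({len(data)} bytes)")
```

The code point stays the same in every case; only the byte sequence changes with the scheme.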

Japanese character encoding has a particularly complex history. JIS C 6226 was established in 1978 and renamed JIS X 0208 in 1987, and from it emerged three competing encoding schemes: Shift_JIS (developed by ASCII Corporation together with Microsoft for PCs), EUC-JP (widely adopted on UNIX systems), and ISO-2022-JP (used for email). All three represent the same character set with different byte sequences, so text decoded with the wrong scheme, or passed through a faulty conversion, frequently turned into mojibake (garbled characters).
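A small Python sketch of how this goes wrong, using the standard library codecs "shift_jis", "euc_jp", and "iso2022_jp". The sample string and variable names are illustrative, and the exact garbled output depends on the input, so it is not shown here.

```python
text = "文字コード"

# The same characters produce three different byte sequences.
sjis = text.encode("shift_jis")
eucjp = text.encode("euc_jp")
jis = text.encode("iso2022_jp")

# Decoding Shift_JIS bytes as if they were EUC-JP yields mojibake.
# errors="replace" substitutes U+FFFD for invalid bytes instead of raising.
garbled = sjis.decode("euc_jp", errors="replace")

# Decoding with the scheme that produced the bytes restores the text.
restored = sjis.decode("shift_jis")

print(sjis.hex(" "))
print(eucjp.hex(" "))
print(jis.hex(" "))
print(garbled)
print(restored)
```

The bytes themselves carry no label saying which scheme produced them, which is why the encoding has to be communicated out of band (for example in an email header or HTTP header).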

Today, the industry has largely converged on Unicode + UTF-8. UTF-8 maintains full backward compatibility with ASCII: ASCII characters take 1 byte each, most Japanese characters take 3 bytes each, and characters outside the Basic Multilingual Plane (such as emoji and rarer kanji) take 4 bytes. This variable-length design lets it handle every writing system covered by Unicode while remaining compatible with ASCII-based infrastructure. UTF-8 is the dominant encoding on the web and the usual default in modern databases and programming-language tooling, and there is rarely a reason to choose anything else for new projects.
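The variable-length design and the ASCII compatibility can both be checked in a few lines of Python. This is a minimal sketch; the sample characters are arbitrary choices for illustration.

```python
# "\u00e9" is é written as an escape so it is the single precomposed code point.
samples = ["A", "\u00e9", "あ", "😀"]

# Number of UTF-8 bytes needed for each character.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, width in widths.items():
    print(f"{ch!r}: {width} byte(s)")

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```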

Understanding character encoding is essential for accurate character counting. The same character "あ" occupies 3 bytes in UTF-8, 2 bytes in Shift_JIS, and 4 bytes in UTF-32. "Character count" and "byte count" only coincide within the ASCII range; for Japanese text, they almost always differ. Whether a database's VARCHAR(255) means "255 characters" or "255 bytes" depends on the database system and its encoding configuration, and getting this wrong leads to silent data truncation.
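The gap between the two counts, and one way to stay inside a byte limit without cutting a character in half, can be sketched in Python. The helper name truncate_utf8 is hypothetical, not part of any library, and note that len() on a Python string counts code points, which is not always what a user perceives as one character.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text that fits in max_bytes of UTF-8."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # Slicing bytes may cut a multi-byte character in half;
    # errors="ignore" drops the incomplete sequence left at the end.
    return data[:max_bytes].decode("utf-8", errors="ignore")


text = "あいう"
char_count = len(text)                  # code points
byte_count = len(text.encode("utf-8"))  # bytes in UTF-8

print(char_count, byte_count)
print(truncate_utf8(text, 7))
```

A column limited to 7 bytes could hold only the first two characters of this string, even though the string is just 3 characters long.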

A practical concern when working with legacy systems is that converting from Unicode to older encodings like Shift_JIS can be lossy. Characters that exist in Unicode but not in Shift_JIS (certain emoji, CJK Unified Ideographs Extension B and beyond) are replaced with "?" or "〓" during conversion. Character encoding conversion is a potentially irreversible operation that can result in permanent data loss.
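A hedged Python sketch of this loss, using the standard "shift_jis" codec. The helper unencodable_chars is a hypothetical name introduced for illustration; it reports which characters would be lost before any conversion is committed.

```python
def unencodable_chars(text: str, encoding: str) -> list[str]:
    """List the characters in text that the target encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


original = "日本語😀"

# errors="replace" silently substitutes "?" for anything unrepresentable.
lossy = original.encode("shift_jis", errors="replace")
round_tripped = lossy.decode("shift_jis")

print(unencodable_chars(original, "shift_jis"))
print(round_tripped)
```

Once the emoji has become "?", nothing in the converted bytes records what it used to be. Checking for unencodable characters first, or encoding with the default strict error handling so that a UnicodeEncodeError is raised, makes the loss visible instead of silent.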
