Variable-Length Encoding

An encoding scheme where different characters use different numbers of bytes. UTF-8 (1 to 4 bytes) and Shift_JIS (1 to 2 bytes) are representative examples, achieving efficiency by representing frequently used characters with shorter byte sequences.

Variable-length encoding is a character encoding scheme that assigns different numbers of bytes to different characters, rather than representing every character with the same fixed number of bytes. The opposite approach is fixed-length encoding; typical examples are UTF-32 (every character uses 4 bytes) and ASCII (every character uses 1 byte).

The primary advantage of variable-length encoding is space efficiency. In UTF-8, ASCII characters (letters, digits, symbols) use just 1 byte, kana and commonly used kanji use 3 bytes, and most emoji use 4 bytes. For text that is predominantly ASCII, such as much of the English-language web, this means up to a 75% reduction in data size compared to UTF-32. Even for Japanese text, UTF-8's 3 bytes per character is 25% more compact than UTF-32's 4 bytes per character.
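The size figures above can be checked directly. The following is a minimal sketch in Python using the standard `str.encode` method; the helper name `encoded_sizes` and the sample strings are illustrative choices, not part of any library.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of `text` in UTF-8 and in UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-32-le" is used because plain "utf-32" prepends a 4-byte
        # byte-order mark, which would distort the comparison.
        "utf-32": len(text.encode("utf-32-le")),
    }


for sample in ("Hello", "こんにちは", "😀"):
    print(sample, encoded_sizes(sample))
```

For "Hello" this reports 5 bytes against 20 (the 75% saving), and for "こんにちは" 15 bytes against 20 (the 25% saving).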

UTF-8's variable-length design is elegantly engineered. The bit pattern of the leading byte alone reveals how many bytes the character occupies: 1-byte characters have the form 0xxxxxxx, 2-byte characters 110xxxxx 10xxxxxx, 3-byte characters 1110xxxx 10xxxxxx 10xxxxxx, and 4-byte characters 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Every continuation byte begins with 10, a prefix that no leading byte uses, so the two kinds of byte can never be mistaken for each other. This self-synchronizing property means that even if you start reading from the middle of a byte stream, you can reliably detect character boundaries.
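The bit patterns described above translate into a few mask-and-compare tests. This is a sketch only: the function names `sequence_length` and `next_boundary` are hypothetical, and the code classifies bytes by prefix without performing full UTF-8 validation (it does not reject overlong or out-of-range sequences).

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length implied by a UTF-8 leading byte.

    Returns 0 for a continuation byte (10xxxxxx) or any other byte
    that cannot start a sequence.
    """
    if (lead & 0b1000_0000) == 0b0000_0000:  # 0xxxxxxx
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:  # 110xxxxx
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:  # 1110xxxx
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:  # 11110xxx
        return 4
    return 0


def next_boundary(data: bytes, pos: int) -> int:
    """Skip forward from `pos` past continuation bytes to a character start."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos


data = "aあ😀".encode("utf-8")    # 1-byte, 3-byte and 4-byte characters
start = next_boundary(data, 2)    # offset 2 is in the middle of "あ"
print(start, sequence_length(data[start]))  # 4 4
print(data[start:].decode("utf-8"))         # 😀
```

Starting at offset 2, in the middle of "あ", the scan skips at most the remaining continuation bytes and lands on the leading byte of the next character.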

Shift_JIS is also a variable-length encoding. ASCII-compatible characters and half-width katakana use 1 byte, while kanji and hiragana use 2 bytes. However, Shift_JIS lacks UTF-8's self-synchronizing property: the second byte of a 2-byte character can take values that are also valid as a single-byte character or as a first byte, so a reader that starts in the middle of a byte stream cannot tell from a byte alone whether it begins a character or ends one. This design weakness makes text processing with Shift_JIS considerably more complex.
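The ambiguity can be observed with Python's built-in `shift_jis` codec. In the sketch below, the helper name `chars_with_ascii_range_bytes` and the sample text are illustrative; it lists multi-byte characters whose encoded form contains a byte below 0x80, that is, a byte that would read as an ASCII-range character if seen on its own.

```python
def chars_with_ascii_range_bytes(text: str, encoding: str) -> list[str]:
    """Return the multi-byte characters in `text` whose encoded form
    contains a byte below 0x80 under the given encoding."""
    hits = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            hits.append(ch)
    return hits


sample = "表ソあ"
print(chars_with_ascii_range_bytes(sample, "shift_jis"))  # ['表', 'ソ']
print(chars_with_ascii_range_bytes(sample, "utf-8"))      # []
print("表".encode("shift_jis").hex())                     # 955c
```

The second byte of "表" and "ソ" in Shift_JIS is 0x5C, the same value as the single-byte backslash (displayed as a yen sign in many Japanese environments), so a naive byte-level search for that character produces false matches. In UTF-8 the list is empty because every byte of a multi-byte sequence has its high bit set.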

The biggest pitfall of variable-length encoding is that "character count" and "byte count" do not match. In UTF-8, "Hello" is 5 bytes (5 characters), but "こんにちは" is 15 bytes (still 5 characters). Accessing the nth character in a string requires scanning from the beginning and counting bytes sequentially; you cannot simply multiply by a fixed byte width as you would with a fixed-length encoding. Indexing by character position therefore takes time proportional to the length of the string rather than constant time, which affects the performance of string operations in programming languages that store strings in a variable-length encoding.
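The scan described above can be sketched as follows, assuming the input is valid UTF-8 and treating each Unicode code point as one character. The function name `byte_offset_of_char` is hypothetical.

```python
def byte_offset_of_char(data: bytes, n: int) -> int:
    """Return the byte offset at which the n-th (0-based) character of
    valid UTF-8 `data` starts, found by scanning from the beginning."""
    seen = 0
    for offset, byte in enumerate(data):
        # Any byte that is not a continuation byte (10xxxxxx) starts a character.
        if (byte & 0b1100_0000) != 0b1000_0000:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("character index out of range")


text = "こんにちは"
data = text.encode("utf-8")
print(len(text), len(data))          # 5 15
print(byte_offset_of_char(data, 3))  # 9
```

With a fixed-length encoding such as UTF-32 the same offset would simply be `n * 4`, with no scan required.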

For database and file system capacity planning, the characteristics of variable-length encoding must be properly understood. In MySQL, a VARCHAR(255) column using UTF-8 (utf8mb4) holds a maximum of 255 characters, but those characters can occupy up to 1,020 bytes (255 × 4). Storage estimates based on character count rather than byte count will diverge significantly from reality.
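The arithmetic behind such an estimate is simple enough to sketch. The helper names `worst_case_bytes` and `utf8_bytes` below are illustrative, and the calculation covers only the character data, not any per-row overhead the database adds.

```python
def worst_case_bytes(max_chars: int, max_bytes_per_char: int = 4) -> int:
    """Upper bound on the bytes needed for `max_chars` characters,
    assuming at most `max_bytes_per_char` bytes each (4 for utf8mb4)."""
    return max_chars * max_bytes_per_char


def utf8_bytes(value: str) -> int:
    """Actual number of bytes `value` occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))


print(worst_case_bytes(255))  # 1020
for value in ("Tokyo", "東京都"):
    print(value, len(value), utf8_bytes(value))
```

Measuring `utf8_bytes` over a sample of real data usually gives a more realistic estimate than either the character count or the worst-case bound.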
