Mojibake (Character Corruption)

A phenomenon where text displays as garbled symbols or incorrect characters due to a mismatch between the encoding used to write the data and the encoding used to read it.

Mojibake is a Japanese term that has been adopted internationally to describe the garbled text that appears when a text file is decoded using a different character encoding than the one it was encoded with. For example, opening a UTF-8 encoded Japanese file as Shift_JIS produces strings of meaningless characters, while opening a Shift_JIS file as UTF-8 floods the screen with replacement characters (U+FFFD) or completely wrong glyphs.
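Both mismatches described above can be reproduced in a few lines. The following is a minimal Python sketch; the sample string and variable names are assumptions chosen for illustration, and errors="replace" is used so that undecodable bytes become U+FFFD instead of raising an exception.

```python
# Illustrative sketch: reproduce the two encoding mismatches described above.
original = "こんにちは"  # sample text, chosen only for this demo

# Case 1: UTF-8 bytes misread as Shift_JIS yield unrelated characters.
utf8_bytes = original.encode("utf-8")
as_sjis = utf8_bytes.decode("shift_jis", errors="replace")

# Case 2: Shift_JIS bytes misread as UTF-8 are mostly invalid sequences,
# so the decoder substitutes U+FFFD replacement characters.
sjis_bytes = original.encode("shift_jis")
as_utf8 = sjis_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
recovered = utf8_bytes.decode("utf-8")
```

In both failing cases the bytes themselves are unchanged; only their interpretation differs, which is why re-decoding with the correct encoding restores the text.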

Three scenarios account for most mojibake occurrences. First, a file is saved in one encoding but opened in another. Second, a database connection's character set does not match the table's character set. Third, the Content-Type HTTP response header specifies a charset that differs from the actual encoding of the HTML file. All three share the same root cause: the writer and reader disagree on the encoding contract.
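The third scenario can be checked mechanically. The sketch below, which assumes Python's standard library only, extracts the charset parameter from a Content-Type header value and tries a strict decode of the body with it. The function names are hypothetical helpers, not part of any framework.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the declared charset in lowercase, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def body_matches_declared_charset(header_value, body):
    """Return True if the body bytes decode cleanly with the declared charset."""
    charset = charset_from_content_type(header_value)
    if charset is None:
        return False  # nothing declared, so the contract is undefined
    try:
        body.decode(charset)  # strict decoding: invalid bytes raise
    except (UnicodeDecodeError, LookupError):
        return False  # invalid byte sequence or unknown charset name
    return True
```

A successful strict decode is only a necessary condition. Single-byte encodings such as Latin-1 accept any byte sequence, so this check catches obvious mismatches (for example Shift_JIS bytes declared as UTF-8) but cannot prove that the declaration is correct.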

Historically, mojibake has been closely tied to Japanese computing culture. From the 1980s through the 1990s, three major Japanese encodings coexisted: JIS (the 7-bit encoding used on the Internet as ISO-2022-JP), Shift_JIS, and EUC-JP. This fragmentation caused rampant character corruption in email and web pages. Email was especially problematic because ISO-2022-JP was the nominal standard, yet different mail clients would silently use other encodings, producing garbled messages on the receiving end.

Today, UTF-8 has become the de facto standard for the web, and the frequency of mojibake has dropped dramatically. According to W3Techs, over 98% of websites now use UTF-8. However, mojibake has not disappeared entirely. It still surfaces when interfacing with legacy systems, exchanging CSV files (Excel expects BOM-prefixed UTF-8), or migrating old databases that used region-specific encodings.

Preventing mojibake in practice comes down to a few clear rules. Save files as UTF-8 (without BOM). In MySQL or MariaDB, set the database, table, and connection character sets to utf8mb4. Include Content-Type: text/html; charset=UTF-8 in HTTP responses. When generating CSV files for Excel, output BOM-prefixed UTF-8. Following these conventions consistently eliminates the vast majority of mojibake in modern development environments.
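The two file-related rules can be sketched in Python as follows. This is a minimal example under stated assumptions: the helper names are hypothetical, and it relies on Python's "utf-8" codec writing no BOM while "utf-8-sig" prepends one.

```python
import csv
from pathlib import Path


def write_utf8_text(path, text):
    """Write ordinary text as UTF-8 without a BOM."""
    # Always pass the encoding explicitly; the platform default may differ.
    Path(path).write_text(text, encoding="utf-8")


def write_csv_for_excel(path, rows):
    """Write a CSV as BOM-prefixed UTF-8 for opening in Excel."""
    # "utf-8-sig" emits the bytes EF BB BF before the first character.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)
```

Naming the encoding at every read and write is the main design choice here: it keeps the encoding contract visible in the code instead of depending on the locale of whichever machine runs it.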

From a character counting perspective, mojibake-affected text shows a large discrepancy between the apparent character count and the true one. For instance, when UTF-8 Japanese text is misinterpreted as Latin-1, every byte is displayed as a separate character. Because most kana and kanji occupy three bytes in UTF-8, each character appears to expand into three, inflating the character count to roughly triple its true value. Accurate character counting depends on the text's encoding being correctly identified in the first place.
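The inflation is easy to measure. The sketch below assumes a five-character hiragana sample in which every character takes three bytes in UTF-8; the variable names are illustrative only.

```python
original = "こんにちは"                  # 5 characters (sample text for this demo)
utf8_bytes = original.encode("utf-8")    # 3 bytes per hiragana character
garbled = utf8_bytes.decode("latin-1")   # Latin-1 maps each byte to one character

true_count = len(original)       # 5
byte_count = len(utf8_bytes)     # 15
garbled_count = len(garbled)     # 15, triple the true count

# Latin-1 maps all 256 byte values one-to-one, so the damage is reversible:
repaired = garbled.encode("latin-1").decode("utf-8")
```

The garbled count equals the byte count, not the character count, so a counter fed mis-decoded text is in effect counting bytes.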
