The World of Invisible Characters - Troubles Caused by Zero-Width and Invisible Characters
Your string should be 10 characters, but the system insists it's 12. No matter how hard you look, you can't see any extra characters. The culprit is "zero-width characters" - invisible characters that don't appear on screen at all, yet undeniably exist as data. This article explains the types and purposes of invisible characters defined in Unicode, their impact on character counting, and real-world trouble cases with solutions.
Invisible Character Catalog - Characters That Exist Without Being Seen
Unicode defines multiple characters that are not displayed on screen (or have zero width). These are not "bugs" - they exist for legitimate reasons in text processing.
| Character Name | Code Point | Purpose | Character Count | Display Width |
|---|---|---|---|---|
| Zero Width Space (ZWSP) | U+200B | Specifying line break opportunities | Counted as 1 character | 0 |
| Zero Width Joiner (ZWJ) | U+200D | Joining characters (emoji composition) | Counted as 1 character | 0 |
| Zero Width Non-Joiner (ZWNJ) | U+200C | Preventing character joining | Counted as 1 character | 0 |
| Left-to-Right Mark (LRM) | U+200E | Text direction control | Counted as 1 character | 0 |
| Right-to-Left Mark (RLM) | U+200F | Text direction control | Counted as 1 character | 0 |
| Byte Order Mark (BOM) | U+FEFF | Encoding identification | Counted as 1 character unless stripped on decoding | 0 |
| Soft Hyphen (SHY) | U+00AD | Specifying hyphenation points | Counted as 1 character | Usually 0 (shown only at line breaks) |
| Word Joiner (WJ) | U+2060 | Specifying no-break positions | Counted as 1 character | 0 |
All of these characters serve legitimate roles in text processing. The problem is that when they unintentionally infiltrate text, they silently throw off character counts.
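As a small illustration of how these characters affect counts, the Python sketch below builds a string that displays as 10 characters but contains two zero-width characters. The helper name `describe_invisibles` and the sample string are made up for this example, and it only checks the eight code points from the table above.

```python
import unicodedata

# Code points from the catalog table above (not an exhaustive list).
INVISIBLES = {
    "\u200b", "\u200c", "\u200d", "\u200e",
    "\u200f", "\ufeff", "\u00ad", "\u2060",
}

def describe_invisibles(text):
    """Return (index, 'U+XXXX', Unicode name) for each cataloged invisible character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in INVISIBLES
    ]

sample = "hello\u200bworld\u200d"   # displays as "helloworld"
print(len(sample))                   # 12, although only 10 characters are visible
for hit in describe_invisibles(sample):
    print(hit)
```

Reporting the index and code point, rather than just a count, makes it easier to see where in the input the stray character entered.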
Zero Width Space (U+200B) - The Most Troublesome Invisible Character
The Zero Width Space (ZWSP) is a character that embeds "you may break the line here" information into text. It's used in languages like Thai and Khmer that don't use spaces between words, allowing browsers to break lines at appropriate positions.
However, ZWSP easily infiltrates text when copying and pasting from web pages, causing troubles like:
- Form input judged as "exceeding character limit" (looks within limit visually)
- Password copy-paste failures (ZWSP infiltrates making it a different string)
- Search mismatches (identical-looking strings don't match in search)
- CSV file data not parsing correctly
- Program source code infiltration causing compile errors
Password infiltration is particularly serious. When ZWSP sneaks into a password copied from a website, you get a situation where the password looks correct but login fails. When considering password length and security, the existence of invisible characters cannot be ignored.
Zero Width Joiner (U+200D) - The Magic Character That Composes Emoji
The Zero Width Joiner (ZWJ) plays the most positive role among invisible characters. As explained in detail in the article on emoji character counting, ZWJ combines multiple emoji to create new ones.
| Displayed Emoji | Components | Code Point Count | Character Count (JavaScript) |
|---|---|---|---|
| 👨👩👧👦 (Family) | 👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦 | 7 | 11 (including surrogate pairs) |
| 👩💻 (Woman Technologist) | 👩 + ZWJ + 💻 | 3 | 5 |
| 🏳️🌈 (Rainbow Flag) | 🏳️ + ZWJ + 🌈 | 4 | 6 |
| 👨🍳 (Man Cook) | 👨 + ZWJ + 🍳 | 3 | 5 |
The family emoji 👨👩👧👦 looks like a single emoji, but internally consists of 4 emoji and 3 ZWJs. JavaScript's .length property returns 11. On social media with character limits, a single emoji like this can consume a large number of characters.
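The counts in the table can be reproduced with a short sketch. Python's `len()` counts code points, so the JavaScript-style count is approximated here by encoding to UTF-16 and halving the byte length. The function names are illustrative only.

```python
ZWJ = "\u200d"

def code_point_count(text):
    """Number of Unicode code points (what Python's len() reports)."""
    return len(text)

def utf16_unit_count(text):
    """Number of UTF-16 code units, which is what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2

# Man + ZWJ + Woman + ZWJ + Girl + ZWJ + Boy
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467" + ZWJ + "\U0001F466"
# White flag + variation selector-16 + ZWJ + rainbow
rainbow_flag = "\U0001F3F3\uFE0F" + ZWJ + "\U0001F308"

print(code_point_count(family), utf16_unit_count(family))              # 7 11
print(code_point_count(rainbow_flag), utf16_unit_count(rainbow_flag))  # 4 6
```

Each emoji outside the Basic Multilingual Plane takes two UTF-16 code units, which is why the JavaScript count is higher than the code point count.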
Direction Control Characters - Mechanisms for Right-to-Left Languages
Arabic and Hebrew are languages written right-to-left (RTL). In text where these languages coexist with English (left-to-right, LTR), invisible characters that control text direction are necessary.
U+200E (Left-to-Right Mark) and U+200F (Right-to-Left Mark) are characters for explicitly specifying text direction. When these unintentionally infiltrate text, they can disrupt display order or throw off character counts.
In 2021, a security vulnerability called "Trojan Source" was reported that exploits direction control characters. By embedding direction control characters in source code, code that looks normal to human eyes is interpreted as different logic by the compiler. This vulnerability demonstrated that invisible characters can also pose security risks.
BOM (U+FEFF) - The Invisible Character Lurking at File Beginnings
The Byte Order Mark (BOM) is a character added at the beginning of text files to identify encoding. The UTF-8 BOM is 3 bytes (EF BB BF) and is sometimes added by Windows Notepad when saving files.
BOM is ignored by many programs, but causes problems in these cases:
- BOM at the beginning of PHP files prevents header() from working (output is judged to have already started)
- BOM at the beginning of CSV files prevents the first column name from being recognized correctly
- BOM in JSON files may cause parser errors
- BOM at the beginning of shell scripts prevents the shebang (#!/bin/bash) from being recognized
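The JSON case is easy to reproduce. The sketch below uses only the Python standard library and decodes the same bytes with and without BOM handling. The byte string is a made-up sample.

```python
import codecs
import json

raw = codecs.BOM_UTF8 + b'{"name": "value"}'   # EF BB BF followed by JSON

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode("utf-8")
print(repr(with_bom[0]))        # '\ufeff'

try:
    json.loads(with_bom)
except json.JSONDecodeError as exc:
    print("parse failed:", exc)

# "utf-8-sig" strips a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")
print(json.loads(without_bom))  # {'name': 'value'}
```

Decoding with `utf-8-sig` is harmless for files that have no BOM, so it can be a reasonable default when reading files of unknown origin.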
Steganography Using Zero-Width Characters (Watermarking Technology)
Steganography (digital watermarking) is a technology that turns the "invisible" property of invisible characters on its head. By embedding patterns of zero-width characters in text, hidden information can be embedded without changing the appearance.
| Method | Characters Used | Purpose | Detection Difficulty |
|---|---|---|---|
| Zero-width character encoding | U+200B, U+200C, U+200D, U+FEFF | Embedding hidden messages in text | High (invisible to the eye) |
| User tracking | Same as above | Identifying leak sources during information breaches | High |
| Copy detection | Same as above | Detecting unauthorized content copying | Medium |
For example, by treating 4 types of zero-width characters as 2-bit information (U+200B = 00, U+200C = 01, U+200D = 10, U+FEFF = 11) and inserting zero-width characters between each word in text, binary data can be hidden within it.
This technology is sometimes used by companies to identify the source of confidential document leaks. By embedding different zero-width character patterns for each recipient, when a document leaks externally, the source can be identified.
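The 2-bit scheme described above can be sketched in a few lines of Python. This is a toy illustration and not a description of any real watermarking product: the function names, the cover sentence, and the payload `b"ID7"` are invented, and the sketch assumes the cover text does not already contain any of the four zero-width characters.

```python
# 2-bit mapping from the text above: U+200B=00, U+200C=01, U+200D=10, U+FEFF=11
SYMBOLS = ["\u200b", "\u200c", "\u200d", "\ufeff"]
VALUES = {ch: i for i, ch in enumerate(SYMBOLS)}

def to_symbols(payload: bytes) -> str:
    """Encode each byte as four zero-width characters (2 bits each, high bits first)."""
    out = []
    for byte in payload:
        for shift in (6, 4, 2, 0):
            out.append(SYMBOLS[(byte >> shift) & 0b11])
    return "".join(out)

def embed(cover: str, payload: bytes) -> str:
    """Spread the encoded payload across the gaps between words of the cover text."""
    words = cover.split(" ")
    gaps = len(words) - 1
    if gaps < 1:
        raise ValueError("cover text needs at least two words")
    symbols = to_symbols(payload)
    per_gap = -(-len(symbols) // gaps)   # ceiling division
    pieces = [words[0]]
    for i, word in enumerate(words[1:]):
        chunk = symbols[i * per_gap:(i + 1) * per_gap]
        pieces.append(" " + chunk + word)
    return "".join(pieces)

def extract(text: str) -> bytes:
    """Collect the zero-width characters in order and rebuild the bytes."""
    values = [VALUES[ch] for ch in text if ch in VALUES]
    out = bytearray()
    for i in range(0, len(values) - len(values) % 4, 4):
        a, b, c, d = values[i:i + 4]
        out.append((a << 6) | (b << 4) | (c << 2) | d)
    return bytes(out)

cover = "Quarterly results are confidential"
marked = embed(cover, b"ID7")
print(marked == cover)    # False, although both display identically
print(extract(marked))    # b'ID7'
```

Because the payload survives copy and paste but not retyping, stripping these four characters from the text is enough to remove this particular kind of mark.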
Detecting and Removing Invisible Characters
To correctly process text infiltrated by invisible characters, you need to know detection and removal methods.
| Method | Target | Code Example |
|---|---|---|
| JavaScript regex | Major zero-width characters | str.replace(/[\u200B-\u200F\u2028-\u202F\uFEFF]/g, '') |
| Python regex | Same as above | re.sub(r'[\u200b-\u200f\u2028-\u202f\ufeff]', '', text) |
| Text editor | All invisible characters | VS Code: Enable "Render Control Characters" (editor.renderControlCharacters) |
| Command line | Invisible characters in files | cat -v filename or xxd filename |
| PHP | Major zero-width characters | preg_replace('/[\x{200B}-\x{200F}\x{FEFF}]/u', '', $str) |
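The JavaScript regex /[\u200B-\u200F\u2028-\u202F\uFEFF]/g removes the most common zero-width and direction control characters at once. Note that the range U+2028-U+202F also covers the line and paragraph separators (U+2028, U+2029) and the narrow no-break space (U+202F), which has visible width, while the word joiner (U+2060) and soft hyphen (U+00AD) are not covered. Applying this filter to form input values before sending them to the server prevents character count discrepancies caused by invisible characters.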
However, unconditionally removing all invisible characters is dangerous. ZWJ is necessary for emoji composition, and removing it will decompose emoji. ZWNJ is essential for correct rendering in Persian and Hindi. Invisible character removal must be done carefully with understanding of purpose and context.
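One way to act on this caution is to make joiner removal an explicit choice. The following Python sketch is only one possible policy: the function name `strip_invisibles`, the `keep_joiners` flag, and the decision to always drop LRM and RLM are assumptions made for illustration, and a real application may need different rules per field.

```python
# Characters removed in every mode: ZWSP, LRM, RLM, soft hyphen, word joiner, BOM.
ALWAYS_STRIP = "\u200b\u200e\u200f\u00ad\u2060\ufeff"
# ZWNJ and ZWJ carry meaning in emoji sequences and in some scripts.
JOINERS = "\u200c\u200d"

def strip_invisibles(text, keep_joiners=True):
    """Remove invisible characters, optionally preserving ZWJ and ZWNJ."""
    chars = ALWAYS_STRIP if keep_joiners else ALWAYS_STRIP + JOINERS
    return text.translate({ord(ch): None for ch in chars})

technologist = "\U0001F469\u200d\U0001F4BB"       # woman + ZWJ + laptop
dirty = "caf\u200be " + technologist + "\ufeff"

print(len(strip_invisibles(dirty)))                      # 8: ZWSP and BOM removed, ZWJ kept
print(len(strip_invisibles(dirty, keep_joiners=False)))  # 7: the emoji splits into two
```

Keeping the joiners by default means that the destructive behavior has to be requested deliberately.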
Invisible Character Handling by Programming Language
Different programming languages handle invisible characters in source code differently. Some languages ignore them, while others detect them as errors.
| Language | ZWSP in Source Code | ZWSP in String Literals | Detection Tool |
|---|---|---|---|
| JavaScript | SyntaxError outside strings and comments | Retained as part of string | ESLint's no-irregular-whitespace |
| Python | SyntaxError | Retained as part of string | pylint, flake8 |
| Java | Compile error | Retained as part of string | Checkstyle |
| Go | Compile error | Retained as part of string | go vet |
| Rust | Compile error (with warning) | Retained as part of string | clippy |
| C/C++ | Compiler-dependent | Retained as part of string | clang-tidy |
JavaScript requires special attention. ZWSP (U+200B) is not treated as whitespace in the ECMAScript specification, and it is not a valid identifier character either, so outside strings and comments it produces a SyntaxError. ZWNJ (U+200C) and ZWJ (U+200D), on the other hand, are allowed inside identifiers. This means var hello and var he\u200Dllo are treated as different variables. They look like the same "hello," but they're different variables. Cases where this caused bugs have actually been reported.
When considering variable and function name length guidelines, the risk of invisible character infiltration should be kept in mind. Since they cannot be detected visually in code review, it's important to set up automatic detection through linters and editor settings.
Real-World Trouble Cases
Here are some actual troubles caused by invisible characters.
- GitHub code review: ZWSP infiltrated pull request code, causing string comparison failures in production while tests passed. It couldn't be detected in diffs and was only discovered with a binary editor
- E-commerce site search: Product names contained zero-width characters, so users couldn't find products by name search, impacting sales
- Database duplicate check: Identical-looking email addresses were judged as "no duplicates," creating multiple accounts for the same user. The cause was ZWSP in the email address
- PDF copy-paste: Copying text from PDF files introduced massive amounts of soft hyphens (U+00AD), causing form input validation to judge character count exceeded
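The duplicate-check and search cases above share a root cause: strings are compared exactly as they were entered. A hedged sketch of one mitigation is to compare on a derived key instead. The function name `comparison_key` is hypothetical, and the choice to apply NFKC normalization and case folding on top of removing format characters is an assumption that may not suit every field (for example, the local part of an email address is technically case-sensitive).

```python
import unicodedata

def comparison_key(value):
    """Build a key for duplicate checks: NFKC-normalize, drop format (Cf) characters, casefold."""
    normalized = unicodedata.normalize("NFKC", value)
    # Category Cf covers ZWSP, ZWNJ, ZWJ, LRM, RLM, BOM, soft hyphen and word joiner.
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return visible.casefold()

a = "user@example.com"
b = "user\u200b@example.com"
print(a == b)                                   # False
print(comparison_key(a) == comparison_key(b))   # True
```

Storing the key alongside the original value lets a uniqueness constraint catch look-alike entries without altering what the user typed.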
Character Count Tools and Invisible Characters
How character count tools handle invisible characters varies by tool. Some ignore invisible characters when counting, while others count them as-is. Without understanding Unicode basics, you can't identify why character counts differ between tools.
If you want to accurately count text characters, we recommend first checking for invisible characters and removing them if necessary before counting. Simply knowing that "invisible characters" exist can prevent many character count-related troubles.
Invisible Characters and Security - Unseen Threats
Invisible characters can also become security threats. The "Trojan Source" attack published in 2021 exploits direction control characters (U+202A, U+202B, U+202C, U+202D, U+202E, U+2066, U+2067, U+2068, U+2069) to create a gap between how source code looks and how it actually executes.
| Attack Method | Invisible Characters Used | Impact | Countermeasure |
|---|---|---|---|
| Trojan Source | Direction control chars (U+202A-U+202E, U+2066-U+2069) | Malicious logic undetectable in code review | Enable compiler warnings |
| Homograph attack | Visually identical different chars (U+0430 vs U+0061) | Phishing URL spoofing | Check Punycode display |
| ZWSP injection | U+200B | Bypassing input validation | Server-side invisible character removal |
| BOM injection | U+FEFF | File parser malfunction | Automatic BOM removal processing |
In a typical Trojan Source attack, the code appears to human eyes as "execute processing only when access is permitted," but because direction control characters are embedded, the access check is actually disabled.
This attack is particularly dangerous because it neutralizes code review - the human-eye verification process. Countermeasures include enabling compiler and linter settings that warn about direction control character usage, and incorporating invisible character detection steps into CI/CD pipelines.
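A detection step of the kind mentioned above could look roughly like the following Python sketch. It only flags the nine direction control characters listed earlier in this section. The function name and the sample snippet are hypothetical, and the abbreviations are the conventional short names for these characters.

```python
# Bidirectional control characters named in the Trojan Source description above.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def find_bidi_controls(source):
    """Return (line, column, abbreviation) for each bidi control character, 1-based."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, BIDI_CONTROLS[ch]))
    return hits

# Hypothetical snippet: an RLO/PDF pair hidden inside a comment.
suspicious = 'if access_level != "user":  # \u202e check \u202c\n    grant()\n'
for line_no, col, name in find_bidi_controls(suspicious):
    print(f"line {line_no}, column {col}: {name}")
```

In a CI pipeline, a wrapper script could run this check over changed files and exit with a non-zero status when any hit is found.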
As mentioned in the Git commit message writing article, utilizing linters is essential for code quality management. Detecting invisible characters is one of the important roles of linters.