The World of Invisible Characters - Troubles Caused by Zero-Width and Invisible Characters

Your string should be 10 characters, but the system insists it's 12. No matter how hard you look, you can't see any extra characters. The culprit is "zero-width characters" - invisible characters that don't appear on screen at all, yet undeniably exist as data. This article explains the types and purposes of invisible characters defined in Unicode, their impact on character counting, and real-world trouble cases with solutions.

Invisible Character Catalog - Characters That Exist Without Being Seen

Unicode defines multiple characters that are not displayed on screen (or have zero width). These are not "bugs" - they exist for legitimate reasons in text processing.

| Character Name | Code Point | Purpose | Character Count | Display Width |
|---|---|---|---|---|
| Zero Width Space (ZWSP) | U+200B | Specifying line break opportunities | Counted as 1 character | 0 |
| Zero Width Joiner (ZWJ) | U+200D | Joining characters (emoji composition) | Counted as 1 character | 0 |
| Zero Width Non-Joiner (ZWNJ) | U+200C | Preventing character joining | Counted as 1 character | 0 |
| Left-to-Right Mark (LRM) | U+200E | Text direction control | Counted as 1 character | 0 |
| Right-to-Left Mark (RLM) | U+200F | Text direction control | Counted as 1 character | 0 |
| Byte Order Mark (BOM) | U+FEFF | Encoding identification | Usually not counted | 0 |
| Soft Hyphen (SHY) | U+00AD | Specifying hyphenation points | Counted as 1 character | Usually 0 (shown only at line breaks) |
| Word Joiner (WJ) | U+2060 | Specifying no-break positions | Counted as 1 character | 0 |

All of these characters serve legitimate roles in text processing. The problem is that when they unintentionally infiltrate text, they silently throw off character counts.
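As a minimal sketch of the "10 characters that count as 12" situation, the following Python snippet builds a sample string with two hidden ZWSPs and lists every format-category character it contains. The sample text and the helper name `hidden_characters` are assumptions made for illustration.

```python
import unicodedata

# Sample text (hypothetical): "helloworld" with two hidden ZWSPs (U+200B),
# written as escapes so they can be seen in the source.
text = "hello\u200bworld\u200b"


def hidden_characters(value: str):
    """Return (index, code point, name) for each format-category (Cf) character."""
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(value)
        if unicodedata.category(ch) == "Cf"
    ]


print(len(text))  # 12, although only 10 characters are visible
for entry in hidden_characters(text):
    print(entry)
```

Python's `len` counts code points, so each invisible character adds exactly 1 to the total here.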

Zero Width Space (U+200B) - The Most Troublesome Invisible Character

The Zero Width Space (ZWSP) is a character that embeds "you may break the line here" information into text. It's used in languages like Thai and Khmer that don't use spaces between words, allowing browsers to break lines at appropriate positions.

However, ZWSP easily slips into text that is copied and pasted from web pages, where it causes character count mismatches and string comparisons that fail for no visible reason.

Password infiltration is particularly serious. When ZWSP sneaks into a password copied from a website, you get a situation where the password looks correct but login fails. When considering password length and security, the existence of invisible characters cannot be ignored.
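The failure mode can be sketched in a few lines of Python. The password values below are made up, and SHA-256 is used only to show that the digests differ; a real system would use a dedicated password hashing function.

```python
import hashlib

stored_password = "S3cure-Pass"        # hypothetical password on record
pasted_password = "S3cure\u200b-Pass"  # same text with a ZWSP picked up while copying


def sha256_hex(value: str) -> str:
    """Hash the UTF-8 bytes of the string (for illustration only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


print(stored_password == pasted_password)                          # False
print(sha256_hex(stored_password) == sha256_hex(pasted_password))  # False
print(len(stored_password), len(pasted_password))                  # 11 12
```

Both strings render identically, yet every comparison treats them as different values.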

Zero Width Joiner (U+200D) - The Magic Character That Composes Emoji

The Zero Width Joiner (ZWJ) plays the most positive role among invisible characters. As explained in detail in the article on emoji character counting, ZWJ combines multiple emoji to create new ones.

| Displayed Emoji | Components | Code Point Count | Character Count (JavaScript) |
|---|---|---|---|
| 👨‍👩‍👧‍👦 (Family) | 👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦 | 7 | 11 (including surrogate pairs) |
| 👩‍💻 (Woman Technologist) | 👩 + ZWJ + 💻 | 3 | 5 |
| 🏳️‍🌈 (Rainbow Flag) | 🏳️ + ZWJ + 🌈 | 4 | 6 |
| 👨‍🍳 (Man Cook) | 👨 + ZWJ + 🍳 | 3 | 5 |

The family emoji 👨‍👩‍👧‍👦 looks like a single emoji, but internally consists of 4 emoji and 3 ZWJs. JavaScript's .length property returns 11 because it counts UTF-16 code units: each of the 4 emoji takes 2 (a surrogate pair) and each ZWJ takes 1. On social media with character limits, a single emoji like this can consume a large number of characters.
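The two counts can be reproduced in Python as a rough check. `len` counts code points, and the helper `utf16_length` (a name chosen for this sketch) counts UTF-16 code units, which is what JavaScript's .length reports.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


# Sequences written as escapes so every component is visible.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

print(len(family), utf16_length(family))              # 7 11
print(len(rainbow_flag), utf16_length(rainbow_flag))  # 4 6
```

The rainbow flag has 4 code points rather than 3 because the white flag carries a variation selector (U+FE0F) before the ZWJ.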

Direction Control Characters - Mechanisms for Right-to-Left Languages

Arabic and Hebrew are languages written right-to-left (RTL). In text where these languages coexist with English (left-to-right, LTR), invisible characters that control text direction are necessary.

U+200E (Left-to-Right Mark) and U+200F (Right-to-Left Mark) are characters for explicitly specifying text direction. When these unintentionally infiltrate text, they can disrupt display order or throw off character counts.
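Python's standard `unicodedata` module can show how Unicode classifies these two marks. This is only an inspection sketch; the helper name `describe` is an assumption.

```python
import unicodedata


def describe(ch: str):
    """Return (code point, name, general category, bidi class) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


print(describe("\u200e"))  # ('U+200E', 'LEFT-TO-RIGHT MARK', 'Cf', 'L')
print(describe("\u200f"))  # ('U+200F', 'RIGHT-TO-LEFT MARK', 'Cf', 'R')

# A mark that slipped in still counts toward the length.
print(len("abc\u200f"))  # 4
```

Both marks are format characters (category Cf) that behave like an invisible strong left-to-right or right-to-left letter.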

In 2021, a security vulnerability called "Trojan Source" was reported that exploits direction control characters. By embedding direction control characters in source code, code that looks normal to human eyes is interpreted as different logic by the compiler. This vulnerability demonstrated that invisible characters can also pose security risks.

BOM (U+FEFF) - The Invisible Character Lurking at File Beginnings

The Byte Order Mark (BOM) is a character added at the beginning of text files to identify encoding. The UTF-8 BOM is 3 bytes (EF BB BF) and is sometimes added by Windows Notepad when saving files.

Many programs ignore the BOM, but a program that does not expect it treats it as ordinary data at the very start of the file, which causes problems.
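One case that is easy to reproduce is a header field that silently gains an extra character. The sketch below assumes a hypothetical CSV header line saved with a UTF-8 BOM and compares Python's `utf-8` and `utf-8-sig` codecs.

```python
# Bytes of a hypothetical CSV header saved with a UTF-8 BOM (EF BB BF).
raw = b"\xef\xbb\xbfname,age\n"

plain = raw.decode("utf-8")      # keeps the BOM as U+FEFF
clean = raw.decode("utf-8-sig")  # strips a leading BOM if present

first_plain = plain.split(",")[0]
first_clean = clean.split(",")[0]

print(first_plain == "name")  # False: the field is "\ufeffname"
print(len(first_plain))       # 5
print(first_clean == "name")  # True
```

Decoding with `utf-8-sig` is a low-risk default for reading files, since it behaves like `utf-8` when no BOM is present.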

Steganography Using Zero-Width Characters (Watermarking Technology)

Steganography, together with the related practice of digital watermarking, turns the "invisible" property of these characters on its head. By embedding patterns of zero-width characters in text, hidden information can be carried without changing the appearance.

| Method | Characters Used | Purpose | Detection Difficulty |
|---|---|---|---|
| Zero-width character encoding | U+200B, U+200C, U+200D, U+FEFF | Embedding hidden messages in text | High (invisible to the eye) |
| User tracking | Same as above | Identifying leak sources during information breaches | High |
| Copy detection | Same as above | Detecting unauthorized content copying | Medium |

For example, by assigning each of 4 zero-width characters a 2-bit value (U+200B = 00, U+200C = 01, U+200D = 10, U+FEFF = 11) and inserting those characters between the words of a text, binary data can be hidden within it.

This technology is sometimes used by companies to identify the source of confidential document leaks. By embedding different zero-width character patterns for each recipient, when a document leaks externally, the source can be identified.
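The 2-bit mapping described above can be sketched as follows. The function names, the way the hidden characters are distributed across word gaps, and the sample payload are all assumptions made for illustration; the sketch also assumes the cover text contains no zero-width characters of its own.

```python
# 2-bit values 00, 01, 10, 11 in list order, matching the mapping in the text.
ZERO_WIDTH = ["\u200b", "\u200c", "\u200d", "\ufeff"]
LOOKUP = {ch: value for value, ch in enumerate(ZERO_WIDTH)}


def encode_bits(payload: bytes) -> str:
    """Turn each byte into four zero-width characters, high bits first."""
    return "".join(
        ZERO_WIDTH[(byte >> shift) & 0b11]
        for byte in payload
        for shift in (6, 4, 2, 0)
    )


def embed(cover: str, payload: bytes) -> str:
    """Spread the hidden characters over the gaps between words."""
    words = cover.split(" ")
    gaps = len(words) - 1
    if gaps < 1:
        raise ValueError("cover text needs at least two words")
    hidden = encode_bits(payload)
    chunk = -(-len(hidden) // gaps)  # ceiling division
    pieces = []
    for i, word in enumerate(words):
        pieces.append(word)
        if i < gaps:
            pieces.append(hidden[i * chunk:(i + 1) * chunk] + " ")
    return "".join(pieces)


def extract(stego: str) -> bytes:
    """Collect the zero-width characters in order and rebuild the bytes."""
    values = [LOOKUP[ch] for ch in stego if ch in LOOKUP]
    out = bytearray()
    for i in range(0, len(values) - len(values) % 4, 4):
        a, b, c, d = values[i:i + 4]
        out.append(a << 6 | b << 4 | c << 2 | d)
    return bytes(out)


marked = embed("the quick brown fox", b"id7")
print(len("the quick brown fox"), len(marked))  # 19 31
print(extract(marked))                          # b'id7'
```

The marked text looks identical to the cover text when rendered, but it is 12 characters longer and carries a recoverable 3-byte identifier.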

Detecting and Removing Invisible Characters

To correctly process text infiltrated by invisible characters, you need to know detection and removal methods.

| Method | Target | Code Example |
|---|---|---|
| JavaScript regex | Major zero-width characters | `str.replace(/[\u200B-\u200F\u2028-\u202F\uFEFF]/g, '')` |
| Python regex | Same as above | `re.sub(r'[\u200b-\u200f\u2028-\u202f\ufeff]', '', text)` |
| Text editor | All invisible characters | VS Code: Enable "Show control characters" |
| Command line | Invisible characters in files | `cat -v filename` or `xxd filename` |
| PHP | Major zero-width characters | `preg_replace('/[\x{200B}-\x{200F}\x{FEFF}]/u', '', $str)` |

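The JavaScript regex /[\u200B-\u200F\u2028-\u202F\uFEFF]/g removes the most common zero-width and direction control characters in one pass. Be aware that the range \u2028-\u202F is broader than that: it also covers the line separator (U+2028), the paragraph separator (U+2029), and the narrow no-break space (U+202F), which has visible width, so narrow it to \u202A-\u202E if those characters must survive. Applying such a filter to form input values before they are sent to the server helps prevent character count discrepancies caused by invisible characters.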

However, unconditionally removing all invisible characters is dangerous. ZWJ is necessary for emoji composition, and removing it decomposes a composed emoji into its parts. ZWNJ is essential for correct rendering in Persian and Hindi. The range \u200B-\u200F used in the patterns above includes both of them (U+200C and U+200D), so removal must be done carefully, with an understanding of each character's purpose and the context in which the text is used.
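A more conservative filter can be sketched like this. Which characters count as safe to drop is an assumption of this example (ZWSP, Word Joiner, and BOM); ZWJ and ZWNJ are deliberately left alone, and the function name `strip_safe_invisibles` is hypothetical.

```python
import re

# Assumption: only these three are dropped. ZWJ (U+200D) and ZWNJ (U+200C)
# are kept because emoji sequences and some scripts depend on them.
REMOVABLE = re.compile("[\u200b\u2060\ufeff]")


def strip_safe_invisibles(text: str) -> str:
    """Remove ZWSP, Word Joiner, and BOM while keeping ZWJ and ZWNJ."""
    return REMOVABLE.sub("", text)


woman_technologist = "\U0001F469\u200D\U0001F4BB"
dirty = "\ufeffhello\u200b " + woman_technologist

cleaned = strip_safe_invisibles(dirty)
print(len(dirty), len(cleaned))       # 11 9
print(woman_technologist in cleaned)  # True: the ZWJ sequence is intact
```

Whether even these three characters are safe to remove depends on the text: ZWSP, for example, is meaningful in Thai or Khmer content.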

Invisible Character Handling by Programming Language

Different programming languages handle invisible characters in source code differently. Some languages ignore them, while others detect them as errors.

| Language | ZWSP in Source Code | ZWSP in String Literals | Detection Tool |
|---|---|---|---|
| JavaScript | SyntaxError outside strings and comments (ZWNJ and ZWJ, by contrast, are allowed inside identifiers) | Retained as part of string | ESLint's no-irregular-whitespace |
| Python | SyntaxError | Retained as part of string | pylint, flake8 |
| Java | Compile error | Retained as part of string | Checkstyle |
| Go | Compile error | Retained as part of string | go vet |
| Rust | Compile error (with warning) | Retained as part of string | clippy |
| C/C++ | Compiler-dependent | Retained as part of string | clang-tidy |

JavaScript requires special attention. ZWSP (U+200B) is neither whitespace nor a valid identifier character in the ECMAScript specification, so outside strings and comments it produces a syntax error. ZWNJ (U+200C) and ZWJ (U+200D), however, are explicitly allowed inside identifier names. This means var hello and var he\u200Cllo declare different variables. They look like the same "hello," but they're distinct identifiers. Cases where invisible characters in identifiers caused bugs have been reported.

When considering variable and function name length guidelines, the risk of invisible character infiltration should be kept in mind. Since they cannot be detected visually in code review, it's important to set up automatic detection through linters and editor settings.
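As a rough sketch of such automatic detection, the function below scans source text and reports the line and column of every format-category (Cf) character. It is not a replacement for a real linter; the function name and the sample source are hypothetical.

```python
import unicodedata


def find_format_characters(source: str):
    """Return (line, column, code point, name) for each Cf character, 1-based."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                findings.append(
                    (line_no, col_no, f"U+{ord(ch):04X}",
                     unicodedata.name(ch, "UNKNOWN"))
                )
    return findings


# Hypothetical source text with hidden characters inside two names.
sample = "total = 1\nto\u200btal = 2\nna\u200cme = 3\n"

for finding in find_format_characters(sample):
    print(finding)
# (2, 3, 'U+200B', 'ZERO WIDTH SPACE')
# (3, 3, 'U+200C', 'ZERO WIDTH NON-JOINER')
```

Category Cf also covers the soft hyphen, the BOM, and the direction control characters, so a check like this may need an allow-list for files that legitimately contain them.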

Real-World Trouble Cases

Here are some actual troubles caused by invisible characters.

Character Count Tools and Invisible Characters

How character count tools handle invisible characters varies by tool. Some ignore invisible characters when counting, while others count them as-is. Without understanding Unicode basics, you can't identify why character counts differ between tools.

If you want to accurately count text characters, we recommend first checking for invisible characters and removing them if necessary before counting. Simply knowing that "invisible characters" exist can prevent many character count-related troubles.

Invisible Characters and Security - Unseen Threats

Invisible characters can also become security threats. The "Trojan Source" attack published in 2021 exploits direction control characters (U+202A, U+202B, U+202C, U+202D, U+202E, U+2066, U+2067, U+2068, U+2069) to create a gap between how source code looks and how it actually executes.

| Attack Method | Invisible Characters Used | Impact | Countermeasure |
|---|---|---|---|
| Trojan Source | Direction control chars (U+202A-U+202E, U+2066-U+2069) | Malicious logic undetectable in code review | Enable compiler warnings |
| Homograph attack | Visually identical different chars (U+0430 vs U+0061) | Phishing URL spoofing | Check Punycode display |
| ZWSP injection | U+200B | Bypassing input validation | Server-side invisible character removal |
| BOM injection | U+FEFF | File parser malfunction | Automatic BOM removal processing |
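The homograph row can be illustrated with a short Python sketch. The domain name is a made-up example, and Python's built-in `idna` codec follows the older IDNA 2003 rules, so this shows the idea rather than what a browser does exactly.

```python
import unicodedata

latin = "apple.com"
lookalike = "\u0430pple.com"  # first letter is U+0430, not U+0061

print(latin == lookalike)                # False, although both render alike
print(unicodedata.name(lookalike[0]))    # CYRILLIC SMALL LETTER A
print(unicodedata.name(latin[0]))        # LATIN SMALL LETTER A

# The Punycode form exposes the difference: the lookalike label gets an
# "xn--" prefix, while the all-ASCII name is left unchanged.
print(lookalike.encode("idna"))
print(latin.encode("idna"))
```

Displaying the Punycode form is what lets a user or a filter notice that the two names are different.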

Consider a concrete Trojan Source scenario: a piece of code appears to human eyes as "execute processing only when access is permitted," but because direction control characters are embedded in it, the access check is actually disabled.

This attack is particularly dangerous because it neutralizes code review - the human-eye verification process. Countermeasures include enabling compiler and linter settings that warn about direction control character usage, and incorporating invisible character detection steps into CI/CD pipelines.
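A detection step of this kind can be sketched as a small Python check that flags the direction control characters listed earlier. The function name and the sample line are hypothetical; the sample is written with escapes so that the hidden characters are visible here.

```python
# Direction control characters exploited by Trojan Source, with their
# usual abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(source: str):
    """Return (line, column, abbreviation) for each direction control character."""
    return [
        (line_no, col_no, BIDI_CONTROLS[ch])
        for line_no, line in enumerate(source.split("\n"), start=1)
        for col_no, ch in enumerate(line, start=1)
        if ch in BIDI_CONTROLS
    ]


# Hypothetical suspicious line: the string literal contains hidden controls
# and what looks like a trailing comment.
suspicious = 'if access_level != "user\u202e \u2066// Check if admin\u2069 \u2066" {'

findings = find_bidi_controls(suspicious)
print([name for _, _, name in findings])  # ['RLO', 'LRI', 'PDI', 'LRI']

# In a CI step, any finding could fail the build, for example:
# raise SystemExit(1 if findings else 0)
```

Failing the build on any finding is the strictest policy; projects with legitimate right-to-left text may prefer to restrict the check to source code files.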

As mentioned in the Git commit message writing article, utilizing linters is essential for code quality management. Detecting invisible characters is one of the important roles of linters.
