Invisible Characters & Zero-Width Troubles

The World of Invisible Characters - Troubles Caused by Zero-Width and Invisible Characters

Your string should be 10 characters, but the system insists it's 12. No matter how hard you look, you can't see any extra characters. The culprit is "zero-width characters" - invisible characters that don't appear on screen at all, yet undeniably exist as data. This article explains the types and purposes of invisible characters defined in Unicode, their impact on character counting, and real-world trouble cases with solutions.

Invisible Character Catalog - Characters That Exist Without Being Seen

Unicode defines multiple characters that are not displayed on screen (or have zero width). These are not "bugs" - they exist for legitimate reasons in text processing.

Character Name	Code Point	Purpose	Character Count	Display Width
Zero Width Space (ZWSP)	U+200B	Specifying line break opportunities	Counted as 1 character	0
Zero Width Joiner (ZWJ)	U+200D	Joining characters (emoji composition)	Counted as 1 character	0
Zero Width Non-Joiner (ZWNJ)	U+200C	Preventing character joining	Counted as 1 character	0
Left-to-Right Mark (LRM)	U+200E	Text direction control	Counted as 1 character	0
Right-to-Left Mark (RLM)	U+200F	Text direction control	Counted as 1 character	0
Byte Order Mark (BOM)	U+FEFF	Encoding identification	Counted as 1 character unless the decoder strips it	0
Soft Hyphen (SHY)	U+00AD	Specifying hyphenation points	Counted as 1 character	Usually 0 (shown only at line breaks)
Word Joiner (WJ)	U+2060	Specifying no-break positions	Counted as 1 character	0

All of these characters serve legitimate roles in text processing. The problem is that when they unintentionally infiltrate text, they silently throw off character counts.
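As a rough illustration of how such characters can be surfaced, the sketch below lists every character whose Unicode general category is Cf (format), which covers all entries in the catalog above. The function name `describe_invisible` and the sample string are made up for this example, and treating "category Cf" as the definition of "invisible" is a simplifying assumption.

```python
import unicodedata


def describe_invisible(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, Unicode name) for each format character (category Cf)."""
    found = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            found.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return found


# Looks like "hello world" (11 visible characters) but holds a ZWSP and a soft hyphen.
sample = "hello\u200b wor\u00adld"
print(len(sample))  # 13
for entry in describe_invisible(sample):
    print(entry)
```

Printing the code point and name rather than the character itself is deliberate, since the character would be just as invisible in a log as it is on screen.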

Zero Width Space (U+200B) - The Most Troublesome Invisible Character

The Zero Width Space (ZWSP) is a character that embeds "you may break the line here" information into text. It's used in languages like Thai and Khmer that don't use spaces between words, allowing browsers to break lines at appropriate positions.

However, ZWSP easily infiltrates text when copying and pasting from web pages, causing troubles like:

Form input judged as "exceeding character limit" (looks within limit visually)
Password copy-paste failures (ZWSP infiltrates making it a different string)
Search mismatches (identical-looking strings don't match in search)
CSV file data not parsing correctly
Program source code infiltration causing compile errors

Password infiltration is particularly serious. When ZWSP sneaks into a password copied from a website, you get a situation where the password looks correct but login fails. When considering password length and security, the existence of invisible characters cannot be ignored.
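The password case can be reproduced in a few lines. This is only a demonstration with made-up values (`stored` and `pasted` are hypothetical), not a recommendation for how to compare real credentials.

```python
stored = "s3cret-Pass"
pasted = "s3cret\u200b-Pass"  # ZWSP picked up during copy and paste

print(stored == pasted)          # False
print(len(stored), len(pasted))  # 11 12
print(ascii(pasted))             # 's3cret\u200b-Pass'
```

`ascii()` escapes every non-ASCII character, which makes it a quick way to see what a suspicious string really contains.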

Zero Width Joiner (U+200D) - The Magic Character That Composes Emoji

The Zero Width Joiner (ZWJ) plays the most positive role among invisible characters. As explained in detail in emoji character counting, ZWJ combines multiple emoji to create new ones.

Displayed Emoji	Components	Code Point Count	Character Count (JavaScript)
👨‍👩‍👧‍👦 (Family)	👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦	7	11 (including surrogate pairs)
👩‍💻 (Woman Technologist)	👩 + ZWJ + 💻	3	5
🏳️‍🌈 (Rainbow Flag)	🏳️ + ZWJ + 🌈	4	6
👨‍🍳 (Man Cook)	👨 + ZWJ + 🍳	3	5

The family emoji 👨‍👩‍👧‍👦 looks like a single emoji, but internally consists of 4 emoji and 3 ZWJs. JavaScript's .length property returns 11. On social media with character limits, a single emoji like this can consume a large number of characters.
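The counts in the table can be checked with a short sketch. Python's `len()` counts code points, so the helper `utf16_units` (a name chosen for this example) encodes to UTF-16 to reproduce what JavaScript's .length would report.

```python
def utf16_units(text: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's .length returns."""
    return len(text.encode("utf-16-le")) // 2


# Man + ZWJ + Woman + ZWJ + Girl + ZWJ + Boy
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
# White flag + variation selector-16 + ZWJ + rainbow
rainbow_flag = "\U0001F3F3\ufe0f\u200d\U0001F308"

print(len(family), utf16_units(family))              # 7 11
print(len(rainbow_flag), utf16_units(rainbow_flag))  # 4 6
```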

Direction Control Characters - Mechanisms for Right-to-Left Languages

Arabic and Hebrew are languages written right-to-left (RTL). In text where these languages coexist with English (left-to-right, LTR), invisible characters that control text direction are necessary.

U+200E (Left-to-Right Mark) and U+200F (Right-to-Left Mark) are characters for explicitly specifying text direction. When these unintentionally infiltrate text, they can disrupt display order or throw off character counts.

In 2021, a security vulnerability called "Trojan Source" was reported that exploits direction control characters. It relies on the related bidirectional override and isolate characters (such as U+202E and U+2066) rather than on LRM and RLM themselves. By embedding these characters in source code, code that looks normal to human eyes is interpreted as different logic by the compiler. This vulnerability demonstrated that invisible characters can also pose security risks.

BOM (U+FEFF) - The Invisible Character Lurking at File Beginnings

The Byte Order Mark (BOM) is a character added at the beginning of text files to identify encoding. The UTF-8 BOM is 3 bytes (EF BB BF) and is sometimes added by Windows Notepad when saving files.

BOM is ignored by many programs, but causes problems in these cases:

BOM at the beginning of PHP files prevents header() from working (output is judged to have already started)
BOM at the beginning of CSV files prevents the first column name from being recognized correctly
BOM in JSON files may cause parser errors
BOM at the beginning of shell scripts prevents the shebang (#!/bin/bash) from being recognized
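The CSV and JSON cases can be reproduced with Python's standard library. In this sketch the byte strings are made-up sample data; the `utf-8-sig` codec is the standard codec that drops a leading BOM while decoding.

```python
import csv
import io
import json

raw_csv = b"\xef\xbb\xbfname,price\nwidget,10\n"

# Decoding as plain UTF-8 keeps the BOM glued to the first column name.
header_plain = next(csv.reader(io.StringIO(raw_csv.decode("utf-8"))))
# Decoding as utf-8-sig removes it.
header_clean = next(csv.reader(io.StringIO(raw_csv.decode("utf-8-sig"))))
print(ascii(header_plain[0]))  # '\ufeffname'
print(header_clean[0])         # name

raw_json = b'\xef\xbb\xbf{"ok": true}'
try:
    json.loads(raw_json.decode("utf-8"))
    json_error = None
except json.JSONDecodeError as exc:
    json_error = exc  # the parser rejects a string that starts with a BOM
parsed = json.loads(raw_json.decode("utf-8-sig"))
print(json_error is not None, parsed)
```

Decoding with `utf-8-sig` is harmless for files that have no BOM, so it is a reasonable default when the origin of a file is unknown.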

Steganography Using Zero-Width Characters (Watermarking Technology)

Steganography, the practice of hiding information inside other content, takes advantage of the very invisibility that makes these characters troublesome. Digital watermarking is one of its applications. By inserting patterns of zero-width characters into text, hidden information can be embedded without changing the appearance.

Method	Characters Used	Purpose	Detection Difficulty
Zero-width character encoding	U+200B, U+200C, U+200D, U+FEFF	Embedding hidden messages in text	High (invisible to the eye)
User tracking	Same as above	Identifying leak sources during information breaches	High
Copy detection	Same as above	Detecting unauthorized content copying	Medium

For example, by treating 4 types of zero-width characters as 2-bit information (U+200B = 00, U+200C = 01, U+200D = 10, U+FEFF = 11) and inserting zero-width characters between each word in text, binary data can be hidden within it.

This technology is sometimes used by companies to identify the source of confidential document leaks. By embedding different zero-width character patterns for each recipient, when a document leaks externally, the source can be identified.
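The 2-bit mapping described above can be sketched as follows. This is a simplified illustration, not a description of any particular product: the function names are invented for the example, and the whole payload is placed after the first word rather than spread between every word.

```python
# Mapping from the text: U+200B = 00, U+200C = 01, U+200D = 10, U+FEFF = 11
ZERO_WIDTH = ["\u200b", "\u200c", "\u200d", "\ufeff"]
SYMBOL_VALUE = {ch: value for value, ch in enumerate(ZERO_WIDTH)}


def encode_bits(payload: bytes) -> str:
    """Turn each byte into four zero-width characters, most significant bits first."""
    out = []
    for byte in payload:
        for shift in (6, 4, 2, 0):
            out.append(ZERO_WIDTH[(byte >> shift) & 0b11])
    return "".join(out)


def decode_bits(text: str) -> bytes:
    """Collect the zero-width characters in text and rebuild the hidden bytes."""
    symbols = [SYMBOL_VALUE[ch] for ch in text if ch in SYMBOL_VALUE]
    out = bytearray()
    for start in range(0, len(symbols) - len(symbols) % 4, 4):
        byte = 0
        for value in symbols[start:start + 4]:
            byte = (byte << 2) | value
        out.append(byte)
    return bytes(out)


def hide(cover: str, payload: bytes) -> str:
    """Simplification: insert the whole payload right after the first word."""
    head, sep, tail = cover.partition(" ")
    return head + encode_bits(payload) + sep + tail


cover = "Quarterly report draft"
stego = hide(cover, b"id7")
print(len(stego) - len(cover))  # 12
print(decode_bits(stego))       # b'id7'
```

Each byte costs four invisible characters, so even a short recipient ID noticeably inflates the character count, which is one way such marks get discovered.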

Detecting and Removing Invisible Characters

To correctly process text infiltrated by invisible characters, you need to know detection and removal methods.

Method	Target	Code Example
JavaScript regex	Major zero-width characters	`str.replace(/[\u200B-\u200F\u2028-\u202F\uFEFF]/g, '')`
Python regex	Same as above	`re.sub(r'[\u200b-\u200f\u2028-\u202f\ufeff]', '', text)`
Text editor	All invisible characters	VS Code: enable the `editor.renderControlCharacters` and `editor.unicodeHighlight.invisibleCharacters` settings
Command line	Invisible characters in files	`cat -v filename` or `xxd filename`
PHP	Major zero-width characters	`preg_replace('/[\x{200B}-\x{200F}\x{FEFF}]/u', '', $str)`

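The JavaScript regex /[\u200B-\u200F\u2028-\u202F\uFEFF]/g removes the most common zero-width and direction control characters at once. Be aware of what its ranges actually cover: they include ZWNJ and ZWJ (U+200C, U+200D), the line and paragraph separators (U+2028, U+2029), and the narrow no-break space (U+202F), which has visible width, but not the word joiner (U+2060) or the soft hyphen (U+00AD). Applying a filter like this to form input values before sending them to the server helps prevent character count discrepancies caused by invisible characters.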

However, unconditionally removing all invisible characters is dangerous. ZWJ is necessary for emoji composition, and removing it will decompose emoji. ZWNJ is essential for correct rendering in Persian and Hindi. Invisible character removal must be done carefully with understanding of purpose and context.
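One way to act on this caution is an explicit list of characters to remove instead of a broad range. The sketch below is one possible policy, assumed for illustration only: it drops ZWSP, word joiner, BOM, soft hyphen, LRM and RLM, and deliberately leaves ZWJ and ZWNJ alone. Whether the direction marks are safe to drop depends on the text being handled.

```python
import re

# Assumed policy: remove U+200B, U+2060, U+FEFF, U+00AD, U+200E and U+200F;
# keep ZWJ (U+200D) and ZWNJ (U+200C) because emoji and some scripts need them.
REMOVABLE = re.compile("[\u200b\u2060\ufeff\u00ad\u200e\u200f]")


def strip_removable(text: str) -> str:
    """Remove only the characters named in REMOVABLE."""
    return REMOVABLE.sub("", text)


technologist = "\U0001F469\u200d\U0001F4BB"                     # Woman + ZWJ + Laptop
persian = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"    # contains ZWNJ

print(strip_removable("pass\u200bword") == "password")   # True
print(strip_removable(technologist) == technologist)     # True
print(strip_removable(persian) == persian)               # True
```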

Invisible Character Handling by Programming Language

Different programming languages handle invisible characters in source code differently. Some languages ignore them, while others detect them as errors.

Language	ZWSP in Source Code	ZWSP in String Literals	Detection Tool
JavaScript	SyntaxError (but ZWNJ and ZWJ are allowed inside identifiers)	Retained as part of string	ESLint's no-irregular-whitespace
Python	SyntaxError	Retained as part of string	pylint, flake8
Java	Compile error	Retained as part of string	Checkstyle
Go	Compile error	Retained as part of string	go vet
Rust	Compile error	Retained as part of string	clippy
C/C++	Compiler-dependent	Retained as part of string	clang-tidy

JavaScript requires special attention. ZWSP (U+200B) is neither whitespace nor a valid identifier character in the ECMAScript specification, so outside strings and comments it causes a SyntaxError. ZWNJ (U+200C) and ZWJ (U+200D), however, are permitted inside identifiers. This means var hello and var he\u200Cllo are treated as different variables. They look like the same "hello," but they're different variables. Cases where this caused bugs have actually been reported.

When considering variable and function name length guidelines, the risk of invisible character infiltration should be kept in mind. Since they cannot be detected visually in code review, it's important to set up automatic detection through linters and editor settings.
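The Python row of the table can be checked directly with the built-in `compile()`. The helper name `compiles` and the tiny source snippets are invented for this sketch.

```python
def compiles(source: str) -> bool:
    """Return True if Python accepts the source text, False on SyntaxError."""
    try:
        compile(source, "<demo>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles("total = 1"))        # True
print(compiles("to\u200btal = 1"))  # False: ZWSP inside an identifier

# Inside a string literal the ZWSP is kept as ordinary data.
namespace = {}
exec(compile('label = "to\u200btal"', "<demo>", "exec"), namespace)
print(len(namespace["label"]))      # 6
```

A check like this only covers the language it runs in, so a linter or editor setting remains the more general safeguard.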

Real-World Trouble Cases

Here are some actual troubles caused by invisible characters.

GitHub code review: ZWSP infiltrated pull request code, causing string comparison failures in production while tests passed. It couldn't be detected in diffs and was only discovered with a binary editor
E-commerce site search: Product names contained zero-width characters, so users couldn't find products by name search, impacting sales
Database duplicate check: Identical-looking email addresses were judged as "no duplicates," creating multiple accounts for the same user. The cause was ZWSP in the email address
PDF copy-paste: Copying text from PDF files introduced massive amounts of soft hyphens (U+00AD), causing form input validation to judge character count exceeded

Character Count Tools and Invisible Characters

How character count tools handle invisible characters varies by tool. Some ignore invisible characters when counting, while others count them as-is. Without understanding Unicode basics, you can't identify why character counts differ between tools.

If you want to accurately count text characters, we recommend first checking for invisible characters and removing them if necessary before counting. Simply knowing that "invisible characters" exist can prevent many character count-related troubles.

Invisible Characters and Security - Unseen Threats

Invisible characters can also become security threats. The "Trojan Source" attack published in 2021 exploits direction control characters (U+202A, U+202B, U+202C, U+202D, U+202E, U+2066, U+2067, U+2068, U+2069) to create a gap between how source code looks and how it actually executes.

Attack Method	Invisible Characters Used	Impact	Countermeasure
Trojan Source	Direction control chars (U+202A-U+202E, U+2066-U+2069)	Malicious logic undetectable in code review	Enable compiler warnings
Homograph attack	Visually identical different chars (U+0430 vs U+0061)	Phishing URL spoofing	Check Punycode display
ZWSP injection	U+200B	Bypassing input validation	Server-side invisible character removal
BOM injection	U+FEFF	File parser malfunction	Automatic BOM removal processing

In a typical Trojan Source attack, code appears to human eyes as "execute processing only when access is permitted," but embedded direction control characters reorder the displayed text, so in the logic the compiler actually sees, the access check is disabled.

This attack is particularly dangerous because it neutralizes code review - the human-eye verification process. Countermeasures include enabling compiler and linter settings that warn about direction control character usage, and incorporating invisible character detection steps into CI/CD pipelines.
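A detection step for a CI/CD pipeline could look roughly like the sketch below. It flags the nine direction control characters listed earlier in this section. The function name `find_bidi_controls` and the sample source are assumptions for illustration, and a real pipeline would also need to read files and decide on an exit code.

```python
# The override, embedding and isolate characters named in this section.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line number, column, code point) for each direction control character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for column, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, column, f"U+{ord(ch):04X}"))
    return hits


sample = "allowed = check(user)\nif allowed:  # \u202e hidden reordering\n    run()\n"
for hit in find_bidi_controls(sample):
    print(hit)
```

Reporting the line and column instead of echoing the line itself avoids the problem that the reordered text would display misleadingly in the CI log as well.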

As mentioned in the Git commit message writing article, utilizing linters is essential for code quality management. Detecting invisible characters is one of the important roles of linters.
