Invisible Character
A collective term for characters that exist within text data but have no visible representation on screen. Examples include zero-width spaces, bidirectional control characters, and soft hyphens, all of which can affect character counting.
Invisible characters are characters that exist in a text string but produce no visual output on screen. Unlike control characters (such as newlines and tabs), many invisible characters function as "hidden instructions" that influence how text is displayed or processed. In character counting, invisible characters are among the trickiest pitfalls to handle.
Unicode defines a large number of invisible characters. The zero-width space (U+200B) is a space with no width, inserted within long words to indicate permissible line-break points. The zero-width joiner (ZWJ, U+200D) joins adjacent characters and is used to compose emoji sequences (👨+ZWJ+👩+ZWJ+👧 = 👨👩👧). The zero-width non-joiner (ZWNJ, U+200C) does the opposite, preventing ligatures, and is used in Arabic and Persian to control character joining behavior.
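The effect of ZWJ on character counts can be seen in a minimal sketch like the one below. It assumes a JavaScript runtime that provides Intl.Segmenter (for example, a recent Node.js release); the variable names are illustrative only.

```javascript
// Family emoji written with escapes so the two ZWJ characters are visible in the source:
// man (U+1F468) + ZWJ + woman (U+1F469) + ZWJ + girl (U+1F467)
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

// Three different ways of "counting characters" in the same string.
const utf16Units = family.length;      // UTF-16 code units
const codePoints = [...family].length; // Unicode code points

// Grapheme clusters: what a reader perceives as one character.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(family)].length;

console.log(utf16Units); // 8 (three surrogate pairs plus two ZWJ)
console.log(codePoints); // 5 (three emoji plus two ZWJ)
console.log(graphemes);  // 1 (the joined sequence is a single cluster)
```

Which of the three numbers is the "right" count depends on what the counter is meant to measure, such as storage size versus what the user sees.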
Bidirectional control characters are another category of invisible characters. The left-to-right mark (LRM, U+200E) and right-to-left mark (RLM, U+200F) control the writing direction in text that mixes scripts such as Arabic or Hebrew with Latin characters. These characters are invisible but have a significant effect on the order in which text is displayed.
The impact of invisible characters on character counting can be severe. When you copy and paste text from a web page, zero-width spaces or bidirectional control characters may be carried along silently. If two strings look identical but a character counter returns different values, invisible character contamination is the likely cause. For example, "Hello" might count as 7 characters instead of 5 if two zero-width spaces have been inserted between "H" and "e."
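The "Hello" example can be reproduced with a short sketch. The helper revealCodePoints is a hypothetical name introduced here for illustration; it lists each code point in U+XXXX notation so that hidden characters become visible.

```javascript
const clean = "Hello";
const contaminated = "H\u200B\u200Bello"; // two zero-width spaces between "H" and "e"

// Hypothetical helper: list every code point of a string in U+XXXX notation.
function revealCodePoints(text) {
  return [...text].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(clean.length);           // 5
console.log(contaminated.length);    // 7
console.log(clean === contaminated); // false, although both render as "Hello"
console.log(revealCodePoints(contaminated).join(" "));
// U+0048 U+200B U+200B U+0065 U+006C U+006C U+006F
```

Dumping code points this way is a quick check when two strings that look the same fail an equality test or produce different counts.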
From a security standpoint, invisible characters can be weaponized. In 2021, researchers disclosed the "Trojan Source" attack, which inserts bidirectional control characters into source code so that the code appears normal to a human reader but executes different logic. Inserting zero-width characters into usernames or passwords to create strings that look identical but are technically different is another known technique.
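One way to guard against this kind of attack is to scan source files for bidirectional controls before review or merge. The sketch below is an assumption-laden illustration rather than a complete defense: it checks only the embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069), and findBidiControls is a hypothetical helper name.

```javascript
// Bidirectional embedding, override, and isolate controls.
// This set is an assumption for the sketch; LRM and RLM are not included.
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/g;

// Hypothetical helper: report each bidi control with its 1-based line and column.
function findBidiControls(source) {
  const findings = [];
  source.split("\n").forEach((line, index) => {
    for (const match of line.matchAll(BIDI_CONTROLS)) {
      findings.push({
        line: index + 1,
        column: match.index + 1,
        codePoint: "U+" + match[0].codePointAt(0).toString(16).toUpperCase(),
      });
    }
  });
  return findings;
}

// A right-to-left override hidden inside a string literal.
const suspicious = 'const role = "user\u202E";\nconsole.log(role);';
console.log(findBidiControls(suspicious));
// [ { line: 1, column: 19, codePoint: 'U+202E' } ]
```

A check like this could run in a pre-commit hook or CI step, rejecting files with unexpected findings while allowing an explicit exception list for text that legitimately needs these controls.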
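As a countermeasure, sanitize text by removing unwanted invisible characters before processing it. The regular expression /[\u200B-\u200F\u2028-\u202F\u2060-\u206F\uFEFF]/g matches the most common invisible characters, and replacing each match with an empty string removes them. Note that this pattern does not cover the soft hyphen (U+00AD), and that its first range includes both ZWNJ (U+200C) and ZWJ (U+200D). ZWJ is required for emoji composition and ZWNJ for correct joining in Arabic and Persian, so stripping them indiscriminately will break composite emoji and alter joining behavior. Selective removal based on the use case is necessary.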
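Selective removal might look like the following sketch. The function stripInvisible and its keepJoiners option are hypothetical names chosen for illustration; the second pattern is simply the regular expression from the text with U+200C and U+200D left out.

```javascript
// Pattern from the text: matches the full ranges, including ZWNJ (U+200C) and ZWJ (U+200D).
const ALL_INVISIBLE = /[\u200B-\u200F\u2028-\u202F\u2060-\u206F\uFEFF]/g;

// Same set with U+200C and U+200D omitted so that joiners survive.
const INVISIBLE_EXCEPT_JOINERS = /[\u200B\u200E\u200F\u2028-\u202F\u2060-\u206F\uFEFF]/g;

// Hypothetical helper; `keepJoiners` is an illustrative option name.
function stripInvisible(text, { keepJoiners = true } = {}) {
  const pattern = keepJoiners ? INVISIBLE_EXCEPT_JOINERS : ALL_INVISIBLE;
  return text.replace(pattern, "");
}

// "Hi" + zero-width space + space + family emoji (with two ZWJ) + U+FEFF
const sample = "Hi\u200B \u{1F468}\u200D\u{1F469}\u200D\u{1F467}\uFEFF";

const kept = stripInvisible(sample);                             // family emoji stays joined
const stripped = stripInvisible(sample, { keepJoiners: false }); // emoji splits into three

console.log([...kept].length);     // 8 code points: "Hi ", three emoji, two ZWJ
console.log([...stripped].length); // 6 code points: "Hi ", three separate emoji
```

Keeping joiners by default is a design choice suited to user-facing text; for identifiers such as usernames, where look-alike strings are a risk, removing or rejecting them may be the safer default.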