Steganography - The Art of Hiding Secret Messages in Text and Character Count
There's a secret message hidden in this paragraph - if someone told you that, where would you look? The first letter of each sentence? The spacing between certain characters? Or perhaps invisible characters embedded within? Steganography is the art of hiding the very existence of a message. If encryption makes a message "unreadable," steganography makes it "undetectable." And this technique can sometimes be detected through the simple act of character counting.
The Ancient Art of Hiding
The history of steganography dates back at least to ancient Greece. In a famous episode recorded by the 5th-century BC historian Herodotus, Histiaeus shaved a slave's head, tattooed a message on the scalp, and, once the hair had grown back, sent the slave to Aristagoras with instructions to revolt against Persian rule. Herodotus also tells of Demaratus, who warned of the Persian invasion by hiding a message beneath the wax of a writing tablet. The tattoo was the ultimate low-bandwidth channel: delivering one message took weeks.
In medieval Europe, invisible inks (lemon juice, milk, urine, etc.) were widely used for secret communication. During World War II, German spies used microdot technology, which photographically shrank documents to microscopic size and disguised them as periods. This technology, capable of hiding a full page of text in a single period, was called "the enemy's masterpiece of espionage" by FBI Director J. Edgar Hoover.
Text-Based Steganography Techniques
Digital-age text steganography employs several representative techniques.
Acrostics - Messages Hidden in Initial Letters
An acrostic is a technique where connecting the first letters of each line or sentence reveals a secret message. It's the most classical form of text steganography, used in poetry and lyrics since ancient times.
A famous example: in 2009, California Governor Arnold Schwarzenegger sent a veto message to the State Assembly in which the first letters of consecutive lines spelled "FUCK YOU" (or "I FUCK YOU" when read together with the "I" that opens the preceding line), causing a major scandal. His office called it a coincidence. Acrostics can embed messages without increasing character count, but they're easily discovered when intentionally searched for.
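Checking a text for an acrostic is mechanical, which is why the technique is fragile. The following is a minimal sketch, assuming the message sits in the first character of each non-empty line; the function name `acrostic` and the sample text are made up for illustration.

```python
def acrostic(text: str) -> str:
    """Join the first character of every non-empty line."""
    return "".join(line.strip()[0] for line in text.splitlines() if line.strip())


# Illustrative sample text (not taken from any real letter).
poem = """Hidden words can sit in plain sight.
Every line begins with a chosen letter.
Look down the left margin.
Patterns like this are easy to find once you check."""

print(acrostic(poem))  # HELP
```

The same idea extends to first letters of sentences or paragraphs by changing how the text is split.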
Whitespace Manipulation
This technique embeds bit information by manipulating the number of spaces between words. One space represents "0" and two spaces represent "1," encoding binary data. The subtle difference in spacing is hard for humans to notice, but a character counting tool can detect that "there are too many spaces relative to the visible word count."
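The scheme can be sketched in a few lines. This is an illustrative implementation under simple assumptions (words separated only by spaces, one bit per gap); `embed_bits` and `extract_bits` are hypothetical helper names, not part of any library.

```python
import re


def embed_bits(cover: str, bits: str) -> str:
    """Encode bits in word gaps: one space = 0, two spaces = 1."""
    words = cover.split()
    if len(bits) > len(words) - 1:
        raise ValueError("cover text has too few word gaps for the payload")
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        # Gaps beyond the payload stay single spaces (read back as 0).
        gap = "  " if i < len(bits) and bits[i] == "1" else " "
        out.append(gap + word)
    return "".join(out)


def extract_bits(stego: str) -> str:
    """Read each run of spaces back as one bit."""
    return "".join("1" if len(run) == 2 else "0" for run in re.findall(r" +", stego))


cover = "the quick brown fox jumps over the lazy dog"
stego = embed_bits(cover, "10110")

print(extract_bits(stego)[:5])  # 10110
# Detection hint: more spaces than the word count explains.
print(stego.count(" "), "spaces for", len(stego.split()), "words")  # 11 spaces for 9 words
```

Nine words need only eight separating spaces, so a count of eleven is exactly the kind of surplus a character counting tool exposes.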
Zero-Width Character Steganography - The World of Invisible Characters
The most powerful modern text steganography technique uses zero-width invisible characters. Unicode defines several "zero-width characters" that don't display on screen but exist as character data.
| Unicode Code Point | Name | Original Purpose | Steganographic Role |
|---|---|---|---|
| U+200B | Zero Width Space | Specifying line break opportunities | Represents bit "0" |
| U+200C | Zero Width Non-Joiner | Suppressing ligatures | Represents bit "1" |
| U+200D | Zero Width Joiner | Promoting ligatures | Additional bit value |
| U+FEFF | Zero Width No-Break Space (BOM) | Byte order mark | Delimiter character |
Using two zero-width characters, U+200B and U+200C, each inserted character carries 1 bit (0 or 1). Eight of them make 1 byte, enough for one ASCII character, so hiding the 5-character message "Hello" requires 40 zero-width characters.
Distributing these 40 zero-width characters between words in normal text leaves its appearance completely unchanged. However, comparing the visible character count with the actual code point count (or with the UTF-8 byte count, where each of these zero-width characters takes 3 bytes) using a character counting tool reveals an unnatural discrepancy. Understanding Unicode fundamentals helps identify zero-width characters as the cause.
Zero-Width Character Steganography Implementation
Let's look at a concrete embedding process. Consider hiding the secret message "Hi" in the normal text "Good morning."
"H" has ASCII code 72, binary 01001000. "i" is 105, binary 01101001. Converting 0 to U+200B (zero-width space) and 1 to U+200C (zero-width non-joiner) generates a string of 16 zero-width characters.
Inserting these 16 zero-width characters between "Good" and "morning" leaves the appearance as "Good morning," but the actual data contains 16 invisible characters. A reader sees 12 characters, but programmatically counting Unicode code points yields 28. The difference of 16 characters is the hidden message.
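The embedding walked through above can be reproduced as follows. This is a sketch rather than a hardened tool: it assumes an ASCII-only secret, places the whole payload right after the first space, and uses the hypothetical helper names `to_zero_width` and `from_zero_width`.

```python
ZERO, ONE = "\u200b", "\u200c"  # ZWSP encodes bit 0, ZWNJ encodes bit 1


def to_zero_width(secret: str) -> str:
    """Turn each ASCII character into 8 zero-width characters."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("ascii"))
    return "".join(ONE if b == "1" else ZERO for b in bits)


def from_zero_width(text: str) -> str:
    """Collect the zero-width characters in text and decode them as ASCII."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore an incomplete trailing byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8)).decode("ascii")


cover = "Good morning"
stego = cover.replace(" ", " " + to_zero_width("Hi"), 1)

print(len(cover), len(stego))      # 12 28
print(len(stego.encode("utf-8")))  # 60 (each zero-width character is 3 bytes)
print(from_zero_width(stego))      # Hi
```

Printing `stego` in most terminals or editors shows the same "Good morning" as the cover text; only the counts differ.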
More advanced implementations use 3 or more types of zero-width characters for ternary or higher encoding, representing the same message with fewer zero-width characters. Using U+200B, U+200C, and U+200D provides about 1.58 bits per character (log₂3). An arbitrary 8-bit byte then needs 6 zero-width characters, because 5 ternary digits cover only 3⁵ = 243 values, while 7-bit ASCII (128 values) fits in 5.
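The arithmetic for ternary encoding is easy to check in code. The sketch below is illustrative: the symbol order in `TRITS` and the fixed width of 6 are assumptions made here. It shows that a full byte needs 6 ternary digits, while 5 cover only the 128 values of 7-bit ASCII.

```python
import math

TRITS = "\u200b\u200c\u200d"  # assumed mapping: ZWSP = 0, ZWNJ = 1, ZWJ = 2


def digits_needed(values: int, base: int) -> int:
    """Smallest n such that base**n >= values."""
    n = 0
    while base ** n < values:
        n += 1
    return n


def encode_byte_ternary(byte: int, width: int = 6) -> str:
    """Write one byte as `width` ternary digits, most significant first."""
    digits = []
    for _ in range(width):
        byte, d = divmod(byte, 3)
        digits.append(TRITS[d])
    return "".join(reversed(digits))


def decode_byte_ternary(chars: str) -> int:
    """Inverse of encode_byte_ternary."""
    value = 0
    for ch in chars:
        value = value * 3 + TRITS.index(ch)
    return value


print(round(math.log2(3), 3))  # 1.585 bits per symbol
print(digits_needed(256, 3))   # 6 symbols for any 8-bit byte
print(digits_needed(128, 3))   # 5 symbols for 7-bit ASCII
print(decode_byte_ternary(encode_byte_ternary(ord("H"))))  # 72
```

Compared with the binary scheme (8 symbols per byte), ternary saves two symbols per byte at the cost of a slightly more involved decoder.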
Homoglyph Attacks - Different Characters That Look Identical
Homoglyphs are characters that look nearly identical but have different Unicode code points. For example, Latin "a" (U+0061) and Cyrillic "а" (U+0430) appear completely identical in many fonts.
| Latin Character | Code Point | Cyrillic Character | Code Point | Visual Difference |
|---|---|---|---|---|
| a | U+0061 | а | U+0430 | Nearly identical |
| e | U+0065 | е | U+0435 | Nearly identical |
| o | U+006F | о | U+043E | Nearly identical |
| p | U+0070 | р | U+0440 | Nearly identical |
| c | U+0063 | с | U+0441 | Nearly identical |
Homoglyph attacks exploit this property. Replacing the "a" in "apple.com" with Cyrillic "а" produces a phishing URL that looks identical but resolves to a completely different domain. In steganography, replacing specific characters with their homoglyphs embeds bit information: for example, the Latin form can stand for 0 and the Cyrillic form for 1.
Detecting homoglyphs requires checking the Unicode code point of each character. As discussed in password length and security, cases where appearance is identical but byte sequences differ pose serious security risks.
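A code point check of this kind takes only a few lines with Python's standard `unicodedata` module. The sketch below assumes the text is expected to be plain ASCII, so any other character is worth reporting; `non_ascii_report` is a hypothetical helper name.

```python
import unicodedata


def non_ascii_report(text: str) -> list[tuple[int, str, str]]:
    """List (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


spoofed = "\u0430pple.com"  # first letter is Cyrillic, not Latin

print(non_ascii_report(spoofed))      # [(0, 'U+0430', 'CYRILLIC SMALL LETTER A')]
print(non_ascii_report("apple.com"))  # []
```

For text that legitimately contains non-ASCII characters, the report would need a per-language allow list, which this sketch does not attempt.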
As a countermeasure, major browsers restrict IDN (Internationalized Domain Name) display. When a domain label mixes multiple scripts (Latin and Cyrillic, for example), browsers display the domain in Punycode (an ASCII encoding starting with xn--) so that users can spot fake sites. In 2017, Chrome 58 extended this protection to domains written entirely in Cyrillic letters that look like Latin ones.
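The effect of Punycode display can be approximated with Python's built-in `idna` codec (which implements the older IDNA 2003 rules). The mixed-script test below is a rough heuristic chosen for this sketch: it guesses each letter's script from the first word of its Unicode name, whereas real browsers apply the Unicode Script property and a longer list of rules. The function names are hypothetical.

```python
import unicodedata


def to_ascii_domain(domain: str) -> str:
    """Convert a domain to its ASCII (Punycode) form via the built-in IDNA codec."""
    return domain.encode("idna").decode("ascii")


def scripts_in(label: str) -> set[str]:
    """Rough script guess: first word of each letter's Unicode name (heuristic)."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in label if ch.isalpha()}


def is_mixed_script(domain: str) -> bool:
    """True if any dot-separated label contains letters from more than one script."""
    return any(len(scripts_in(label)) > 1 for label in domain.split("."))


spoofed = "\u0430pple.com"  # Cyrillic "а" followed by Latin "pple"

print(is_mixed_script(spoofed))      # True
print(is_mixed_script("apple.com"))  # False
print(to_ascii_domain(spoofed))      # an xn-- form instead of the lookalike
```

Seeing the `xn--` form in the address bar is what tells a user that the domain is not the plain ASCII one it resembles.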
Text Watermarking Technology
Digital watermarking is one application of steganography. Image and video watermarks are widely known, but watermarks can also be embedded in text.
| Watermark Method | Principle | Detection Method | Resilience |
|---|---|---|---|
| Zero-width character embedding | Stores bit information in invisible characters | Character counting | May be lost on copy-paste |
| Synonym substitution | Substitutes synonyms like "big" → "large" | Comparison with original | Resilient to text editing |
| Syntactic transformation | Transforms active → passive voice | Comparison with original | Resilient to text editing |
| Whitespace manipulation | Manipulates space and tab counts | Statistical analysis of whitespace | Lost on format changes |
Synonym substitution watermarking embeds bit information without changing text meaning. For example, substituting "big" with "large" represents 1 bit of information. This method may change character count but is resilient to text editing and copy-paste.
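A toy version of synonym substitution makes the idea concrete. Everything here is an assumption for illustration: the three synonym pairs, the rule that the first word of a pair means 0 and the second means 1, and the simple whitespace tokenization. Production schemes need large dictionaries and context checks so that substitutions stay natural.

```python
# Hypothetical synonym pairs: index 0 encodes bit 0, index 1 encodes bit 1.
PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start")]
LOOKUP = {word: (pair, bit) for pair in PAIRS for bit, word in enumerate(pair)}


def embed(text: str, bits: str) -> str:
    """Rewrite each known word so that its choice of synonym carries one bit."""
    out, remaining = [], list(bits)
    for word in text.split():
        if word in LOOKUP and remaining:
            pair, _ = LOOKUP[word]
            word = pair[int(remaining.pop(0))]
        out.append(word)
    return " ".join(out)


def extract(text: str) -> str:
    """Read one bit from every word that belongs to a synonym pair."""
    return "".join(str(LOOKUP[w][1]) for w in text.split() if w in LOOKUP)


cover = "a big dog made a quick turn to begin the race"
marked = embed(cover, "101")

print(marked)                   # a large dog made a quick turn to start the race
print(extract(marked))          # 101
print(len(cover), len(marked))  # 45 47
```

The mark survives retyping because it lives in the word choices themselves, at the price of a small change in character count.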
Encryption vs. Steganography
Encryption and steganography are often confused but are fundamentally different technologies.
| Property | Encryption | Steganography |
|---|---|---|
| Purpose | Make message content unreadable | Hide message existence |
| Detectability | Ciphertext existence is obvious | Message existence itself is unknown |
| Effect on character count | Similar to original text | Cover text character count may increase |
| Key requirement | Key needed for decryption | May be extractable with technique knowledge |
| Combination | Can be used alone | Most powerful when combined with encryption |
The safest approach is to encrypt a message and then hide it with steganography. Even if the steganography is broken and the message's existence is discovered, the content remains unreadable if encrypted.
Detecting Steganography Through Character Counting
The simplest method for detecting text-based steganography is character counting. The following unnatural discrepancies serve as detection hints.
The first hint is a mismatch between the visible character count and the actual character count (code point count). When zero-width characters are embedded, the number of characters a reader can see is lower than the number a program counts. For example, if text that appears to be 100 characters long actually contains 180 code points, 80 zero-width characters may be embedded.
Unnatural encoded sizes provide a second clue. Pure ASCII text (unaccented Latin letters, digits, and basic punctuation) takes exactly 1 byte per character in UTF-8. If Cyrillic homoglyphs are mixed in, some characters that look like ASCII take 2 bytes each. If text that appears to be pure ASCII has a byte count exceeding its character count, homoglyphs should be suspected.
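Both checks reduce to comparing three numbers for the same text. The sketch below treats every character in Unicode general category `Cf` (format) as invisible, which covers U+200B, U+200C, U+200D, and U+FEFF; that is a simplifying assumption, and the resulting "visible" figure is only an approximation for text with combining marks or emoji sequences. `count_report` is a hypothetical helper name.

```python
import unicodedata


def count_report(text: str) -> dict[str, int]:
    """Compare code points, approximate visible characters, and UTF-8 bytes."""
    code_points = len(text)
    # Category "Cf" (format) includes the zero-width characters discussed here.
    invisible = sum(1 for ch in text if unicodedata.category(ch) == "Cf")
    return {
        "code_points": code_points,
        "visible": code_points - invisible,
        "invisible": invisible,
        "utf8_bytes": len(text.encode("utf-8")),
    }


zero_width_sample = "Good " + "\u200b" * 16 + "morning"
homoglyph_sample = "\u0430pple"  # Cyrillic "а" + Latin "pple"

print(count_report(zero_width_sample))
# {'code_points': 28, 'visible': 12, 'invisible': 16, 'utf8_bytes': 60}
print(count_report(homoglyph_sample))
# {'code_points': 5, 'visible': 5, 'invisible': 0, 'utf8_bytes': 6}
```

In the first sample the code point count exceeds the visible count; in the second the byte count exceeds the code point count even though the text looks like ASCII.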
Twitter's Zero-Width Character Countermeasures
Twitter (now X) uses its own character counting rules, which also limit steganography and character limit circumvention using zero-width characters. Twitter's character counting library "twitter-text" counts certain Unicode characters, including zero-width characters, toward the character count. As a result, embedding many zero-width characters causes visually short text to reach the 280-character limit.
This countermeasure serves not only to prevent steganography but also to ensure fair service usage. Unlimited zero-width character embedding could unfairly consume database storage or cause timeline display issues.
Steganography Detection Tools and Techniques
Several specialized tools and techniques exist for detecting text-based steganography.
| Detection Method | Target | Principle | Limitation |
|---|---|---|---|
| Character count vs. byte count comparison | Zero-width characters | Mismatch between visible count and actual bytes | Difficult to distinguish from legitimate zero-width chars |
| Unicode category analysis | Homoglyphs | Verify Unicode block consistency of characters | Many false positives in multilingual text |
| Statistical analysis | Whitespace manipulation | Verify space distribution matches natural language statistics | Low accuracy for short texts |
| Entropy analysis | General | Verify text information entropy is within natural language range | Difficult against advanced techniques |
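The simplest practical check is to compare three numbers for the same text: the visible character count, the Unicode code point count, and the UTF-8 byte count. Zero-width characters push the code point count above the visible count, and homoglyphs push the byte count of ASCII-looking text above its code point count. Note that copying and pasting usually preserves both, so the comparison should be made against a retyped or normalized version of the text rather than a pasted copy. A character counting tool that displays both "visible character count" and "Unicode code point count" can instantly reveal the discrepancy.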
Social Media and Steganography - Real Cases
In 2016, security researchers reported actual use of zero-width character steganography on Twitter (now X). Tweets appeared as normal text within the character limit (140 characters at the time), but including zero-width characters, the actual data volume was far larger.
Some companies implement "document fingerprinting" by embedding employee IDs in internal documents using zero-width characters. If confidential documents leak externally, analyzing the embedded zero-width characters can identify who leaked them. This technique is harder to detect than traditional watermarks since it doesn't change the document's appearance at all.
Steganography is an important technology for both privacy protection and information security. In countries with strict censorship, activists use steganography to disseminate information. On the other hand, it could also be used by terrorists and criminals to hide communications. The seemingly simple tool of character counting can serve as the first line of defense in detecting such hidden messages.