Steganography: Hiding Messages in Text

Steganography - The Art of Hiding Secret Messages in Text and Character Count

There's a secret message hidden in this paragraph - if someone told you that, where would you look? The first letter of each sentence? The spacing between certain characters? Or perhaps invisible characters embedded within? Steganography is the art of hiding the very existence of a message. If encryption makes a message "unreadable," steganography makes it "undetectable." And this technique can sometimes be detected through the simple act of character counting.

The Ancient Art of Hiding

The history of steganography dates back to 5th century BC Greece. In a famous episode recorded by the historian Herodotus, Histiaeus shaved a slave's head, tattooed a message on the scalp, and, once the hair had grown back, sent the slave to Miletus carrying instructions to revolt against Persian rule. It was the ultimate low-bandwidth channel: delivering a single message took weeks.

In medieval Europe, invisible inks (lemon juice, milk, urine, etc.) were widely used for secret communication. During World War II, German spies used microdots (documents photographically shrunk to microscopic size and disguised as periods). FBI Director J. Edgar Hoover called this technology, capable of hiding a full page of text in a single period, "the enemy's masterpiece of espionage."

Text-Based Steganography Techniques

Digital-age text steganography employs several representative techniques.

Acrostics - Messages Hidden in Initial Letters

An acrostic is a technique where connecting the first letters of each line or sentence reveals a secret message. It's the most classical form of text steganography, used in poetry and lyrics since ancient times.

A famous example: in 2009, California Governor Arnold Schwarzenegger sent a veto message to the State Assembly about a bill by Assemblyman Tom Ammiano. Reading down the first letter of each line in the body of the letter spelled "FUCK YOU," causing a major scandal. Acrostics can embed messages without increasing character count, but they're easily discovered when intentionally searched for.
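Reading an acrostic is easy to automate, which is why the technique offers little protection once someone looks for it. The following is a minimal sketch; the function name and the cover text are made up for this illustration.

```python
def read_acrostic(text: str) -> str:
    """Join the first character of every non-empty line."""
    letters = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            letters.append(stripped[0].upper())
    return "".join(letters)


# Hypothetical cover text written for this example.
cover = """Having a quiet week at the office.
Everyone is away at the conference.
Let me know if you need anything.
Plenty of time to catch up on reading."""

print(read_acrostic(cover))  # HELP
```

The same loop works on sentences or paragraphs if the text is split on those boundaries instead of on line breaks.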

Whitespace Manipulation

This technique embeds bit information by manipulating the number of spaces between words. One space represents "0" and two spaces represent "1," encoding binary data. The subtle difference in spacing is hard for humans to notice, but a character counting tool can detect that "there are too many spaces relative to the visible word count."
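The one-space/two-space scheme can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and unused gaps are simply left single-spaced, so the extractor also returns trailing zeros when the cover text has more gaps than message bits.

```python
import re


def embed_bits(words, bits):
    """Join words with one space for bit '0' and two spaces for bit '1'."""
    gaps = len(words) - 1
    if len(bits) > gaps:
        raise ValueError(f"need at least {len(bits) + 1} words for {len(bits)} bits")
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        bit = bits[i] if i < len(bits) else "0"  # unused gaps stay single-spaced
        out.append("  " if bit == "1" else " ")
        out.append(word)
    return "".join(out)


def extract_bits(text):
    """Read every run of spaces back as a bit."""
    return "".join("1" if len(gap) == 2 else "0" for gap in re.findall(r" +", text))


words = "the quick brown fox jumps".split()
stego = embed_bits(words, "1011")
print(extract_bits(stego))               # 1011
print(len(words) - 1, stego.count(" "))  # 4 7
```

The last line shows the tell-tale sign: five words need only 4 separators, but the text contains 7 spaces.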

Zero-Width Character Steganography - The World of Invisible Characters

The most powerful modern text steganography technique uses zero-width invisible characters. Unicode defines several "zero-width characters" that don't display on screen but exist as character data.

Unicode Code Point	Name	Original Purpose	Steganographic Role
U+200B	Zero Width Space	Specifying line break opportunities	Represents bit "0"
U+200C	Zero Width Non-Joiner	Suppressing ligatures	Represents bit "1"
U+200D	Zero Width Joiner	Promoting ligatures	Additional bit value
U+FEFF	Zero Width No-Break Space (BOM)	Byte order mark	Delimiter character

Using two types of zero-width characters, U+200B and U+200C, you can represent 1 bit with 2 values (0 and 1). Eight zero-width characters make 1 byte, meaning 1 ASCII character. Hiding the 5-character message "Hello" requires 40 zero-width characters.

Distributing these 40 zero-width characters between words in normal text leaves the appearance completely unchanged. However, comparing the "visible character count" with the actual code point count or byte count using a character counting tool reveals an unnatural discrepancy (in UTF-8, each of these zero-width characters occupies 3 bytes). Understanding Unicode fundamentals helps identify that zero-width characters cause this discrepancy.

Zero-Width Character Steganography Implementation

Let's look at a concrete embedding process. Consider hiding the secret message "Hi" in the normal text "Good morning."

"H" has ASCII code 72, binary 01001000. "i" is 105, binary 01101001. Converting 0 to U+200B (zero-width space) and 1 to U+200C (zero-width non-joiner) generates a string of 16 zero-width characters.

Inserting these 16 zero-width characters between "Good" and "morning" leaves the appearance as "Good morning," but the actual data contains 16 invisible characters. A reader sees 12 characters, but counting Unicode code points programmatically yields 28. The difference of 16 characters is the hidden message.
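The "Hi" in "Good morning" example can be reproduced with a short script. This is a minimal sketch under the mapping described above (U+200B for 0, U+200C for 1); the function names are hypothetical, and the hidden run is placed directly after the first word.

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE, stands for bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER, stands for bit 1


def to_zero_width(message: str) -> str:
    """Turn each ASCII character into 8 zero-width characters."""
    bits = "".join(f"{byte:08b}" for byte in message.encode("ascii"))
    return "".join(ONE if bit == "1" else ZERO for bit in bits)


def embed(cover: str, message: str) -> str:
    """Insert the hidden run right after the first word of the cover text."""
    head, sep, tail = cover.partition(" ")
    return head + to_zero_width(message) + sep + tail


def extract(stego: str) -> str:
    """Collect the zero-width characters and decode them back to ASCII."""
    bits = "".join("1" if ch == ONE else "0" for ch in stego if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore an incomplete trailing byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("ascii")


stego = embed("Good morning", "Hi")
visible = [ch for ch in stego if ch not in (ZERO, ONE)]
print(len(visible), len(stego))  # 12 28
print(extract(stego))            # Hi
```

Python's len() counts code points, so the 28 here corresponds to the programmatic count in the text, while the filtered list gives the 12 characters a reader sees.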

More advanced implementations use 3 or more types of zero-width characters for ternary or higher encoding, representing the same message with fewer zero-width characters. Using U+200B, U+200C, and U+200D provides about 1.58 bits per character (log₂3). Five such characters cover 3⁵ = 243 values, enough for a 7-bit ASCII character, while a full 8-bit byte needs six (3⁶ = 729).
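The digit counts for a three-symbol alphabet are easy to check. The sketch below is illustrative: the helper names and the fixed width of six digits per byte are assumptions chosen for this example, not part of any standard scheme.

```python
import math

DIGITS = "\u200b\u200c\u200d"  # ternary digits 0, 1, 2


def digits_needed(values: int, base: int) -> int:
    """Smallest n such that base**n >= values."""
    n, capacity = 0, 1
    while capacity < values:
        capacity *= base
        n += 1
    return n


def byte_to_ternary(value: int, width: int = 6) -> str:
    """Encode one byte as a fixed-width run of ternary zero-width digits."""
    if not 0 <= value < 3 ** width:
        raise ValueError("value does not fit in the requested width")
    digits = []
    for _ in range(width):
        value, digit = divmod(value, 3)
        digits.append(DIGITS[digit])
    return "".join(reversed(digits))


def ternary_to_byte(run: str) -> int:
    """Decode a run produced by byte_to_ternary."""
    value = 0
    for ch in run:
        value = value * 3 + DIGITS.index(ch)
    return value


print(round(math.log2(3), 2))                        # 1.58
print(digits_needed(128, 3), digits_needed(256, 3))  # 5 6
print(ternary_to_byte(byte_to_ternary(ord("H"))))    # 72
```

Compared with the 8 characters per byte of the binary scheme, six ternary digits per byte shrink the hidden run by a quarter.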

Homoglyph Attacks - Different Characters That Look Identical

Homoglyphs are characters that look nearly identical but have different Unicode code points. For example, Latin "a" (U+0061) and Cyrillic "а" (U+0430) appear completely identical in many fonts.

Latin Character	Code Point	Cyrillic Character	Code Point	Visual Difference
a	U+0061	а	U+0430	Nearly identical
e	U+0065	е	U+0435	Nearly identical
o	U+006F	о	U+043E	Nearly identical
p	U+0070	р	U+0440	Nearly identical
c	U+0063	с	U+0441	Nearly identical

Homoglyph attacks exploit this property. In a phishing URL, replacing the "a" in "apple.com" with Cyrillic "а" produces an address that looks identical but resolves to a completely different domain. In steganography, replacing specific characters with homoglyphs embeds bit information: an unchanged letter can stand for 0 and its look-alike for 1.
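Homoglyph embedding can be sketched with the five Latin/Cyrillic pairs from the table. This is an illustration under one assumed convention (an unchanged Latin letter means 0, its Cyrillic look-alike means 1); the function names are hypothetical.

```python
# Latin letters and their Cyrillic look-alikes from the table above.
LOOKALIKE = {
    "a": "\u0430",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
    "c": "\u0441",
}
REVERSE = {cyrillic: latin for latin, cyrillic in LOOKALIKE.items()}


def embed_bits(cover: str, bits: str) -> str:
    """Swap carrier letters for look-alikes wherever the next bit is 1."""
    pending = list(bits)
    out = []
    for ch in cover:
        if ch in LOOKALIKE and pending:
            out.append(LOOKALIKE[ch] if pending.pop(0) == "1" else ch)
        else:
            out.append(ch)
    if pending:
        raise ValueError("cover text has too few substitutable letters")
    return "".join(out)


def extract_bits(stego: str) -> str:
    """Read every carrier letter back as a bit (unused carriers read as 0)."""
    bits = []
    for ch in stego:
        if ch in LOOKALIKE:
            bits.append("0")
        elif ch in REVERSE:
            bits.append("1")
    return "".join(bits)


cover = "a peace accord"
stego = embed_bits(cover, "1010")
print(len(cover), len(stego))    # 14 14
print(extract_bits(stego)[:4])   # 1010
print(len(stego.encode("utf-8")) - len(cover.encode("utf-8")))  # 2
```

The character count is unchanged, but each substituted letter adds one byte in UTF-8, which is the trace that byte-level counting picks up.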

Detecting homoglyphs requires checking the Unicode code point of each character. As discussed in password length and security, cases where appearance is identical but byte sequences differ pose serious security risks.
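A code point check of this kind can be written with Python's standard unicodedata module. The sketch below uses the first word of each character's Unicode name as a rough stand-in for its script; that is a heuristic assumed for this example, not a proper script lookup.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Approximate the scripts in a string from the Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Names look like "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def non_ascii_letters(text: str) -> list:
    """List (index, code point, name) for every letter outside ASCII."""
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(text)
        if ch.isalpha() and ord(ch) > 127
    ]


suspicious = "\u0430pple"  # first letter is Cyrillic, the rest is Latin
print(sorted(scripts_used(suspicious)))  # ['CYRILLIC', 'LATIN']
print(non_ascii_letters(suspicious))     # [(0, 'U+0430', 'CYRILLIC SMALL LETTER A')]
```

A word that mixes two scripts is a strong hint in English text, though genuinely multilingual documents will trigger the same report.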

As a countermeasure, major browsers restrict IDN (Internationalized Domain Name) display. When a domain name mixes multiple scripts (Latin and Cyrillic, for example), browsers display it in Punycode (an ASCII encoding whose labels start with xn--) so that users can spot fake sites. In 2017, Chrome 58 extended this protection to domains written entirely in look-alike Cyrillic letters.
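The Punycode form of a spoofed name can be inspected with Python's built-in idna codec. This is only a sketch of the encoding step: the codec follows the older IDNA 2003 rules, and it does not reproduce any browser's display policy.

```python
# "apple.com" with the first letter replaced by CYRILLIC SMALL LETTER A.
spoofed = "\u0430pple.com"
genuine = "apple.com"

# Labels containing non-ASCII characters are rewritten with an "xn--" prefix;
# pure ASCII labels pass through unchanged.
spoofed_ascii = spoofed.encode("idna").decode("ascii")
genuine_ascii = genuine.encode("idna").decode("ascii")

print(spoofed_ascii)
print(genuine_ascii)
print(spoofed_ascii == genuine_ascii)  # False
```

Once the name is shown in its ASCII form, the two domains no longer look alike.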

Text Watermarking Technology

As an application of steganography, text digital watermarking technology exists. While image and video watermarks are widely known, techniques for embedding watermarks in text also exist.

Watermark Method	Principle	Detection Method	Resilience
Zero-width character embedding	Stores bit information in invisible characters	Character counting	May be lost on copy-paste
Synonym substitution	Substitutes synonyms like "big" → "large"	Comparison with original	Resilient to text editing
Syntactic transformation	Transforms active → passive voice	Comparison with original	Resilient to text editing
Whitespace manipulation	Manipulates space and tab counts	Statistical analysis of whitespace	Lost on format changes

Synonym substitution watermarking embeds bit information without changing text meaning. For example, substituting "big" with "large" represents 1 bit of information. This method may change character count but is resilient to text editing and copy-paste.
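Synonym substitution can be sketched with a small lookup table. The synonym pairs and function names below are hypothetical, and the sketch only handles lowercase words separated by single spaces; a real system would need a vetted dictionary and context checks so that the substitutions do not change the meaning.

```python
# Hypothetical synonym pairs: the first form encodes 0, the second encodes 1.
PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start")]

PAIR_OF = {}
for zero_form, one_form in PAIRS:
    PAIR_OF[zero_form] = (zero_form, one_form)
    PAIR_OF[one_form] = (zero_form, one_form)


def embed_watermark(text: str, bits: str) -> str:
    """Choose the synonym form that matches each bit, in reading order."""
    pending = list(bits)
    out = []
    for word in text.split(" "):
        if word in PAIR_OF and pending:
            out.append(PAIR_OF[word][int(pending.pop(0))])
        else:
            out.append(word)
    if pending:
        raise ValueError("text has too few substitutable words")
    return " ".join(out)


def read_watermark(text: str) -> str:
    """Recover one bit from every word that belongs to a synonym pair."""
    return "".join(
        str(PAIR_OF[word].index(word)) for word in text.split(" ") if word in PAIR_OF
    )


marked = embed_watermark("we begin with a big and quick test", "101")
print(marked)                  # we start with a big and fast test
print(read_watermark(marked))  # 101
```

Because the bits live in ordinary word choices, they survive copy-paste and reformatting, unlike zero-width characters.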

Encryption vs. Steganography

Encryption and steganography are often confused but are fundamentally different technologies.

Property	Encryption	Steganography
Purpose	Make message content unreadable	Hide message existence
Detectability	Ciphertext existence is obvious	Message existence itself is unknown
Effect on character count	Similar to original text	Cover text character count may increase
Key requirement	Key needed for decryption	May be extractable with technique knowledge
Combination	Can be used alone	Most powerful when combined with encryption

The safest approach is to encrypt a message and then hide it with steganography. Even if the steganography is broken and the message's existence is discovered, the content remains unreadable if encrypted.

Detecting Steganography Through Character Counting

The simplest method for detecting text-based steganography is character counting. The following unnatural discrepancies serve as detection hints.

The first hint is a mismatch between the visible character count and the actual character count (code point count). When zero-width characters are embedded, the number of characters a reader sees is smaller than the number a program counts. For example, if text that appears to be 100 characters actually contains 180 code points, 80 zero-width characters may be embedded.

The second hint is an unnatural encoded size. Pure ASCII text (unaccented Latin letters, digits, and basic punctuation) takes exactly 1 byte per character in UTF-8. If Cyrillic homoglyphs are mixed in, some characters that look like ASCII take 2 bytes each. When text that appears to be entirely ASCII has a UTF-8 byte count greater than its character count, homoglyphs or other hidden characters should be suspected.
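Both hints can be checked with three counts. In the sketch below, "visible" is approximated as every code point outside the Unicode format category (Cf), which covers the zero-width characters listed earlier; that approximation and the function name are assumptions for this example.

```python
import unicodedata


def count_report(text: str) -> dict:
    """Return visible, code point, and UTF-8 byte counts for a string."""
    visible = sum(1 for ch in text if unicodedata.category(ch) != "Cf")
    return {
        "visible": visible,
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }


hidden = "Good" + "\u200b" * 16 + " morning"  # 16 zero-width spaces
lookalike = "\u0430pple"                      # Cyrillic first letter

print(count_report("Good morning"))
# {'visible': 12, 'code_points': 12, 'utf8_bytes': 12}
print(count_report(hidden))
# {'visible': 12, 'code_points': 28, 'utf8_bytes': 60}
print(count_report(lookalike))
# {'visible': 5, 'code_points': 5, 'utf8_bytes': 6}
```

Zero-width characters raise the code point count above the visible count, while homoglyphs leave both unchanged and only show up in the byte count.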

Twitter's Zero-Width Character Countermeasures

Twitter (now X) adopted unique character counting rules to prevent steganography and character limit circumvention using zero-width characters. Twitter's character counting library "twitter-text" counts certain Unicode characters including zero-width characters toward the character count. This means embedding many zero-width characters causes visually short text to reach the 280-character limit.

This countermeasure serves not only to prevent steganography but also to ensure fair service usage. Unlimited zero-width character embedding could unfairly consume database storage or cause timeline display issues.

Steganography Detection Tools and Techniques

Several specialized tools and techniques exist for detecting text-based steganography.

Detection Method	Target	Principle	Limitation
Character count vs. byte count comparison	Zero-width characters	Mismatch between visible count and actual bytes	Difficult to distinguish from legitimate zero-width chars
Unicode category analysis	Homoglyphs	Verify Unicode block consistency of characters	Many false positives in multilingual text
Statistical analysis	Whitespace manipulation	Verify space distribution matches natural language statistics	Low accuracy for short texts
Entropy analysis	General	Verify text information entropy is within natural language range	Difficult against advanced techniques

The simplest practical check is to compare the text with a copy reduced to its visible characters, for example by retyping it by hand or by filtering out format characters, and then compare byte counts. If zero-width characters or homoglyphs are present, the byte counts will differ. A character counting tool that displays both "visible character count" and "Unicode code point count" reveals zero-width characters instantly; homoglyphs leave the code point count unchanged and show up in the byte count instead.
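A small scanner can report where invisible characters sit and what they are, which helps separate suspicious runs from legitimate uses such as joiners inside emoji sequences. This is a sketch: it treats every character in the Unicode format category (Cf) as invisible, and the function names are hypothetical.

```python
import unicodedata


def find_invisible(text: str) -> list:
    """List (index, code point, name) for every format (Cf) character."""
    hits = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            hits.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return hits


def strip_invisible(text: str) -> str:
    """Return a copy of the text without format characters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


sample = "Good\u200b\u200c morning"
for hit in find_invisible(sample):
    print(hit)
# (4, 'U+200B', 'ZERO WIDTH SPACE')
# (5, 'U+200C', 'ZERO WIDTH NON-JOINER')

before = len(sample.encode("utf-8"))
after = len(strip_invisible(sample).encode("utf-8"))
print(before, after)  # 18 12
```

A long run of such characters between two ordinary words is far more suspicious than a single joiner inside a word.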

Social Media and Steganography - Real Cases

In 2016, security researchers reported actual use of zero-width character steganography on Twitter (now X). Tweets appeared as normal text within the character limit, but including zero-width characters, the actual data volume was far larger.

Some companies implement "document fingerprinting" by embedding employee IDs in internal documents using zero-width characters. If confidential documents leak externally, analyzing the embedded zero-width characters can identify who leaked them. This technique is harder to detect than traditional watermarks since it doesn't change the document's appearance at all.
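Document fingerprinting of this kind can be sketched as follows. Everything here is an assumption for illustration: the 16-bit ID width, the placement after the first word, and the function names. A real deployment would likely repeat or scatter the mark so that it survives partial copying.

```python
from typing import Optional

ZERO = "\u200b"  # ZERO WIDTH SPACE, stands for bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER, stands for bit 1


def fingerprint(document: str, employee_id: int, width: int = 16) -> str:
    """Hide a fixed-width numeric ID after the first word of the document."""
    if not 0 <= employee_id < 2 ** width:
        raise ValueError("employee_id does not fit in the chosen width")
    bits = f"{employee_id:0{width}b}"
    mark = "".join(ONE if bit == "1" else ZERO for bit in bits)
    head, sep, tail = document.partition(" ")
    return head + mark + sep + tail


def recover_id(document: str) -> Optional[int]:
    """Read the hidden ID back, or return None if no mark is present."""
    bits = "".join("1" if ch == ONE else "0" for ch in document if ch in (ZERO, ONE))
    return int(bits, 2) if bits else None


original = "Quarterly results are confidential."
marked = fingerprint(original, 4021)  # 4021 is a made-up employee ID
print(len(original), len(marked))     # 35 51
print(recover_id(marked))             # 4021
print(recover_id(original))           # None
```

Each recipient gets a copy with a different ID, so a leaked copy points back to the person it was issued to.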

Steganography is an important technology for both privacy protection and information security. In countries with strict censorship, activists use steganography to disseminate information. On the other hand, it could also be used by terrorists and criminals to hide communications. The seemingly simple tool of character counting can serve as the first line of defense in detecting such hidden messages.
