Entropy (Information Content)
A measure of uncertainty in information theory. Higher entropy in text means it is harder to predict and more difficult to compress; lower entropy indicates greater redundancy and easier compression.
Entropy is a core concept in information theory, introduced by Claude Shannon in 1948. When applied to text, entropy quantifies "how unpredictable the next character is," measured in bits per character. Text with high entropy carries dense information, while text with low entropy is highly redundant.
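Formally, if each symbol x appears with probability p(x), the Shannon entropy is H = -Σ p(x) log2(p(x)), measured in bits per symbol. Every figure quoted below is an instance of this formula.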
The entropy of English text is estimated at roughly 1.0 to 1.5 bits per character. If all 26 letters appeared with equal probability, the entropy would be log2(26), approximately 4.7 bits per character. In practice, "e" is the most frequent letter while "z" barely appears, and common patterns like "th," "ing," and "tion" further reduce the effective entropy.
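As a rough illustration, a unigram estimate of entropy can be computed directly from character frequencies. The sketch below (Python; the sample string is an arbitrary choice) counts characters and applies the formula above. Note that a unigram model ignores patterns like "th" and "ing," so it overestimates the entropy of running English.

```python
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    """Bits per character, estimated from the empirical
    frequency of each character in `text`."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
print(f"sample estimate:    {unigram_entropy(sample):.2f} bits/char")
print(f"uniform 26 letters: {math.log2(26):.2f} bits/char")  # 4.70
# The sample estimate sits well above the ~1.0-1.5 bits/char of
# running English because a unigram model sees no context at all.
```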
Japanese text is generally considered to have higher per-character entropy than English. With thousands of kanji in active use plus hiragana, katakana, and alphanumeric characters, the next character is drawn from a far larger and harder-to-predict set. Each Japanese character also carries more information than a single English letter (a single kanji often corresponds to a whole English word or morpheme), so the same content typically requires fewer characters in Japanese than in English.
Entropy and text compression are directly linked. Shannon's source coding theorem states that the entropy rate of a text sets the limit for lossless compression: text with an entropy of 1.5 bits per character cannot, on average, be compressed below 1.5 bits per character. Compared with the 8 bits per character of plain ASCII storage, that means such text can theoretically shrink to about 19% of its original size, a reduction of roughly 81%. Modern algorithms like gzip and Brotli come close to this limit on ordinary text.
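The link between redundancy and compressibility is easy to observe with Python's standard-library zlib module. In the sketch below, the repeated sentence is an arbitrary example; because repetition drives the true entropy rate far below any single-character estimate, DEFLATE-style compression packs it into a small fraction of a bit per character.

```python
import zlib

# Highly repetitive text: its entropy rate is close to zero,
# so DEFLATE (the algorithm behind gzip and zlib) compresses it
# far below the 8 bits per character of plain ASCII storage.
text = "the cat sat on the mat. " * 200
raw = text.encode("ascii")
packed = zlib.compress(raw, 9)

print(f"raw:        {len(raw)} bytes")
print(f"compressed: {len(packed)} bytes")
print(f"bits per character after compression: {8 * len(packed) / len(raw):.2f}")
```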
Entropy is also used to evaluate password strength. A uniformly random 8-character lowercase password has an entropy of log2(26^8), roughly 37.6 bits. Drawing from uppercase, lowercase, digits, and symbols (95 possible characters) raises this to log2(95^8), about 52.6 bits. NIST guidance has recommended a minimum of around 30 bits of entropy for online service passwords. Increasing the password length raises entropy more effectively than expanding the character set, because entropy grows linearly with length but only logarithmically with alphabet size.
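The password figures above all follow from H = length × log2(alphabet size), since a uniformly random choice among N^L possible passwords carries log2(N^L) bits. A minimal sketch (the lengths and alphabet sizes are just the examples from this entry):

```python
import math

def password_entropy_bits(length: int, alphabet_size: int) -> float:
    """Entropy of a password chosen uniformly at random:
    log2(alphabet_size ** length) = length * log2(alphabet_size)."""
    return length * math.log2(alphabet_size)

print(f"8 lowercase:      {password_entropy_bits(8, 26):.1f} bits")   # 37.6
print(f"8 full printable: {password_entropy_bits(8, 95):.1f} bits")   # 52.6
print(f"12 lowercase:     {password_entropy_bits(12, 26):.1f} bits")  # 56.4
# Four extra lowercase characters beat the jump from 26 to 95
# symbols: entropy is linear in length, logarithmic in alphabet size.
```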
In the context of character counting, entropy serves as a theoretical indicator of "how much information can be conveyed within a given character count." A 280-character tweet filled with formulaic greetings (low entropy) carries far less information than one packed with technical analysis (high entropy). To maximize the information delivered within a character limit, eliminate redundant phrasing, which raises the entropy of the text.
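To make the tweet comparison concrete, the unigram estimator from earlier can be applied to two invented strings (both are hypothetical examples, not real tweets); the repetitive greeting scores noticeably lower per character.

```python
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

greeting = "happy new year! happy new year! happy new year to everyone!"
analysis = "Entropy caps lossless compression; gzip nears 1.5 bits/char on prose."

print(f"formulaic greeting: {unigram_entropy(greeting):.2f} bits/char")
print(f"dense analysis:     {unigram_entropy(analysis):.2f} bits/char")
# The unigram estimate actually understates the gap, since it
# misses the phrase-level repetition in the greeting entirely.
```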