Morse Code and Character Efficiency - Why "E" Is One Dot and "Q" Takes Four Symbols
In 1838, Samuel Morse walked into a New York print shop and began counting the contents of the type cases. He wanted to find out which letters had the most type prepared - in other words, which letters were used most frequently in English. This painstaking investigation gave birth to the idea of "optimizing code length based on frequency" - a precursor to information theory: assign short codes to frequently used characters and long codes to rarely used ones. The principle was formalized mathematically more than a century later, when Claude Shannon founded information theory in 1948, and turned into a concrete algorithm in 1952 with Huffman coding.
Code Design Born from the Type Case
The type inventory Morse observed at the print shop directly reflected letter frequency in English. The type case had the most "E" type pieces, while "Z" and "Q" had very few. Based on these observations, Morse assigned short codes to frequent letters and long codes to rare ones.
| Letter | Morse Code | Code Length (dots and dashes) | English Frequency |
|---|---|---|---|
| E | . | 1 | About 12.7% |
| T | - | 1 | About 9.1% |
| A | .- | 2 | About 8.2% |
| I | .. | 2 | About 7.0% |
| N | -. | 2 | About 6.7% |
| S | ... | 3 | About 6.3% |
| H | .... | 4 | About 6.1% |
| Q | --.- | 4 | About 0.1% |
| Z | --.. | 4 | About 0.07% |
"E" and "T" are each represented by a single symbol (one dot, one dash). Making these two most frequent letters in English the shortest dramatically reduced telegraph transmission time. Meanwhile, "Q" and "Z" require 4 symbols, but since their frequency is below 0.1%, they barely affect overall transmission efficiency.
ETAOIN SHRDLU - English Letter Frequency Ranking
Arranging English letters by frequency gives "ETAOIN SHRDLU." This sequence has been known since the era of letterpress printing and was even adopted for the Linotype typesetting machine keyboard layout. The string "etaoin shrdlu" accidentally appearing in newspaper print was a common occurrence in the first half of the 20th century.
| Rank | Letter | Frequency | Morse Code Length | Frequency x Code Length |
|---|---|---|---|---|
| 1 | E | 12.70% | 1 | 0.127 |
| 2 | T | 9.06% | 1 | 0.091 |
| 3 | A | 8.17% | 2 | 0.163 |
| 4 | O | 7.51% | 3 | 0.225 |
| 5 | I | 6.97% | 2 | 0.139 |
| 6 | N | 6.75% | 2 | 0.135 |
| 7 | S | 6.33% | 3 | 0.190 |
| 8 | H | 6.09% | 4 | 0.244 |
| 9 | R | 5.99% | 3 | 0.180 |
| 10 | D | 4.25% | 3 | 0.128 |
The "Frequency x Code Length" column shows how much each letter contributes to overall transmission time. "E" has the highest frequency yet its contribution is held to 0.127 because its code length is 1. If "E" had been assigned 4 symbols, this value would jump to 0.508, dramatically increasing total transmission time.
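The arithmetic behind the "Frequency x Code Length" column can be reproduced in a few lines. This is a minimal sketch: the frequencies and codes are copied from the table above, while the names `TOP_TEN` and `contribution` are made up for illustration.

```python
# Frequencies (percent) and Morse codes for the ten letters in the table above.
TOP_TEN = {
    "E": (12.70, "."),
    "T": (9.06, "-"),
    "A": (8.17, ".-"),
    "O": (7.51, "---"),
    "I": (6.97, ".."),
    "N": (6.75, "-."),
    "S": (6.33, "..."),
    "H": (6.09, "...."),
    "R": (5.99, ".-."),
    "D": (4.25, "-.."),
}


def contribution(freq_percent: float, code: str) -> float:
    """Expected number of symbols this letter adds per letter of text."""
    return freq_percent / 100 * len(code)


contributions = {
    letter: contribution(freq, code) for letter, (freq, code) in TOP_TEN.items()
}

for letter, value in contributions.items():
    print(f"{letter}: {value:.3f}")

# Hypothetical scenario from the text: "E" stuck with a 4-symbol code.
e_with_four_symbols = contribution(12.70, "....")
print(f"E with 4 symbols: {e_with_four_symbols:.3f}")
```

Summing the values gives the expected number of symbols per letter for text made up of these ten letters only; extending the table to all 26 letters would give the figure for ordinary English text.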
However, Morse code is not perfectly optimized. "H" has the 8th highest frequency (6.09%) but is assigned 4 symbols, longer than the 3 symbols for 9th-ranked "R" (5.99%) and 10th-ranked "D" (4.25%). This is thought to be because Morse considered not just type case observations but also how easy codes were to distinguish by ear.
Prescience Shared with Huffman Coding
In 1952, MIT graduate student David Huffman published an algorithm for constructing optimal variable-length codes for data compression. Huffman coding is essentially the same idea as Morse's - assigning short bit sequences to high-frequency symbols and long ones to low-frequency symbols.
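The greedy construction at the heart of Huffman coding (repeatedly merge the two least frequent entries) can be sketched as follows. This is an illustrative implementation, not a reference one: `huffman_code` is a hypothetical helper name, and the five-letter sample reuses frequencies quoted earlier in this article.

```python
import heapq
from itertools import count


def huffman_code(freqs: dict[str, float]) -> dict[str, str]:
    """Build a binary Huffman code for a symbol -> weight mapping."""
    tiebreak = count()  # unique counter so equal weights never compare dicts
    heap = [(weight, next(tiebreak), {sym: ""}) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: a single symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        w1, _, low = heapq.heappop(heap)   # least frequent subtree
        w2, _, high = heapq.heappop(heap)  # second least frequent subtree
        merged = {sym: "0" + code for sym, code in low.items()}
        merged.update({sym: "1" + code for sym, code in high.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


# Frequencies in percent, taken from the tables in this article.
sample = {"E": 12.70, "T": 9.06, "A": 8.17, "Q": 0.10, "Z": 0.07}
code = huffman_code(sample)
for sym in sample:
    print(sym, code[sym])
```

On this sample, "E" receives a 1-bit code while "Q" and "Z" receive 4-bit codes, mirroring the short-for-frequent pattern Morse arrived at by hand.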
| Comparison | Morse Code (1838) | Huffman Coding (1952) |
|---|---|---|
| Design principle | Short codes for frequent letters | Short bit sequences for frequent symbols |
| Optimality | Empirical/intuitive (not perfectly optimal) | Mathematically optimal (as prefix code) |
| Code types | 3-valued: dot, dash, gap | 2-valued: 0 and 1 |
| Delimiter mechanism | Inter-character/word gaps | No delimiter needed (prefix property) |
| Application | Telegraph communication | Data compression (ZIP, JPEG, MP3, etc.) |
The decisive difference is the "prefix property." Huffman codes are designed so no code is a prefix of another, allowing unique decoding by reading the bit sequence from the beginning. Morse code, on the other hand, uses gaps (silence) between characters as delimiters - without these gaps, "...." could be "H," "I + I," or "I + E + E." Considering that information theory didn't exist in Morse's time, the very idea of frequency-based code length optimization was remarkably prescient.
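The ambiguity is easy to demonstrate. The sketch below uses the standard International Morse letter table; `is_prefix_free` and `decodings` are illustrative helper names. It checks the prefix property and lists every way a gap-free run of symbols can be read.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
CODES = (
    ".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- "
    "-. --- .--. --.- .-. ... - ..- ...- .-- -..- -.-- --.."
).split()
MORSE = dict(zip(LETTERS, CODES))


def is_prefix_free(codes) -> bool:
    """True if no code is a proper prefix of another code."""
    codes = list(codes)
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


def decodings(signal: str) -> list[str]:
    """Every way to split a gap-free dot/dash string into letters."""
    if not signal:
        return [""]
    results = []
    for letter, code in MORSE.items():
        if signal.startswith(code):
            rest = signal[len(code):]
            results += [letter + tail for tail in decodings(rest)]
    return results


print(is_prefix_free(MORSE.values()))  # "." (E) is a prefix of ".." (I), etc.
print(sorted(decodings("....")))
```

Without gaps, "...." has eight possible readings, including "H", "II", and "IEE"; the silence between letters is what selects one of them.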
Japanese Morse Code Design - Why "イ" Got One of the Shortest Codes
Japanese Morse code (Wabun Morse) was established around 1855. In Wabun Morse, each katakana character is assigned a code. Like the English version, frequently used Japanese characters tend to get shorter codes, but the assignment doesn't perfectly follow frequency order.
| Character | Wabun Morse Code | Code Length | Notes |
|---|---|---|---|
| イ | .- | 2 | Among the shortest. Particle "i" is frequent |
| ロ | .-.- | 4 | 2nd in iroha order |
| ハ | -... | 4 | Frequent as particle "wa" but 4 symbols |
| ニ | -.-. | 4 | 4th in iroha order |
| ホ | -.. | 3 | 5th in iroha order |
| ヘ | . | 1 | Shortest. Used as particle "e" |
| ト | ..-.. | 5 | Frequent as particle "to" but 5 symbols |
Wabun Morse code assignments don't follow frequency order as strictly as the English version. "ヘ" is the shortest at 1 symbol (single dot), but it's not necessarily the most frequent character in Japanese text. "ト" is used very frequently as a particle yet is assigned 5 symbols. The Wabun Morse design is thought to reflect a mix of iroha ordering influence and considerations for auditory distinguishability.
SOS - The Rationality Packed into 9 Symbols
Late on the night of April 14, 1912, Titanic wireless operator Jack Phillips repeatedly transmitted the "SOS" signal. The Morse code for SOS is "... --- ..." - a total of 9 symbols. This signal was adopted as the international distress signal in 1906, but not because it stands for "Save Our Souls"; that reading was attached to it afterwards.
The real reason SOS was chosen is its clarity as Morse code. Both "..." (S) and "---" (O) consist of repeated identical symbols, making them hard to mishear in noisy radio environments. Furthermore, the symmetrical rhythm of three dots, three dashes, three dots is unlikely to be confused with any other character sequence.
During the Titanic disaster, the older distress signal "CQD" (popularly read as "Come Quick, Danger") was initially used. CQD's Morse code is "-.-. --.- -.." at 11 symbols (4 + 4 + 3) - 2 more than SOS's 9 symbols, making it slightly less efficient for emergency transmission. Phillips switched to SOS partway through, and this decision is said to have expedited the alert to the rescue ship Carpathia.
Average Transmission Time Per Character
Morse code transmission speed is measured in "WPM" (Words Per Minute). Speed is defined by how many times the standard reference word "PARIS" can be sent per minute. "PARIS" was chosen as the reference because its Morse code length approximates the average English text code length.
The Morse code for "PARIS" is ".--. .- .-. .. ..." which totals 50 units when a dot is 1 unit, a dash is 3 units, and the gaps between symbols, between letters, and after the word are 1, 3, and 7 units respectively. So 1 WPM = 50 units per minute. Skilled telegraph operators could transmit at 20-30 WPM, equivalent to 1,000-1,500 units per minute, or about 17-25 units per second.
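The 50-unit figure can be checked directly. The sketch below assumes the conventional timing ratios (dash = 3 units, gap between symbols = 1, between letters = 3, after a word = 7); the function names are illustrative.

```python
MORSE = {"P": ".--.", "A": ".-", "R": ".-.", "I": "..", "S": "..."}

# Conventional timing ratios, in dot units (an assumption of this sketch).
DOT, DASH = 1, 3
SYMBOL_GAP, LETTER_GAP, WORD_GAP = 1, 3, 7


def letter_units(code: str) -> int:
    """Duration of one letter, including the gaps between its symbols."""
    marks = sum(DOT if symbol == "." else DASH for symbol in code)
    return marks + SYMBOL_GAP * (len(code) - 1)


def word_units(word: str) -> int:
    """Duration of a word, including letter gaps and the trailing word gap."""
    letters = sum(letter_units(MORSE[ch]) for ch in word)
    return letters + LETTER_GAP * (len(word) - 1) + WORD_GAP


def units_per_second(wpm: float) -> float:
    """Dot units sent per second at a given PARIS-based speed."""
    return word_units("PARIS") * wpm / 60


def dot_seconds(wpm: float) -> float:
    """Length of a single dot, in seconds, at a given speed."""
    return 1 / units_per_second(wpm)


print(word_units("PARIS"))
print(units_per_second(20), units_per_second(30))
print(dot_seconds(20))
```

At 20 WPM a dot lasts 0.06 seconds, which is the basis of the familiar rule of thumb that dot length in milliseconds is 1200 divided by the WPM.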
Compared to modern text communication, Morse code transmission speed is extremely slow. But considering the technology level of the 1840s, being able to send messages in real-time to locations hundreds of kilometers away was revolutionary. As discussed in the SMS character limit article, SMS's 160-character limit also arose from technical constraints, but in the Morse code era, the concept of "character limits" didn't even exist - each character was manually keyed one by one.
Philosophical Connection to Modern Variable-Length Encoding
Morse code's design philosophy of "short codes for frequent characters" lives on in modern computing. The most familiar example is UTF-8 encoding.
| Character Type | UTF-8 Bytes | Example Characters | Design Intent |
|---|---|---|---|
| ASCII characters (alphanumeric) | 1 byte | A, B, 0, 1, @ | Shortest for most frequent characters in English |
| Latin Extended / Greek | 2 bytes | é, ñ, α, β | Additional European language characters |
| Japanese / Chinese / Korean | 3 bytes | あ, 漢, 한 | CJK characters at 3 bytes |
| Emoji / Special characters | 4 bytes | 😀, 🎉, 𠮷 | Supplementary plane characters |
UTF-8 represents ASCII characters (alphanumeric and symbols) - the most frequently used on the internet - in 1 byte, and spends more bytes on characters with higher code points. The byte count follows the code point range rather than measured frequency, but the effect resembles the optimization Morse performed at the type case: the most common characters are the cheapest to send. As explained in detail in the difference between character count and byte count, "あ" is 3 bytes in UTF-8 while "A" is 1 byte. Both are 1 character, but the data size differs by a factor of 3.
Understanding Unicode basics reveals the design philosophy of variable-length encoding more deeply. One reason UTF-8 is used by over 98% of websites worldwide is its backward compatibility - representing English text in the same 1 byte as ASCII. If all characters were fixed-length (say 4 bytes), English text file sizes would quadruple.
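These byte counts can be confirmed with Python's built-in `str.encode`. The sample characters are chosen for illustration, and UTF-32 stands in for the hypothetical fixed 4-byte encoding mentioned above.

```python
# "\u00e9" is the precomposed form of e-acute, written as an escape so the
# example does not depend on how the source file itself is normalized.
samples = ["A", "\u00e9", "α", "あ", "漢", "한", "😀", "𠮷"]

byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} {ch}: {n} byte(s)")

# Fixed-width comparison: UTF-32 spends 4 bytes on every character.
english = "hello"
utf8_size = len(english.encode("utf-8"))
utf32_size = len(english.encode("utf-32-le"))  # "-le" variant writes no BOM
print(utf8_size, utf32_size)
```

Every character in the sample is a single code point, so the differences in size come purely from the encoding.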
Numeric Morse Codes - Why All 5 Symbols
While letter code lengths vary from 1 to 4 symbols, numeric Morse codes (0-9) are all uniformly 5 symbols.
| Digit | Morse Code | Pattern |
|---|---|---|
| 1 | .---- | 1 dot + 4 dashes |
| 2 | ..--- | 2 dots + 3 dashes |
| 3 | ...-- | 3 dots + 2 dashes |
| 4 | ....- | 4 dots + 1 dash |
| 5 | ..... | 5 dots |
| 6 | -.... | 1 dash + 4 dots |
| 7 | --... | 2 dashes + 3 dots |
| 8 | ---.. | 3 dashes + 2 dots |
| 9 | ----. | 4 dashes + 1 dot |
| 0 | ----- | 5 dashes |
Numbers are uniformly 5 symbols because digits don't have the stable frequency bias that letters do. In English text, "E" appears 12.7% of the time while "Z" appears only 0.07%, whereas digit frequency depends heavily on context: phone numbers are roughly uniform, while monetary amounts have more "0"s. With no rational basis for making specific digits shorter, all were made the same length.
Furthermore, the numeric codes have a beautiful regularity. From 1 to 5, dots increase one by one until 5 is all dots; from 6 to 0, dashes increase one by one until 0 is all dashes. This symmetrical pattern is easy to memorize, reducing telegraph operator training time.
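The regularity is simple enough that the whole digit table can be generated from the rule instead of memorized. A minimal sketch, with `digit_code` as an illustrative function name:

```python
def digit_code(digit: int) -> str:
    """Morse code for a single digit, built from the dot/dash counting rule."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    if 1 <= digit <= 5:
        # 1-5: that many leading dots, padded with dashes to 5 symbols.
        return "." * digit + "-" * (5 - digit)
    # 6-9 and 0: leading dashes, with 0 treated as the step after 9.
    dashes = digit - 5 if digit >= 6 else 5
    return "-" * dashes + "." * (5 - dashes)


for d in [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]:
    print(d, digit_code(d))
```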
Calculating Morse Code Transmission Efficiency
Let's quantitatively evaluate Morse code transmission efficiency from an information theory perspective. If all 26 letters were equally likely, each would carry log2(26), or about 4.7 bits; with real English letter frequencies, the information content (entropy) per character is lower, about 4.2 bits. Meanwhile, transmitting English text via Morse code requires an average code length of about 8.1 time units per character (with dot length as 1 unit).
If Morse had used completely random code assignment (ignoring frequency), the average code length would have been approximately 10.2 time units. This means Morse's frequency-based design achieved about 20% transmission time reduction compared to random assignment.
With theoretically optimal Huffman coding, the average code length would be about 7.6 time units. Morse code's 8.1 time units is only about 7% from the optimal value - remarkable precision for an empirical 19th-century design.
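Readers who want to experiment with this kind of calculation can start from the sketch below. It rests on assumptions not stated in the article: a commonly cited English letter-frequency table (consistent with the percentages quoted earlier), dash = 3 units, a 1-unit gap between symbols, and an optional `letter_gap` added after each letter. The results depend on all of these choices, so they need not reproduce the exact figures quoted above.

```python
import math

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
CODES = (
    ".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- "
    "-. --- .--. --.- .-. ... - ..- ...- .-- -..- -.-- --.."
).split()
MORSE = dict(zip(LETTERS, CODES))

# Assumed letter frequencies in percent (a commonly cited table).
FREQ = {
    "E": 12.702, "T": 9.056, "A": 8.167, "O": 7.507, "I": 6.966,
    "N": 6.749, "S": 6.327, "H": 6.094, "R": 5.987, "D": 4.253,
    "L": 4.025, "C": 2.782, "U": 2.758, "M": 2.406, "W": 2.360,
    "F": 2.228, "G": 2.015, "Y": 1.974, "P": 1.929, "B": 1.492,
    "V": 0.978, "K": 0.772, "J": 0.153, "X": 0.150, "Q": 0.095,
    "Z": 0.074,
}
UNIFORM = {letter: 1.0 for letter in LETTERS}  # frequency-blind baseline


def entropy_bits(freq: dict[str, float]) -> float:
    """First-order entropy in bits per letter (ignores context)."""
    total = sum(freq.values())
    return -sum(f / total * math.log2(f / total) for f in freq.values())


def letter_units(code: str) -> int:
    """Duration in dot units: dot = 1, dash = 3, 1-unit gap between symbols."""
    marks = sum(1 if symbol == "." else 3 for symbol in code)
    return marks + (len(code) - 1)


def mean_units(freq: dict[str, float], letter_gap: int = 0) -> float:
    """Frequency-weighted average duration per letter."""
    total = sum(freq.values())
    return sum(
        f / total * (letter_units(MORSE[letter]) + letter_gap)
        for letter, f in freq.items()
    )


print(f"entropy, English frequencies: {entropy_bits(FREQ):.2f} bits")
print(f"entropy, uniform letters:     {entropy_bits(UNIFORM):.2f} bits")
for gap in (0, 3):
    actual = mean_units(FREQ, gap)
    baseline = mean_units(UNIFORM, gap)
    saving = 1 - actual / baseline
    print(f"letter_gap={gap}: {actual:.2f} vs {baseline:.2f} units ({saving:.0%} saved)")
```

The uniform baseline averages over all ways of assigning the same 26 codes to letters without regard to frequency, which is one way to model "random assignment".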
The Meaning of Counting Character Efficiency
About 190 years after Morse code's design, "character efficiency" remains an issue everywhere. X (formerly Twitter) weights characters when applying its limit: each Japanese character counts double, so a post holds 280 English characters but only 140 Japanese ones. Even so, the information content per character differs greatly. A single Japanese kanji can carry the meaning of an entire English word, so a 140-character Japanese post can convey far more than 140 characters of English.
When Morse was counting characters in front of the type case, he was tackling the universal problem of "efficient information transmission." That problem has been inherited in changing forms by modern web developers puzzling over the difference between fullwidth and halfwidth characters, and by generative AI users optimizing prompt character counts. Behind the act of counting characters always lies the essence of information theory: "conveying maximum information with limited resources."