Compression Ratio
In data compression, the percentage by which compression reduces the size of data relative to its original size. Text data is highly redundant and typically achieves compression ratios of 60% to 80%.
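Compression ratio is a metric that measures the efficiency of data compression. If the original data size is D and the compressed size is C, the compression ratio is calculated as (1 - C/D) × 100%. For example, a 100 KB text file compressed to 25 KB has a compression ratio of 75%. Some sources call this quantity space savings and use "compression ratio" for D/C instead (4:1 in the same example), so it is worth checking which definition is meant when comparing figures. Under the definition used here, a higher compression ratio means the data takes less storage and less network bandwidth to transmit.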
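The calculation can be written as a small helper. This is only a sketch of the (1 - C/D) × 100% formula; the function name compression_ratio is an illustrative choice, not part of any library.

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Return (1 - C/D) * 100, the compression ratio as a percentage."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return (1 - compressed_size / original_size) * 100


# The 100 KB -> 25 KB example: D = 100 KB, C = 25 KB
print(compression_ratio(100 * 1024, 25 * 1024))  # 75.0
```

A negative result means the "compressed" output is larger than the input, which can happen with very short or already-compressed data.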
Under general-purpose lossless compression, text data tends to achieve higher compression ratios than images or video, which are usually stored in formats that are already compressed. Natural language text contains many forms of redundancy: skewed character frequency distributions (in English, "e" is the most common letter), repeated words, and formulaic phrases. Compressing English text with gzip typically yields a 60% to 70% compression ratio. Japanese text encoded in UTF-8 achieves roughly 50% to 65%. The slightly lower ratio for Japanese is due to the large variety of kanji characters, which makes the character frequency distribution less skewed than in English.
On the web, gzip and Brotli compression of HTTP responses dramatically reduces the transfer size of text-based content. HTML, CSS, and JavaScript are all text data and benefit greatly from compression. Brotli typically produces output 15% to 25% smaller than gzip, and all major browsers support it. If a 10 KB HTML file is compressed to 2.5 KB with Brotli, only a quarter of the original bytes cross the network, which can shorten page load time noticeably on slow connections.
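As a rough sketch of how a server might decide whether to compress a response, the following uses Python's standard gzip module. The function encode_response and the sample HTML are hypothetical, and the header parsing is deliberately simplified (q-values are ignored). Brotli is left out because it needs a third-party package; in practice this negotiation is usually handled by the web server or framework configuration.

```python
import gzip


def encode_response(body: bytes, accept_encoding: str) -> tuple[bytes, dict[str, str]]:
    """Gzip the body when the client's Accept-Encoding header lists gzip.

    Simplification: q-values such as "gzip;q=0" are ignored.
    """
    offered = {
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    }
    if "gzip" in offered:
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return gzip.compress(body), headers
    # Client did not offer gzip: send the body unchanged
    return body, {"Vary": "Accept-Encoding"}


# Hypothetical, highly repetitive HTML payload
html = b"<ul>" + b'<li class="item">Hello, world</li>' * 200 + b"</ul>"
payload, headers = encode_response(html, "gzip, deflate, br")
print(len(html), "bytes ->", len(payload), "bytes", headers)
```

Repetitive markup like this compresses far better than typical pages, so the printed sizes illustrate the mechanism rather than a realistic ratio.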
Text compression algorithms fall into two broad categories. Entropy coding methods such as Huffman coding assign variable-length bit sequences based on character frequency, representing common characters with shorter sequences. Dictionary methods in the LZ77/LZ78 family detect repeated patterns in the text and replace them with references to earlier occurrences: LZ77 emits a distance back into the preceding text plus a match length, while LZ78 emits an index into a dictionary of phrases built up as it goes. gzip uses the DEFLATE algorithm, which combines LZ77 with Huffman coding.
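To make the Huffman idea concrete, here is a minimal sketch that builds a code table from character frequencies. It covers only the Huffman half of the picture (no LZ77 matching), it is not how DEFLATE implementations are actually written, and the function name and sample sentence are illustrative.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict[str, str]:
    """Build a prefix code where frequent characters get shorter bit strings."""
    if not text:
        return {}
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct character still needs one bit per symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {char: code built so far})
    heap = [(weight, i, {ch: ""}) for i, (ch, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


text = "this is an example of a huffman tree"
codes = huffman_codes(text)
encoded_bits = sum(len(codes[ch]) for ch in text)
print(encoded_bits, "bits vs", 8 * len(text), "bits at 8 bits per character")
```

In this sample the space character is the most frequent symbol, so it receives one of the shortest codes, while characters that appear once receive the longest.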
The relationship between character count and compression ratio has interesting properties. Two texts with the same character count can have vastly different compression ratios depending on their content. A string of the same character repeated (such as "aaaaaaaaaa") compresses extremely well, while a random character string is nearly incompressible. This connects directly to the concept of entropy in information theory: the more redundant the text, the higher the compression ratio.
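The contrast can be checked with a short experiment. The sketch below compares two 10,000-character strings using zlib (a DEFLATE implementation in Python's standard library) and a simple per-character entropy estimate. The helper names, the string length, and the fixed random seed are arbitrary choices made for illustration.

```python
import math
import random
import string
import zlib
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Empirical entropy in bits per character, from single-character frequencies."""
    counts = Counter(text)
    total = len(text)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def deflate_ratio(text: str) -> float:
    """Compression ratio (1 - C/D) * 100 using zlib at its highest level."""
    raw = text.encode("utf-8")
    return (1 - len(zlib.compress(raw, 9)) / len(raw)) * 100


rng = random.Random(0)  # fixed seed so the run is repeatable
alphabet = string.ascii_letters + string.digits
repeated = "a" * 10_000
scrambled = "".join(rng.choice(alphabet) for _ in range(10_000))

for label, sample in (("repeated", repeated), ("random", scrambled)):
    print(f"{label}: entropy={shannon_entropy(sample):.2f} bits/char, "
          f"ratio={deflate_ratio(sample):.1f}%")
```

The random alphanumeric string still shrinks somewhat, because each character carries only about 6 bits of information (62 possible symbols) yet occupies an 8-bit byte. Truly random bytes would show a ratio close to zero.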
In practice, compression ratio translates directly into lower storage cost and bandwidth use. When storing large volumes of log files or chat histories, text compression can reduce storage costs by 60% to 80%. Compressing API responses shortens response times on mobile networks and improves the user experience. The longer the text, the greater the benefit: a long input gives the compressor more repeated material to exploit, and the fixed overhead of the compressed format matters less. For long-form content delivery, enabling compression can therefore make a noticeable difference in performance.