In 1977, two Israeli information theorists,
Abraham
Lempel and Jacob
Ziv, introduced a radically different way of compressing data - one
which would avoid the uncertainty of making predictions or the wastefulness
of pre-reading data. LZ77 and LZ78, two dictionary-based data compression
techniques described by these two researchers, provided a whole new way
of viewing the world of data compression.
Table of Contents
Lempel and Ziv's compression techniques of '77 and '78 gave rise to a series of variants which form part of the LZ family. The variants are essentially identical to the method from which they originate (either LZ77 or LZ78). Variations have to do with factors such as establishing how many bits to use for certain boundaries. A list of these coding algorithms is provided below (Table 1).
| LZ77 Variants | LZR | LZSS | LZB | LZH | |||
|---|---|---|---|---|---|---|---|
| LZ78 Variants | LZW | LZC | LZT | LZMW | LZJ | LZFG | |
As an example, the zip and unzip methods use the LZH techique. UNIX compress methods, on the other hand, belong to the LZW and LZC classes.
A question that would naturally arise at this point is "Why so many
variants?" The answer is...patents. Disputes over
patenting have actually slowed down the evolutionary progress of the LZ
method.
Table of Contents
Figure 1: A stream of a's and b's in a sample text
Rule: Separate this stream of characters into pieces of text so that the shortest piece of data is the string of characters that we have not seen so far.
According to the above rule, we see that the first piece of our sample text is a. The second piece must then be aa. If we go on like this, we obtain the breakdown of data illustrated in Figure 2:
Figure 2: How a sample text gets broken up
Figure 3: Indexing the pieces of our sample text
These indices are used to number the pieces of data. The empty string (start of text) has index 0. The piece indexed by 1 is a. Thus a, together with the initial string, must be numbered Oa. String 2, aa, will be numbered 1a, because it contains a, whose index is 1, and the new character a. In this way, we proceed to number all the pieces in terms of those that came before them. See Figure 4 for an illustration of this technique:
Figure 4: "Numbering" each phrase in the text
As we can see, this process of renaming pieces of text starts to pay
off. Small integers replace what were once long strings of characters.
We can now throw away our old stream of text and send the encoded information
to the receiver, as will be discussed in the next section.
Table of Contents
The number of bits needed to represent each integer with index n is at most equal to the number of bits needed to represent the (n-1)th index. For example, the number of bits needed to represent 6 in piece 8 is equal to 3, because it takes three bits to express 7 (the (n-1)th index) in binary. The integer in this piece could be any of the numbers representing the previous strings(i.e. at piece 8 we could have any number between 0..7 inclusive). There are a total of 8 different strings that preceed this one and to represent any one of them we need atleast 3 bits. Same line of reasoning applies to the number of bits allocated to other pieces.
Every character will occupy 8 bits because it will be represented in US ASCII format. Figure 5 illustrates this thinking process:
Figure 5: Calculating the number of bits occupied by each phrase
One of the advantages of Lempel-Ziv compression is that in a long string of text, the number of bits needed to transmit the coded information is peanuts compared to the actual length of the text. This becomes apparent when we use, for example, 12 bits to transmit the code 2b instead of 24 bits (8 + 8 + 8) needed for the actual text aab.
Property: On text files with independent random symbols occuring with probabilities p1, p2, p3,...,pk, the expected number of bits needed per symbol tends to the entropy:

The expected number of bits per symbol, in any compression method and
using the random independent symbol model, is at least equal to the entropy
(based on information-theoretical considerations). The Huffman code achieves
this, but it must have knowledge of the probabilities. Interestingly, the
Lempel-Ziv code performs equally well (on large input files), without that
knowledge. It is thus truly adaptive and asymptotically optimal.
Table of Contents
It is preferrable to decompress the file in one pass; otherwise, we will encounter a problem with temporary storage. This feature allows you to process many megs of information at one time.
Given the code, the receiver constructs a trie "on the fly". This must be identical to the trie contructed by the sender. It is illustrated below (Figure 6):
Figure 6: The trie built from our sample stream of a's and b's
Note: To find the contents of a particular piece, trace the path from the node to the root. For example, piece 2 will contain 1a.
The structure shown above is called a digital search tree and is part
of the trie family. It is different from the standard trie because the
internal nodes do not live only in the leaves.
Table of Contents
The Binary button converts text into a compressed form. Do note that
the vertical bars wouldn't appear normally, but are just there to improve
legibility. Also, since Java doesn't do USASCII, the applet translates
characters into numbers using some inherent Java scheme.
Detailed comments are also available.
For the source, click here.
Table of Contents
| Lempel-Ziv-Welch Algorithm | Dr. Ze-Nian Li's lecture notes on the popular LZW variant of the LZ data compression method, including algorithms for LZW compression and decompression, from Simon Fraser University. |
| CIS 650: Information Storage and Retrieval: Compression | LZW compression, decompression, implementation issues, and algorithms by Prof. Gary Perlman, of Ohio State University. |
| LZW and GIF explained ---- Steve Blackstock | A document from The PC Games Programmers Encylopedia 1.0, describing the basics of the Lempel-Ziv-Welch compression algorithm, by Steve Blackstock. |
| Data Compression: OTHER ADAPTIVE METHODS | Prof. Dan Hirschberg's (University of California, Irvine) notes illustrating and explaining the LZ coding technique. |
| The Redundancy of the Ziv-Lempel Algorithm for Memoryless Sources | A discussion of the redundancy of the Lempel-Ziv algorithm by Tjalling Tjalkens. |
| A sample problem on LZ Data Compression (Problem B) | An exercise involving the Lempel-Ziv compression and decompression methods, including explanations, by Aaron D. Wyner, from the IEEE Information Theory Society. |
| Part IV: Data Compression | David Cyganski, of the Worceseter Polytechnic Institute, presents here another illustrative set of notes on Lempel-Ziv encoding. |
| LZW compression | A summary on LZW Compression by Christopher Chisholm, from the Curtin School of Computing. |
2. Nelson, M. The Data Compression Book. MIT PUBL Inc, 1992
3. Lynch, Thomas J. Data Compression Techniques and Applications. Belmont California: Lifetime Learning Publications, 1988.
4. Storer, James A. Data Compression: Methods and Theory. Rockville,
Maryland: Computer Science Press, 1988.
Edited by Haroon Ali Agha (hagha@cgm.cs.mcgill.ca)
June 20, 1999
"SHMART...KILL KILL!!! Call me Abbie!"
Last updated June 27, 1999 by
Haroon Ali Agha
Copyright © 1997, Hala Khan, Patricia Bejerman, Patrick Lam. All rights reserved. Reproduction of all or part of this work is permitted for educational or research use provided that this copyright notice is included in any copy. Disclaimer: this collection of notes is experimental, and does not serve as-is as a substitute for attendance in the actual class.