How to : (de)compress data using LZW algo

Written by : Peter Quiring (Jan 4/97)


This tutorial is to help you understand the complex LZW algorithom so you can compress data. The LZW method I'm going to show you here may not be the exact algo used in the LHA file format nor am I going to show you the file format of it, I'm just going to go over a variation of the LZW compression that I learned and may have modified slightly. This LZW is almost good enough to decompress GIFs (with some tweaking it could I am told)

How does it work?



Data compression in it simplest form is just removing redundant data. In the LZW algo redundant data is removed as best as possible. It only makes one pass over the data. Oviously more passes and using different algos would increase data compress (which ARJ and ZIP do), but I'm just going to show you LZW.
How the LZW compression works is that as data is taken, strings of it are placed into a hash table. Each string in the hash table is given a code. If new data coming in is already in the hash table then the new data is replaced with the code of that in the hash table. And this is repeated until the hash table is full, it's then flushed and the compression continues until it reaches the end of data. At that point a code is outputed to signal end of data.
The input data can be of any size (4bits, 8bits, etc.) but I will work on the 8bit size only because any other size can be represent as 8bit data. Ok, so 8bits gives us 256 possible numbers, so which do we use for our code numbers that will be used in that hash table. Well we can not use these because they are for uncompressed data only, so what we do is add 1 bit to our output data. Therefore our output size is 9 bits and we can now use the numbers 256 thru 511 as codes for our hash table. But before we use code 256 for the 1st entry in our hash table we need to reserve some codes for special conditions.
256 will be our clear code. This will be outputed when ever the hash table needs to be cleared.
257 will be our end of file code. This will be outputed when we reach the end of the input data.
This leaves codes 258 thru 511 for our hash table. Well that is a really small hash table so what we will do is when we reach 512 we will increase the bit size of our output. So now our output size is 10bits. This gives our hash table codes 258 thru 1023. You can keep adding as many bits as you like but the decompression MUST use the same size. The max default used by LZW (and me) is 12bits. This gives us codes 258 thru 4095. Codes 0 thru 255 are for normal uncompressed data that could not be represented by a string in our hash table. When we reach 4096 we output the clear code (256) and then restart at code 258.

Compressing data:



To start off I'll explain some things we need to compress. The strings that are used in compression are split up into 2 parts. A prefix and a suffix, each is a code. If the code is from 0-255 (8bit) then it simply represents that number. And remember that 256 (100h) and 257 (101h) are special codes which are never used as a string part. If they are above 258 then they are a hash table entry number. So if it is 258 then it represents the whole string at that location within the hash table. The prefix and suffix parts are stored into the hash table as a pair to represent one string. And as the hash table grows, so does it's complexity. The suffix is always 8bit (0-255) but the prefix can be upto the max (12bits). To start off the data compression we take 2 bytes from the input data stream and assign the 1st as the prefix and the 2nd as the suffix. Then we place those into the hash table at 258 (102h) location. Then we output only the prefix. That is our first string. Then we get two more bytes starting from the last byte of the last string. If this string...
a) is NOT in our hash table it is put in our hash table in the next location and the prefix is outputed.
b) is in our hash table then our prefix becomes the code of the string already in the hash table and we get another byte from the input data stream which becomes our new suffix code.
And this is repeated until certain conditions:
a) we run out of data at which point we output the prefix and the suffix and then the EOD (end of data) code 257 (101h).
b) we reach code 4096 where we must send the reset code 256 (100h) which will be sent with 12bits (the max). And then we start over using code 258 for our 1st string.

Compression example:


Here is puesudo-code to show you how it is done:

start:
  code=258
restart:
  prefix=getbyte()
  suffix=getbyte()
  hash.pre[code]=prefix
  hash.sux[code]=suffix
  output(prefix)
goto1:
  code=code+1
  prefix=suffix  ;use last byte for prefix
  suffix=getbyte()
check:
  for (a=258;a<code;a++) {
    if prefix==hash.pre[a] && suffix==hash.suf[a] {
      prefix=a
      suffix=getbyte()
      goto check
    }
  }
  ;a new string is found
  hash.pre[code]=prefix
  hash.suf[code]=suffix
  output(prefix)
  goto goto1
Of course this is very basic and does not check for any conditions such as increasing the size of the output and such. There should be a variable that holds the number of bits that should be outputed, lets call it 'nb' (Number of Bits). nb should be init 9. And right after we increment code we should do the following:
  if (code==512 || code==1024 || code==2048) nb=nb+1
Then we should also include a line right after that if code==4096 in which case we must flush the whole system
  if (code==4096) {
    code=258
    output(suffix)
    output(256)  ;the clear code
    nb=9         ;reset to 9 bits
    goto restart
  }
Now the function getbyte() should also check for when the input data is all consumed at which point we must output any information that has not been output()ed. But this becomes a little difficult because at each point that getbyte() is used the action of EOF is different. So getbyte() should return an error code that should be checked after calling it and if an error is detected it should finish up at that point. The output() function is a very difficult thing, it must be able to output 9,10,11 and 12 bits to output area. The easiest way is to do that is 1 bit at a time (using shl and shr is the best way), although there are faster ways but it's really hard. So here is a more complete example:

unsigned char *input_data;
int bufip;  //BUFfer Input Pos

int getbyte(void) {
  ;get data how ever you will do it
  if endofdata? {
    return -1
  } else {
    return (input_data[bufip++]);
  }
}

int bo;      //Bit Offset
int bufop;   //BUFfer Output Pos
unsigned char *output_data;

output(int dat) {
  ;This is difficult to do in C, just make sure you output only 9,10,11
  ;or 12bit at a time.  Here is what it might look like
  int a;
  int b;

  for (a=0;a>=(nb-a); //get next bit
    b&=1;       //mask bit #1
    b<<=bo;     //bo is our Bit Offset (global variable)
    output_data[bufop]||=b;
    if (bo!=0) bo--;
      else {
        bo=7;
        output_data[++bufop]=0;
      }
  }
}

start:
  code=258
  bo=7     ;start at bit 7
  bufop=0  ;start output buffer at 0
  bufip=0  ;start input buffer at 0
  output_data[0]=0;
restart:
  prefix=getbyte()
  suffix=getbyte()
  if (prefix==-1) goto end
  if (suffix==-1) {
    output(prefix)
    goto end
  }
  hash.pre[code]=prefix
  hash.sux[code]=suffix
  output(prefix)
goto1:
  code=code+1
  if (code==512 || code==1024 || code==2048) nb=nb+1
  if (code==4096) {
    code=258
    output(suffix) ;output last code
    output(256)    ;the clear code
    nb=9           ;reset to 9 bits
    goto restart
  }
  prefix=suffix    ;use last byte for prefix
  suffix=getbyte()
  if (suffix==-1) {
    output(prefix)
    goto end
  }
check:
  for (a=258;a<code;a++) {
    if prefix==hash.pre[a] && suffix==hash.suf[a] {
      prefix=a
      suffix=getbyte()
      goto check
    }
  }
  ;a new string is found
  hash.pre[code]=prefix
  hash.suf[code]=suffix
  output(prefix)
  goto goto1
end:
And that's it. You'll probably need an example to see what is exactly happening so here is one:

 Input data = "abcabcaabcd"  ;just some random letters

 1st step: prefix=a, suffix=b, output(prefix)
  Output data = "a"
  Hash table:  PRE SUX
       102h :    a   b

 2nd step: prefix=suffix which is b, suffix=c, output(prefix)
  Output data = "ab"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c

 3rd step: prefix=c, suffix=a, output(prefix)
  Output data = "abc"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c
       104h :    c   a

 4th step: prefix=a, suffix=b,  BUT this is already in the table
                                so prefix=102h which is the code
                                in the hash table with the string
                                and then suffix=c (the next code)
                                and this is not in the hash table
                                output(prefix)
  Output data = "abc(102h)"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c
       104h :    c   a
       105h :  102h  c  ;if this is expanded it will be "abc"

 5th step: prefix=c, suffix=a, output(prefix)
  Output data = "abc(102h)c"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c
       104h :    c   a
       105h :  102h  c
       106h :    c   a

 6th step: prefix=a, suffix=a, output(prefix)
  Output data = "abc(102h)ca"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c
       104h :    c   a
       105h :  102h  c
       106h :    c   a
       107h :    a   a

 7th step: prefix=a, suffix=b  BUT this is already in the hash table
           prefix=102h, suffix=c  BUT this is too
           prefix=105h, suffix=d  ok this is a new string
           output(prefix)
  Output data = "abc(102h)ca(105h)"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c
       104h :    c   a
       105h :  102h  c
       106h :    c   a
       107h :    a   a
       108h :  105h  d

 8th step: we are out of data so we clean up
   output(suffix)  ;'d'
   output(100h)    ;clear code

 Final Output data = "abc(102h)ca(105h)d"

So we compressed 11 bytes (88bits) into 8 codes. But each code is 9bit giving us 72 bits. If you look at the hash table you can easily see that our output data is running down the PREfix collumn except for the last code in the SUX (suffix) location.
And if you look harder you will see that the 1st char of each string was the last char (suffix) of the last string. If we expand 105h we get "abc" and the 'a' was the previous suffix. If we expand 108h we get "abcd" and the 'a' was our previous suffix. This fact is very important later on in decompression.

Decompressing data:



Well if you thought compression was hard decompression is harder.
Now we have to use the input data and try to recreat the hash table with it and regenerate the original data, and it's tough.
We'll need two new variables called 'nc' (new code) and 'oc' (old code). We start by taking one code and outputing that and assigning it to our oc. Then we get another code and assign it to nc. Then we output the whole string that nc represents. Then suffix becomes the very 1st letter of nc (which will simply be nc if it's < 100h). And oc is assigned to prefix. Then these two are put into the hash table into the next open pos. Then nc is assigned to oc and the whole process is repeated until it's done. Here some simply code: (Note: the functions used here are nothing like those used in the compression routine)
start:
  oc=getbyte()
  output(oc)
  code=102h
restart:
  nc=getbyte()
  suffix=outstr(nc)  ;outstr will output nc and return the 1st
                     ; char of the string
  prefix=oc
  hash.pre[code]=prefix
  hash.sux[code]=suffix
  oc=nc
  code=code+1
  goto restart
Of course once again that is very basic and does not check for EOF or anything else. Let's try using the data from the compression demo.

 Input data = "abc(102h)ca(105h)d"  ;from the final compressed data

 1st step: oc=a, output(oc)
  Output data = "a"

 2nd step: nc=b, suffix=b (outstr(nc)), prefix=a(oc), oc=b(nc)
  Output data = "ab"
   And after each step we place prefix and suffix into out hash table
  Hash table:  PRE SUX
       102h :    a   b

 3rd step: nc=c, suffix=c (outstr(nc)), prefix=b(oc), oc=c(nc)
  Output data = "abc"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c

 4th step: nc=102h, suffix=a (outstr(nc)) ;the 1st char of 102h is 'a'
           prefix=c(oc), oc=102h(nc)
  Output data = "abcab"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c
       104h :    c   a

 5th step: nc=c, suffix=c (outstr(nc)), prefix=102h(oc), oc=c(nc)
  Output data = "abcabc"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c
       104h :    c   a
       105h :  102h  c

 6th step: nc=a, suffix=a (outstr(nc)), prefix=c(oc), oc=a(nc)
  Output data = "abcabca"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c
       104h :    c   a
       105h :  102h  c
       106h :    c   a

 7th step: nc=105h, suffix=a (outstr(nc)), prefix=a(oc), oc=105h(nc)
  Output data = "abcabcaabc"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c
       104h :    c   a
       105h :  102h  c
       106h :    c   a
       107h :    a   a

 8th step: nc=d, suffix=d (outstr(nc)), prefix=105h(oc), oc=d(nc)
  Output data = "abcabcaabcd"
  Hash table:  PRE SUX
       102h :    a   b
       103h :    b   c
       104h :    c   a
       105h :  102h  c
       106h :    c   a
       107h :    a   a
       108h :  105h  d  

 Final output data = "abcabcaabcd" ;which is what is started with from the
                                   ;very begining
Well that is for that, there is just one more major thing. Try to compress this data and then decompress it:
data = "aaaa"
When you compress it you should get:
Compressed data = "a(102h)a"
And now let's try to decompress this, The 1st step is easy:
 1st step:oc=a, output(oc)
  Output data = "a"

 2nd step: nc=102h, suffix=a (outstr(nc)), prefix=a(oc), oc=102h(nc)
But the second is very tricky, here nc will be 102h and then we use outstr() to output nc. But 102h is not yet in the hash table, why? Well remember when I showed you that the suffix of a code is the same as the 1st char of the next string. Well when we regenerate the hash table here we really are just using the 1st char of the last string to find out the suffix and if the nc is the code we are just trying to create then there is a problem. But the solution is easy, in this case all we do is let nc equal oc (just for the duration of the output() function, we do not litterilly assign oc to nc) and then we output suffix once after then output(nc) call. And this fixes our problem. This is because the 1st and last char of the next string is the same. If this is confusing well take a look at the rest of the output process:
  2nd step: nc=102h, suffix=a (output(nc)) BUT nc is not yet defined so
                                           we use do this instead
                                           suffix=output(oc), output(suffix)
            prefix=a(oc), oc=102h(nc)
   Output data = "aaa"
   Hash table:  PRE SUX
        102h :    a   a

  3rd step: nc=a, suffix=a (output(nc)), prefix=102h(oc), oc=a(nc)
   Output data = "aaaa"
   Hash table:  PRE SUX
        102h :    a   a
        103h :  102h  a

  Final Output data = "aaaa"
And that's it, here's the final code:

unsigned char *input_data;
int bufip;  //BUFfer Input Pos
int bo;     //Bit Offset

int getbyte(void) {
  ;get data must get data of size 9,10,11 or 12 bits
  ;This is difficult to do in C, just make sure you output only 9,10,11
  ;or 12bit at a time.  Here is what it might look like
  int a;
  int b;
  int dat;
  int ret;

  ret=0;  //to be returned
  dat=input_data[bufip];
  for (a=0;a>=bo;     //get next bit
    b&=1;       //mask bit #1
    ret<<=1;
    ret|=b;
    if (bo!=0) bo--;
      else {
        bo=7;
        dat=input_data[++bufip];
      }
  }
  return ret;
}

int bufop;   //BUFfer Output Pos
unsigned char *output_data;

output(unsigned char dat) {
  output_data[bufop++]=dat;
}

int code;

outstr(int dat) {
  ;this is very difficult to implement in pseudo-code, it requires
  ;some sort of stack.  Check out the ASM file at the end of the file
}

start:
  code=258
  bo=7     ;start at bit 7
  bufop=0  ;start output buffer at 0
  bufip=0  ;start input buffer at 0
  output_data[0]=0;
  oc=getbyte()
  output(oc)
restart:
  nc=getbyte()
  suffix=outstr(nc)  ;outstr will output nc and return the 1st
                     ; char of the string
  prefix=oc
  hash.pre[code]=prefix
  hash.sux[code]=suffix
  oc=nc
  code=code+1
  goto restart
Well that's as best as I can do in 2 days. That final code is not really that great but here is the full thing:

LZW ASM Program : The full source of my LZW data compression and decompression functions (which are integrated already into QLIB v2.00).

Last notes:
Well that's it, I hope you can get it right away cause I've been programming that bloody thing for about 2 years now (on and off). It was originally working once in C code but I've just recently completely changed it to ASM source (sorry, I lost the C version in the process - from now on I'm going to save all my projects at every stage of development, starting now).
Anyways...hope you learn from it and enjoyed it as much as I had writing it.
Written by : Peter Quiring
Email : pquiring@geocities.com

Copyright © 1995-2005 Nexus Systems Privacy
SourceForge.net Logo