Written by : Peter Quiring (Jan 4/97)
start: code=258 restart: prefix=getbyte() suffix=getbyte() hash.pre[code]=prefix hash.sux[code]=suffix output(prefix) goto1: code=code+1 prefix=suffix ;use last byte for prefix suffix=getbyte() check: for (a=258;a<code;a++) { if prefix==hash.pre[a] && suffix==hash.suf[a] { prefix=a suffix=getbyte() goto check } } ;a new string is found hash.pre[code]=prefix hash.suf[code]=suffix output(prefix) goto goto1Of course this is very basic and does not check for any conditions such as increasing the size of the output and such. There should be a variable that holds the number of bits that should be outputed, lets call it 'nb' (Number of Bits). nb should be init 9. And right after we increment code we should do the following:
if (code==512 || code==1024 || code==2048) nb=nb+1Then we should also include a line right after that if code==4096 in which case we must flush the whole system
if (code==4096) { code=258 output(suffix) output(256) ;the clear code nb=9 ;reset to 9 bits goto restart }Now the function getbyte() should also check for when the input data is all consumed at which point we must output any information that has not been output()ed. But this becomes a little difficult because at each point that getbyte() is used the action of EOF is different. So getbyte() should return an error code that should be checked after calling it and if an error is detected it should finish up at that point. The output() function is a very difficult thing, it must be able to output 9,10,11 and 12 bits to output area. The easiest way is to do that is 1 bit at a time (using shl and shr is the best way), although there are faster ways but it's really hard. So here is a more complete example:
unsigned char *input_data; int bufip; //BUFfer Input Pos int getbyte(void) { ;get data how ever you will do it if endofdata? { return -1 } else { return (input_data[bufip++]); } } int bo; //Bit Offset int bufop; //BUFfer Output Pos unsigned char *output_data; output(int dat) { ;This is difficult to do in C, just make sure you output only 9,10,11 ;or 12bit at a time. Here is what it might look like int a; int b; for (a=0;aAnd that's it. You'll probably need an example to see what is exactly happening so here is one:>=(nb-a); //get next bit b&=1; //mask bit #1 b<<=bo; //bo is our Bit Offset (global variable) output_data[bufop]||=b; if (bo!=0) bo--; else { bo=7; output_data[++bufop]=0; } } } start: code=258 bo=7 ;start at bit 7 bufop=0 ;start output buffer at 0 bufip=0 ;start input buffer at 0 output_data[0]=0; restart: prefix=getbyte() suffix=getbyte() if (prefix==-1) goto end if (suffix==-1) { output(prefix) goto end } hash.pre[code]=prefix hash.sux[code]=suffix output(prefix) goto1: code=code+1 if (code==512 || code==1024 || code==2048) nb=nb+1 if (code==4096) { code=258 output(suffix) ;output last code output(256) ;the clear code nb=9 ;reset to 9 bits goto restart } prefix=suffix ;use last byte for prefix suffix=getbyte() if (suffix==-1) { output(prefix) goto end } check: for (a=258;a<code;a++) { if prefix==hash.pre[a] && suffix==hash.suf[a] { prefix=a suffix=getbyte() goto check } } ;a new string is found hash.pre[code]=prefix hash.suf[code]=suffix output(prefix) goto goto1 end:
Input data = "abcabcaabcd" ;just some random letters 1st step: prefix=a, suffix=b, output(prefix) Output data = "a" Hash table: PRE SUX 102h : a b 2nd step: prefix=suffix which is b, suffix=c, output(prefix) Output data = "ab" Hash table: PRE SUX 102h : a b 103h : b c 3rd step: prefix=c, suffix=a, output(prefix) Output data = "abc" Hash table: PRE SUX 102h : a b 103h : b c 104h : c a 4th step: prefix=a, suffix=b, BUT this is already in the table so prefix=102h which is the code in the hash table with the string and then suffix=c (the next code) and this is not in the hash table output(prefix) Output data = "abc(102h)" Hash table: PRE SUX 102h : a b 103h : b c 104h : c a 105h : 102h c ;if this is expanded it will be "abc" 5th step: prefix=c, suffix=a, output(prefix) Output data = "abc(102h)c" Hash table: PRE SUX 102h : a b 103h : b c 104h : c a 105h : 102h c 106h : c a 6th step: prefix=a, suffix=a, output(prefix) Output data = "abc(102h)ca" Hash table: PRE SUX 102h : a b 103h : b c 104h : c a 105h : 102h c 106h : c a 107h : a a 7th step: prefix=a, suffix=b BUT this is already in the hash table prefix=102h, suffix=c BUT this is too prefix=105h, suffix=d ok this is a new string output(prefix) Output data = "abc(102h)ca(105h)" Hash table: PRE SUX 102h : a b 103h : b c 104h : c a 105h : 102h c 106h : c a 107h : a a 108h : 105h d 8th step: we are out of data so we clean up output(suffix) ;'d' output(100h) ;clear code Final Output data = "abc(102h)ca(105h)d"So we compressed 11 bytes (88bits) into 8 codes. But each code is 9bit giving us 72 bits. If you look at the hash table you can easily see that our output data is running down the PREfix collumn except for the last code in the SUX (suffix) location.
start: oc=getbyte() output(oc) code=102h restart: nc=getbyte() suffix=outstr(nc) ;outstr will output nc and return the 1st ; char of the string prefix=oc hash.pre[code]=prefix hash.sux[code]=suffix oc=nc code=code+1 goto restartOf course once again that is very basic and does not check for EOF or anything else. Let's try using the data from the compression demo.
Input data = "abc(102h)ca(105h)d" ;from the final compressed data 1st step: oc=a, output(oc) Output data = "a" 2nd step: nc=b, suffix=b (outstr(nc)), prefix=a(oc), oc=b(nc) Output data = "ab" And after each step we place prefix and suffix into out hash table Hash table: PRE SUX 102h : a b 3rd step: nc=c, suffix=c (outstr(nc)), prefix=b(oc), oc=c(nc) Output data = "abc" Hash table: PRE SUX 102h : a b 103h : b c 4th step: nc=102h, suffix=a (outstr(nc)) ;the 1st char of 102h is 'a' prefix=c(oc), oc=102h(nc) Output data = "abcab" Hash table: PRE SUX 102h : a b 103h : b c 104h : c a 5th step: nc=c, suffix=c (outstr(nc)), prefix=102h(oc), oc=c(nc) Output data = "abcabc" Hash table: PRE SUX 102h : a b 103h : b c 104h : c a 105h : 102h c 6th step: nc=a, suffix=a (outstr(nc)), prefix=c(oc), oc=a(nc) Output data = "abcabca" Hash table: PRE SUX 102h : a b 103h : b c 104h : c a 105h : 102h c 106h : c a 7th step: nc=105h, suffix=a (outstr(nc)), prefix=a(oc), oc=105h(nc) Output data = "abcabcaabc" Hash table: PRE SUX 102h : a b 103h : b c 104h : c a 105h : 102h c 106h : c a 107h : a a 8th step: nc=d, suffix=d (outstr(nc)), prefix=105h(oc), oc=d(nc) Output data = "abcabcaabcd" Hash table: PRE SUX 102h : a b 103h : b c 104h : c a 105h : 102h c 106h : c a 107h : a a 108h : 105h d Final output data = "abcabcaabcd" ;which is what is started with from the ;very beginingWell that is for that, there is just one more major thing. Try to compress this data and then decompress it:
1st step:oc=a, output(oc) Output data = "a" 2nd step: nc=102h, suffix=a (outstr(nc)), prefix=a(oc), oc=102h(nc)But the second is very tricky, here nc will be 102h and then we use outstr() to output nc. But 102h is not yet in the hash table, why? Well remember when I showed you that the suffix of a code is the same as the 1st char of the next string. Well when we regenerate the hash table here we really are just using the 1st char of the last string to find out the suffix and if the nc is the code we are just trying to create then there is a problem. But the solution is easy, in this case all we do is let nc equal oc (just for the duration of the output() function, we do not litterilly assign oc to nc) and then we output suffix once after then output(nc) call. And this fixes our problem. This is because the 1st and last char of the next string is the same. If this is confusing well take a look at the rest of the output process:
2nd step: nc=102h, suffix=a (output(nc)) BUT nc is not yet defined so we use do this instead suffix=output(oc), output(suffix) prefix=a(oc), oc=102h(nc) Output data = "aaa" Hash table: PRE SUX 102h : a a 3rd step: nc=a, suffix=a (output(nc)), prefix=102h(oc), oc=a(nc) Output data = "aaaa" Hash table: PRE SUX 102h : a a 103h : 102h a Final Output data = "aaaa"And that's it, here's the final code:
unsigned char *input_data; int bufip; //BUFfer Input Pos int bo; //Bit Offset int getbyte(void) { ;get data must get data of size 9,10,11 or 12 bits ;This is difficult to do in C, just make sure you output only 9,10,11 ;or 12bit at a time. Here is what it might look like int a; int b; int dat; int ret; ret=0; //to be returned dat=input_data[bufip]; for (a=0;aWell that's as best as I can do in 2 days. That final code is not really that great but here is the full thing:>=bo; //get next bit b&=1; //mask bit #1 ret<<=1; ret|=b; if (bo!=0) bo--; else { bo=7; dat=input_data[++bufip]; } } return ret; } int bufop; //BUFfer Output Pos unsigned char *output_data; output(unsigned char dat) { output_data[bufop++]=dat; } int code; outstr(int dat) { ;this is very difficult to implement in pseudo-code, it requires ;some sort of stack. Check out the ASM file at the end of the file } start: code=258 bo=7 ;start at bit 7 bufop=0 ;start output buffer at 0 bufip=0 ;start input buffer at 0 output_data[0]=0; oc=getbyte() output(oc) restart: nc=getbyte() suffix=outstr(nc) ;outstr will output nc and return the 1st ; char of the string prefix=oc hash.pre[code]=prefix hash.sux[code]=suffix oc=nc code=code+1 goto restart