Written by : Peter Quiring (Jan 4/97)
start:
code=258
restart:
prefix=getbyte()
suffix=getbyte()
hash.pre[code]=prefix
hash.sux[code]=suffix
output(prefix)
goto1:
code=code+1
prefix=suffix ;use last byte for prefix
suffix=getbyte()
check:
for (a=258;a<code;a++) {
if prefix==hash.pre[a] && suffix==hash.suf[a] {
prefix=a
suffix=getbyte()
goto check
}
}
;a new string is found
hash.pre[code]=prefix
hash.suf[code]=suffix
output(prefix)
goto goto1
Of course this is very basic and does not check for any conditions such
as increasing the size of the output and such.
There should be a variable that holds the number of bits that should be
outputed, lets call it 'nb' (Number of Bits). nb should be init 9. And
right after we increment code we should do the following:if (code==512 || code==1024 || code==2048) nb=nb+1Then we should also include a line right after that if code==4096 in which case we must flush the whole system
if (code==4096) {
code=258
output(suffix)
output(256) ;the clear code
nb=9 ;reset to 9 bits
goto restart
}
Now the function getbyte() should also check for when the input data is all
consumed at which point we must output any information that has not been
output()ed. But this becomes a little difficult because at each point that
getbyte() is used the action of EOF is different. So getbyte() should return
an error code that should
be checked after calling it and if an error is detected it should finish
up at that point. The output() function is a very difficult thing, it must
be able to output 9,10,11 and 12 bits to output area. The easiest way
is to do that is 1 bit at a time (using shl and shr is the best way), although
there are faster ways but it's really hard. So here is a more complete
example:
unsigned char *input_data;
int bufip; //BUFfer Input Pos
int getbyte(void) {
;get data how ever you will do it
if endofdata? {
return -1
} else {
return (input_data[bufip++]);
}
}
int bo; //Bit Offset
int bufop; //BUFfer Output Pos
unsigned char *output_data;
output(int dat) {
;This is difficult to do in C, just make sure you output only 9,10,11
;or 12bit at a time. Here is what it might look like
int a;
int b;
for (a=0;a>=(nb-a); //get next bit
b&=1; //mask bit #1
b<<=bo; //bo is our Bit Offset (global variable)
output_data[bufop]||=b;
if (bo!=0) bo--;
else {
bo=7;
output_data[++bufop]=0;
}
}
}
start:
code=258
bo=7 ;start at bit 7
bufop=0 ;start output buffer at 0
bufip=0 ;start input buffer at 0
output_data[0]=0;
restart:
prefix=getbyte()
suffix=getbyte()
if (prefix==-1) goto end
if (suffix==-1) {
output(prefix)
goto end
}
hash.pre[code]=prefix
hash.sux[code]=suffix
output(prefix)
goto1:
code=code+1
if (code==512 || code==1024 || code==2048) nb=nb+1
if (code==4096) {
code=258
output(suffix) ;output last code
output(256) ;the clear code
nb=9 ;reset to 9 bits
goto restart
}
prefix=suffix ;use last byte for prefix
suffix=getbyte()
if (suffix==-1) {
output(prefix)
goto end
}
check:
for (a=258;a<code;a++) {
if prefix==hash.pre[a] && suffix==hash.suf[a] {
prefix=a
suffix=getbyte()
goto check
}
}
;a new string is found
hash.pre[code]=prefix
hash.suf[code]=suffix
output(prefix)
goto goto1
end:
And that's it. You'll probably need an example to see what is exactly
happening so here is one:
Input data = "abcabcaabcd" ;just some random letters
1st step: prefix=a, suffix=b, output(prefix)
Output data = "a"
Hash table: PRE SUX
102h : a b
2nd step: prefix=suffix which is b, suffix=c, output(prefix)
Output data = "ab"
Hash table: PRE SUX
102h : a b
103h : b c
3rd step: prefix=c, suffix=a, output(prefix)
Output data = "abc"
Hash table: PRE SUX
102h : a b
103h : b c
104h : c a
4th step: prefix=a, suffix=b, BUT this is already in the table
so prefix=102h which is the code
in the hash table with the string
and then suffix=c (the next code)
and this is not in the hash table
output(prefix)
Output data = "abc(102h)"
Hash table: PRE SUX
102h : a b
103h : b c
104h : c a
105h : 102h c ;if this is expanded it will be "abc"
5th step: prefix=c, suffix=a, output(prefix)
Output data = "abc(102h)c"
Hash table: PRE SUX
102h : a b
103h : b c
104h : c a
105h : 102h c
106h : c a
6th step: prefix=a, suffix=a, output(prefix)
Output data = "abc(102h)ca"
Hash table: PRE SUX
102h : a b
103h : b c
104h : c a
105h : 102h c
106h : c a
107h : a a
7th step: prefix=a, suffix=b BUT this is already in the hash table
prefix=102h, suffix=c BUT this is too
prefix=105h, suffix=d ok this is a new string
output(prefix)
Output data = "abc(102h)ca(105h)"
Hash table: PRE SUX
102h : a b
103h : b c
104h : c a
105h : 102h c
106h : c a
107h : a a
108h : 105h d
8th step: we are out of data so we clean up
output(suffix) ;'d'
output(100h) ;clear code
Final Output data = "abc(102h)ca(105h)d"
So we compressed 11 bytes (88bits) into 8 codes. But each code is 9bit giving
us 72 bits. If you look at the hash table you can easily see that our
output data is running down the PREfix collumn except for the last code
in the SUX (suffix) location.
start:
oc=getbyte()
output(oc)
code=102h
restart:
nc=getbyte()
suffix=outstr(nc) ;outstr will output nc and return the 1st
; char of the string
prefix=oc
hash.pre[code]=prefix
hash.sux[code]=suffix
oc=nc
code=code+1
goto restart
Of course once again that is very basic and does not check for EOF or
anything else. Let's try using the data from the compression demo.
Input data = "abc(102h)ca(105h)d" ;from the final compressed data
1st step: oc=a, output(oc)
Output data = "a"
2nd step: nc=b, suffix=b (outstr(nc)), prefix=a(oc), oc=b(nc)
Output data = "ab"
And after each step we place prefix and suffix into out hash table
Hash table: PRE SUX
102h : a b
3rd step: nc=c, suffix=c (outstr(nc)), prefix=b(oc), oc=c(nc)
Output data = "abc"
Hash table: PRE SUX
102h : a b
103h : b c
4th step: nc=102h, suffix=a (outstr(nc)) ;the 1st char of 102h is 'a'
prefix=c(oc), oc=102h(nc)
Output data = "abcab"
Hash table: PRE SUX
102h : a b
103h : b c
104h : c a
5th step: nc=c, suffix=c (outstr(nc)), prefix=102h(oc), oc=c(nc)
Output data = "abcabc"
Hash table: PRE SUX
102h : a b
103h : b c
104h : c a
105h : 102h c
6th step: nc=a, suffix=a (outstr(nc)), prefix=c(oc), oc=a(nc)
Output data = "abcabca"
Hash table: PRE SUX
102h : a b
103h : b c
104h : c a
105h : 102h c
106h : c a
7th step: nc=105h, suffix=a (outstr(nc)), prefix=a(oc), oc=105h(nc)
Output data = "abcabcaabc"
Hash table: PRE SUX
102h : a b
103h : b c
104h : c a
105h : 102h c
106h : c a
107h : a a
8th step: nc=d, suffix=d (outstr(nc)), prefix=105h(oc), oc=d(nc)
Output data = "abcabcaabcd"
Hash table: PRE SUX
102h : a b
103h : b c
104h : c a
105h : 102h c
106h : c a
107h : a a
108h : 105h d
Final output data = "abcabcaabcd" ;which is what is started with from the
;very begining
Well that is for that, there is just one more major thing. Try to compress
this data and then decompress it:1st step:oc=a, output(oc) Output data = "a" 2nd step: nc=102h, suffix=a (outstr(nc)), prefix=a(oc), oc=102h(nc)But the second is very tricky, here nc will be 102h and then we use outstr() to output nc. But 102h is not yet in the hash table, why? Well remember when I showed you that the suffix of a code is the same as the 1st char of the next string. Well when we regenerate the hash table here we really are just using the 1st char of the last string to find out the suffix and if the nc is the code we are just trying to create then there is a problem. But the solution is easy, in this case all we do is let nc equal oc (just for the duration of the output() function, we do not litterilly assign oc to nc) and then we output suffix once after then output(nc) call. And this fixes our problem. This is because the 1st and last char of the next string is the same. If this is confusing well take a look at the rest of the output process:
2nd step: nc=102h, suffix=a (output(nc)) BUT nc is not yet defined so
we use do this instead
suffix=output(oc), output(suffix)
prefix=a(oc), oc=102h(nc)
Output data = "aaa"
Hash table: PRE SUX
102h : a a
3rd step: nc=a, suffix=a (output(nc)), prefix=102h(oc), oc=a(nc)
Output data = "aaaa"
Hash table: PRE SUX
102h : a a
103h : 102h a
Final Output data = "aaaa"
And that's it, here's the final code:
unsigned char *input_data;
int bufip; //BUFfer Input Pos
int bo; //Bit Offset
int getbyte(void) {
;get data must get data of size 9,10,11 or 12 bits
;This is difficult to do in C, just make sure you output only 9,10,11
;or 12bit at a time. Here is what it might look like
int a;
int b;
int dat;
int ret;
ret=0; //to be returned
dat=input_data[bufip];
for (a=0;a>=bo; //get next bit
b&=1; //mask bit #1
ret<<=1;
ret|=b;
if (bo!=0) bo--;
else {
bo=7;
dat=input_data[++bufip];
}
}
return ret;
}
int bufop; //BUFfer Output Pos
unsigned char *output_data;
output(unsigned char dat) {
output_data[bufop++]=dat;
}
int code;
outstr(int dat) {
;this is very difficult to implement in pseudo-code, it requires
;some sort of stack. Check out the ASM file at the end of the file
}
start:
code=258
bo=7 ;start at bit 7
bufop=0 ;start output buffer at 0
bufip=0 ;start input buffer at 0
output_data[0]=0;
oc=getbyte()
output(oc)
restart:
nc=getbyte()
suffix=outstr(nc) ;outstr will output nc and return the 1st
; char of the string
prefix=oc
hash.pre[code]=prefix
hash.sux[code]=suffix
oc=nc
code=code+1
goto restart
Well that's as best as I can do in 2 days. That final code is not really
that great but here is the full thing: