Date Posted: November 27, 2007
What is IBM Hash Suffix Array Delta Compression?
This new differential compression algorithm combines the hash value techniques and suffix array techniques of previous work. Differential compression refers to encoding a file (called a version file) as a set of changes with respect to another file (a reference file). Previous differential compression algorithms can be shown empirically to run in linear time, but they have certain drawbacks: They do not find the best matches for every offset of the version file. This algorithm finds the best matches for every offset of the version file with respect to a certain level of detail and above a certain length threshold.
The algorithm has two variations that depend on how we choose the block size. If the block size is kept fixed, the compression performance of this algorithm is similar to that of the greedy algorithm, without the expensive space and time requirements. If the block size is varied linearly with the reference file size, this algorithm can run in linear time and constant space. It can also be shown empirically that this algorithm is faster and more compressed than certain differential compression algorithms.
How does it work?
This technology's utility sources are -c89 standard compatible. IBM® Hash Suffix Array Delta Compression is compiled for Windows®, Linux®, and Solaris systems. It is believed that, with little or no modification, it can also be compiled for other systems upon request.
The utility uses bzip2 (1.0.2) as the second-level compression algorithm.
Further information is available in the following paper: An approximation to the greedy algorithm for differential compression.
The following code shows how to run this algorithm:
- On Linux and cygwin:
./hsadelta -[c|d] -r <reffilename> -v <verfilename> -o <deltafilename> [-m maxMem]
[c|d]: c = compress, d = decompress (reconstruct)
<MaxMem>: in MB, default 200 (optional) - On Windows:
/hsadelta [c|d] -r <reffilename> -v <verfilename> -o <deltafilename> [-m maxMem]
[c|d]: c = compress, d = decompress (reconstruct)
<MaxMem>: in MB, default 200 (optional)
On Windows, there is no minus sign (-) before "c|d".
Example (on Linux):
- reference file: linux-2.2.26.tar
- version file: linux-2.6.22.9.tar
./hsadelta -c -r linux-2.2.26.tar -v linux-2.6.22.9.tar -o linux-delta
To reconstruct:
./hsadelta -d -r linux-2.2.26.tar -v linux-2.6.22.9-reconstruct.tar -o linux-delta
