IBM®
Skip to main content
    United States change      Terms of use
 
 
Select a scope:    
     Home      Products      Services & industry solutions      Support & downloads      My account     
alphaWorks  >  Information management  >  

IBM Hash Suffix Array Delta Compression

A new differential compression algorithm that combines the hash value and suffix array techniques.


Date Posted: November 27, 2007
OverviewRequirements Download FAQs Forum Reviews

What is IBM Hash Suffix Array Delta Compression?

This new differential compression algorithm combines the hash value techniques and suffix array techniques of previous work. Differential compression refers to encoding a file (called a version file) as a set of changes with respect to another file (a reference file). Previous differential compression algorithms can be shown empirically to run in linear time, but they have certain drawbacks: They do not find the best matches for every offset of the version file. This algorithm finds the best matches for every offset of the version file with respect to a certain level of detail and above a certain length threshold.

The algorithm has two variations that depend on how we choose the block size. If the block size is kept fixed, the compression performance of this algorithm is similar to that of the greedy algorithm, without the expensive space and time requirements. If the block size is varied linearly with the reference file size, this algorithm can run in linear time and constant space. It can also be shown empirically that this algorithm is faster and more compressed than certain differential compression algorithms.

How does it work?

This technology's utility sources are -c89 standard compatible. IBM® Hash Suffix Array Delta Compression is compiled for Windows®, Linux®, and Solaris systems. It is believed that, with little or no modification, it can also be compiled for other systems upon request.

The utility uses bzip2 (1.0.2) as the second-level compression algorithm.

Further information is available in the following paper: An approximation to the greedy algorithm for differential compression.

The following code shows how to run this algorithm:

On Linux and cygwin:

    ./hsadelta -[c|d] -r <reffilename> -v <verfilename> -o <deltafilename> [-m maxMem]
    

    [c|d]: c = compress, d = decompress (reconstruct)
    <MaxMem>: in MB, default 200 (optional)

On Windows:
    /hsadelta  [c|d] -r <reffilename> -v <verfilename> -o <deltafilename> [-m maxMem]
    

    [c|d]: c = compress, d = decompress (reconstruct)
    <MaxMem>: in MB, default 200 (optional)
    On Windows, there is no minus sign (-) before "c|d".

Example (on Linux):
    reference file: linux-2.2.26.tar
    version file: linux-2.6.22.9.tar

    To compress:

    ./hsadelta -c -r linux-2.2.26.tar -v linux-2.6.22.9.tar -o linux-delta
    

    To reconstruct:

    ./hsadelta -d -r linux-2.2.26.tar -v linux-2.6.22.9-reconstruct.tar -o linux-delta
    


About the technology author(s):
Karan Gupta is a software engineer at IBM Research in San Jose, CA. He can be reached through e-mail.

Ramesh C. Agarwal is an IBM Fellow Emeritus at IBM Research in San Jose, CA. He can be reached through e-mail.


Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
IBM is a trademark of IBM Corporation in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.

Download now Download now

Related technologies

For platform(s):
Intel Linux 2.2.12, Intel Linux 2.2.14, Intel Linux 2.4

For topics:
data management, utilities


 

    About IBM Privacy Contact