In the name of science and engineering

Wed 17 December 2008

In the inaugural post of Aaron Toponce's series on compression, two critical errors are made, and a comment on the post highlights them:

"I wanted to see speed, and I figured what could be faster than a bunch of identical data? I was way wrong."

Aaron, you say you wanted to see speed, but you didn't tell the programs that; instead you asked for maximal compression. It should come as no surprise that -1 through -9 are not calibrated across compression tools. -9 means "I don't care how much more CPU it costs than -8, find me those extra bytes." You really shouldn't be surprised when an algorithm takes gobs of time after being asked to.

Your other mistake is about datasets. Lossless compression of completely random data is impossible (a simple counting argument shows no lossless scheme can shrink every input), and the developers of bzip2, gzip and lzma certainly aren't after that goal. Each of these algorithms is designed to work better on some datasets than on others. We shouldn't expect the Burrows-Wheeler transform to be a worthwhile investment on megabytes of zeros, for example (though I still have a hard time believing it works at all, on anything).
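
To make the dataset point concrete, here's a minimal Python sketch (mine, not part of Aaron's benchmark) that compresses a block of zeros and an equal-sized block of os.urandom output with the standard zlib, bz2 and lzma modules. The exact numbers will vary by machine and library version, but the zeros should shrink to a tiny fraction of their size while the random bytes barely shrink at all:

```python
import bz2
import lzma
import os
import zlib

SIZE = 4 * 1024 * 1024  # 4 MiB per dataset

datasets = {
    "zeros": b"\x00" * SIZE,     # highly redundant: the easy case
    "random": os.urandom(SIZE),  # incompressible: the impossible case
}

compressors = {
    "gzip (zlib)": lambda d: zlib.compress(d, 9),
    "bzip2": lambda d: bz2.compress(d, 9),
    "lzma": lambda d: lzma.compress(d, preset=9),
}

for data_name, data in datasets.items():
    for tool, compress in compressors.items():
        out = compress(data)
        print(f"{data_name:6s} {tool:11s} -> {len(out):>9d} bytes "
              f"({len(out) / len(data):.4%} of original)")
```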

In fact, the bzip2 implementation may be saved by what the bzip2 author calls a mistake:

The run-length encoder, which is the first of the compression transformations, is entirely irrelevant. The original purpose was to protect the sorting algorithm from the very worst case input: a string of repeated symbols. But algorithm steps Q6a and Q6b in the original Burrows-Wheeler technical report (SRC-124) show how repeats can be handled without difficulty in block sorting.

(emphasis and link mine)
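
For a concrete picture of what that first transformation buys on a worst-case input, here's a minimal sketch of a generic run-length encoder in Python. It is not bzip2's actual first-stage encoder, only an illustration, but it shows why a string of repeated symbols collapses to almost nothing before the block sorter ever sees it:

```python
def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Collapse runs of identical bytes into (byte value, run length) pairs.

    A generic run-length encoder for illustration only; bzip2's real
    first-stage format differs in detail.
    """
    runs = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1
        runs.append((data[i], j - i))
        i = j
    return runs


# A megabyte of zeros becomes a single (value, length) pair ...
print(rle_encode(b"\x00" * (1024 * 1024)))  # [(0, 1048576)]
# ... while mixed data produces one pair per run.
print(rle_encode(b"aaabccccd"))  # [(97, 3), (98, 1), (99, 4), (100, 1)]
```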

Still, I applaud your efforts to investigate and document, even if I think conclusions should be withheld pending further investigation. It's likely my own investigation was flawed in some way too, but I don't think it's worth peer review at this point: GoodMerge publishes its own findings.

It might be informative to:

  1. Publish your datasets. This may involve changing them slightly for privacy and copyright ;)

  2. Graph the relationship between CPU time, compression ratio, and the -# option, to illustrate the size of the tradeoffs available (a rough sketch for collecting those data points follows below).
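
As a rough starting point for item 2, here's a Python sketch (assuming the standard zlib, bz2 and lzma modules, with "sample.dat" as a stand-in for whatever dataset you publish) that collects (tool, level, seconds, ratio) points you could then feed to gnuplot or matplotlib:

```python
import bz2
import lzma
import time
import zlib

COMPRESSORS = {
    "gzip (zlib)": lambda d, n: zlib.compress(d, n),
    "bzip2": lambda d, n: bz2.compress(d, n),
    "lzma": lambda d, n: lzma.compress(d, preset=n),
}


def measure(data):
    """Yield (tool, level, seconds, ratio) for every compressor at levels 1-9."""
    for tool, compress in COMPRESSORS.items():
        for level in range(1, 10):
            start = time.perf_counter()
            out = compress(data, level)
            elapsed = time.perf_counter() - start
            yield tool, level, elapsed, len(out) / len(data)


if __name__ == "__main__":
    # "sample.dat" is a placeholder; substitute whichever dataset you publish.
    with open("sample.dat", "rb") as f:
        data = f.read()
    for tool, level, seconds, ratio in measure(data):
        print(f"{tool:11s} -{level}  {seconds:8.3f}s  ratio {ratio:.3f}")
```

Note that time.perf_counter() measures wall-clock time; swap in time.process_time() if CPU time is what you'd rather graph.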
