Elsa Gonsiorowski, PhD edited this page May 8, 2019 · 8 revisions

distributed bzip2

dbz2 is a tool that adapts the bzip2 file compression algorithm for distributed operation. It was originally developed by Ahana Roy Choudhury in 2017, updated by Adam Moody in early 2019, and released as part of v0.9.1.

Functionality

The archives created by dbz2 are very similar to those generated by bzip2, except that dbz2 appends some metadata at the end of the file recording the location of every compressed block in the file. This makes parallel decompression easier. It turns out that bunzip2 will successfully decompress a .dbz2 file, but it emits a non-fatal warning when it encounters the footer. If a user absolutely needs a pure .bz2 file, we could write a tool to truncate the footer off the file, or we could add an option to avoid writing the footer in the first place.

The dbz2 tool cannot decompress a pure .bz2 file.

Implementation

dbz2 has two work-balancing algorithms, both of which apply to compression and decompression:

  1. A static algorithm which uses a round-robin assignment of file blocks to MPI ranks;
  2. A dynamic algorithm which comes from libcircle.

As of v0.9.1, the static algorithm is the default.
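The round-robin idea behind the static algorithm can be illustrated with a minimal sketch. This is not dbz2's actual code: the function name `round_robin_blocks` and its parameters are hypothetical, and it assumes blocks are numbered from 0 and dealt out to ranks one at a time.

```python
def round_robin_blocks(rank, num_ranks, num_blocks):
    """Return the block indices one rank would handle.

    Hypothetical helper: block i goes to rank (i % num_ranks),
    so each rank takes every num_ranks-th block starting at its own rank.
    """
    return list(range(rank, num_blocks, num_ranks))


if __name__ == "__main__":
    # Example: 10 blocks dealt out across 4 ranks.
    for rank in range(4):
        print(rank, round_robin_blocks(rank, 4, 10))
```

Because the assignment depends only on the rank, the rank count, and the block count, every rank can compute its own share without communicating. A dynamic scheme such as the one from libcircle instead hands out work at run time, which can help when blocks take uneven amounts of time.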

Performance

Speed

The table below compares bzip2 with dbz2 at various scales when compressing a 4 GB file. These runs read from and write to a single Lustre stripe, and I think we're bound by file system bandwidth at this point.

| Tool  | Time (s) | Scale              |
|-------|----------|--------------------|
| bzip2 | 297      | serial             |
| dbz2  | 294      | 1 proc, 1 node     |
| dbz2  | 150      | 2 procs, 1 node    |
| dbz2  | 78       | 4 procs, 1 node    |
| dbz2  | 45       | 8 procs, 1 node    |
| dbz2  | 28       | 16 procs, 1 node   |
| dbz2  | 16       | 32 procs, 1 node   |
| dbz2  | 13       | 64 procs, 2 nodes  |
| dbz2  | 10       | 128 procs, 4 nodes |
| dbz2  | 9        | 256 procs, 8 nodes |
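To make the scaling trend easier to read, the timings above can be converted to speedup and parallel efficiency relative to serial bzip2. The sketch below only does arithmetic on the numbers from the table; the helper name `speedup_and_efficiency` is an illustrative choice, not part of dbz2.

```python
# Timings copied from the table above: (process count, seconds).
BZIP2_SERIAL_SECONDS = 297
DBZ2_TIMINGS = [
    (1, 294), (2, 150), (4, 78), (8, 45), (16, 28),
    (32, 16), (64, 13), (128, 10), (256, 9),
]


def speedup_and_efficiency(baseline_seconds, procs, seconds):
    """Return (speedup over the baseline, speedup divided by process count)."""
    speedup = baseline_seconds / seconds
    return speedup, speedup / procs


if __name__ == "__main__":
    for procs, seconds in DBZ2_TIMINGS:
        s, e = speedup_and_efficiency(BZIP2_SERIAL_SECONDS, procs, seconds)
        print(f"{procs:>4} procs: speedup {s:5.1f}x, efficiency {e:6.1%}")
```

Speedup reaches roughly 18.6x at 32 processes but only 33x at 256 processes, which is consistent with the run becoming limited by something other than compute at larger scales.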

Compression Ratio

For compression ratio on this example, we are in the same ballpark as a pure .bz2 file:

| Size (B)   | File         |
|------------|--------------|
| 249162312  | dbz2 archive |
| 248218592  | bz2 archive  |
| 4425172384 | original     |
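As a quick check on the "same ballpark" claim, the sizes above can be turned into compression ratios and a relative size difference. This is plain arithmetic on the table values; the function name `compression_ratio` is illustrative only.

```python
# Sizes in bytes, copied from the table above.
ORIGINAL_BYTES = 4425172384
DBZ2_BYTES = 249162312
BZ2_BYTES = 248218592


def compression_ratio(original_bytes, compressed_bytes):
    """Return how many times smaller the compressed file is than the original."""
    return original_bytes / compressed_bytes


# Absolute and relative size difference between the two archives.
extra_bytes = DBZ2_BYTES - BZ2_BYTES
extra_fraction = extra_bytes / BZ2_BYTES

if __name__ == "__main__":
    print(f"dbz2 ratio: {compression_ratio(ORIGINAL_BYTES, DBZ2_BYTES):.2f}")
    print(f"bz2 ratio:  {compression_ratio(ORIGINAL_BYTES, BZ2_BYTES):.2f}")
    print(f"dbz2 archive is {extra_bytes} bytes ({extra_fraction:.2%}) larger")
```

Both ratios come out a little under 18:1, and the dbz2 archive is under half a percent larger than the bz2 archive in this example. The difference should not be read as the footer size alone, since the two tools may not produce identical compressed blocks.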

Future work

Update the dbz2 file format

dbz2 makes slight modifications to the standard bzip2 file format by attaching some additional metadata at the end of the file. We could explore other options for the compressed block metadata, such as creating a second file for each compressed file. Alternatively, we could attach the metadata to the file as extended attributes, but that’s not really portable across file systems.

Optimization for Lustre

Some parts of the implementation could be better optimized to avoid contention on Lustre byte ranges when writing the file.
