Block Log Compression Dictionaries

This repo contains dictionaries used for compressing blocks. Since blocks are relatively small and tend to share a lot of similar data from block to block, custom compression dictionaries noticeably improve compression: in testing, we found a ~4.5% improvement in compression ratio using them, which at the current block log size saves about 25GB.
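For illustration, here is a minimal sketch of dictionary-based compression using the python-zstandard bindings (hived itself uses the zstd C API; the file names here are assumptions):

import zstandard

# Load a pre-computed dictionary (path is hypothetical).
with open("060M.dict", "rb") as f:
    dict_data = zstandard.ZstdCompressionDict(f.read())

with open("60000000.bin", "rb") as f:
    raw_block = f.read()

# Compress the raw block at level 15 using the dictionary.
cctx = zstandard.ZstdCompressor(level=15, dict_data=dict_data)
compressed = cctx.compress(raw_block)

# Decompression requires the same dictionary the block was compressed with.
dctx = zstandard.ZstdDecompressor(dict_data=dict_data)
assert dctx.decompress(compressed) == raw_block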

Since the contents of blocks (active accounts, popular operations, etc.) vary over time, we have a different pre-computed dictionary for each million blocks (1M blocks is roughly one month's worth).
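Assuming (as the paths and --dictID values below suggest) that dictionary N covers blocks N*1,000,000 through N*1,000,000 + 999,999, picking the right dictionary for a block is simple integer division:

def dictionary_number(block_num: int) -> int:
    # Assumption: dictionary N covers blocks [N*1,000,000, (N+1)*1,000,000),
    # e.g. --starting-block-number=60000000 pairs with --dictID=60 below.
    return block_num // 1_000_000

print(dictionary_number(60_123_456))  # -> 60, stored as 060M.dict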

These dictionaries are stored in a submodule to keep the main hive repo small. They currently consume about 9MB, but they are only useful for the hive mainnet: testnet builds and other chains that don't share mainnet's blocks get little benefit from them, so the testnet can be built without the submodule.

Procedure for generating dictionaries

It's assumed that:

  • existing dictionaries are never changed, so you'll always be able to decode your old block logs with newer builds of hived
  • each time a new major release of Hive is made, we'll generate new optimal dictionaries for the millions of blocks generated since the previous major release, and they will be added to this repository

Extract a million blocks

Use the compress_block_log utility to extract each block in the range you're interested in and store it in a separate file. Run something like:

rm -r /tmp/blockchain
compress_block_log --decompress --input-block-log=/storage1/datadir/blockchain --output-block-log=/tmp/blockchain --starting-block-number=60000000 --block-count=1000000 --dump-raw-blocks=/tmp/blocks

This will create one million files on your disk, each containing a single raw block. They'll be laid out with one top-level directory per million blocks; inside, the files are split into subdirectories of a hundred thousand each to be kind to the filesystem:

/tmp/blocks/60000000/60000000/60000000.bin
/tmp/blocks/60000000/60000000/60000001.bin
...
/tmp/blocks/60000000/60000000/60099999.bin
/tmp/blocks/60000000/60100000/60100000.bin
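The layout follows from simple integer arithmetic; as a sketch (the root directory is just the one used in the example above):

import os

def raw_block_path(block_num: int, root: str = "/tmp/blocks") -> str:
    # Top-level directory is the block's million; subdirectory is its hundred-thousand.
    million = block_num // 1_000_000 * 1_000_000
    hundred_k = block_num // 100_000 * 100_000
    return os.path.join(root, str(million), str(hundred_k), f"{block_num}.bin")

print(raw_block_path(60_100_000))  # /tmp/blocks/60000000/60100000/60100000.bin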

Take a random sample of blocks

The tool for creating dictionaries can only process up to 2GB of input, so it's not possible to feed it a full million blocks. We'll randomly choose blocks that add up to 2GB total, and create a dictionary based on those.

utils/sample_blocks.py /tmp/blocks/60000000 /tmp/blocks-randomsample/60000000 2147483648
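As a rough sketch of the idea (a hypothetical reimplementation, not the actual script): walk the dumped blocks, shuffle them, and copy files into the sample directory until the byte budget is exhausted.

import os
import random
import shutil
import sys

def sample_blocks(src_dir: str, dst_dir: str, byte_budget: int) -> int:
    # Gather every dumped block file under src_dir.
    paths = [os.path.join(root, name)
             for root, _dirs, names in os.walk(src_dir)
             for name in names]
    random.shuffle(paths)
    os.makedirs(dst_dir, exist_ok=True)
    total = 0
    for path in paths:
        size = os.path.getsize(path)
        if total + size > byte_budget:
            continue  # over budget; smaller files later in the shuffle may still fit
        shutil.copy(path, dst_dir)
        total += size
    return total

if __name__ == "__main__":
    sample_blocks(sys.argv[1], sys.argv[2], int(sys.argv[3]))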

Compute a dictionary from the sample

We've settled on using 220K dictionaries (--maxdict=225280, i.e. 220 × 1024 bytes) optimized for compression level 15, generated thus:

zstd -T0 -r --train-fastcover /tmp/blocks-randomsample/60000000 -15 --maxdict=225280 -o /tmp/dictionaries/220K/060M.dict --dictID=60
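To sanity-check the result, note that the dictionary ID is embedded in the file header: per the zstd format, a dictionary begins with the magic number 0xEC30A437 followed by the dictionary ID as a 4-byte little-endian integer. For example:

import struct

with open("/tmp/dictionaries/220K/060M.dict", "rb") as f:
    magic, dict_id = struct.unpack("<II", f.read(8))

assert magic == 0xEC30A437  # zstd dictionary magic number
print("dictionary ID:", dict_id)  # expect 60 for 060M.dict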

Compress the dictionary

zstd -v --ultra -22 /tmp/dictionaries/220K/060M.dict
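By default this writes the compressed dictionary next to the input (here /tmp/dictionaries/220K/060M.dict.zst) and leaves the original in place; --ultra is needed to unlock compression levels above 19.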

Update CMakeLists.txt

Update CMakeLists.txt so its FOREACH range includes the number of the new dictionary you just added:

FOREACH(DICTIONARY_NUM RANGE 60)
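(FOREACH with RANGE 60 iterates DICTIONARY_NUM from 0 through 60 inclusive, so the range's end should be the number of the newest dictionary.)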
