Generating a corpus
The first step is to get a CommonCrawl dump that will be used as a source for the corpus generation pipeline.
- Choose a CommonCrawl release on the CommonCrawl blog; release posts are usually named "MONTH YEAR crawl archive now available".
- On the release post, download the `wet.paths.gz` file. It contains the paths to the shards; Ungoliant internally contains the base URL, which will be added at runtime.
- Decompress it: `gzip -d wet.paths.gz`, which yields `wet.paths`.
- Run `ungoliant download wet.paths`. This should take a while depending on your connection (days). Be sure to have the required disk space (~8 TB). Note: if you need a smaller CC dump, keep only some lines of the `wet.paths` file, e.g. `head -n 200 wet.paths > wet_tiny.paths`.
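Put together, the download steps look like the sketch below. The release identifier `CC-MAIN-2022-05` and the `data.commoncrawl.org` URL are assumptions for illustration; substitute the path given in the release post you chose.

```bash
# Fetch the shard path list for the chosen release
# (CC-MAIN-2022-05 is an assumed example identifier).
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/wet.paths.gz

# Decompress to get the plain-text path list.
gzip -d wet.paths.gz          # produces wet.paths

# Optional: keep only 200 shards for a smaller test run.
head -n 200 wet.paths > wet_tiny.paths

# Download the shards (Ungoliant adds the base URL itself).
ungoliant download wet.paths
```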
Auxiliary resources include the model for language identification (mandatory) and the UT1 blocklist for adult content annotation.
- Fetch fastText's language identification binary, named `lid.176.bin`.
- (Optional) Fetch a copy of the UT1 Blocklist and decompress it.
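A minimal fetch sketch, assuming the download locations documented by fastText and UT1 at the time of writing (verify both URLs against the upstream pages before relying on them):

```bash
# Language identification model (mandatory).
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

# UT1 blocklist (optional, used for adult content annotation).
wget https://dsi.ut-capitole.fr/blacklists/download/blacklists.tar.gz
tar -xzf blacklists.tar.gz    # yields a blacklists/ folder
```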
Running the pipeline on 64,000 shards takes between 2 and 3 days on a 64-thread, 180 GB RAM HPC machine. As of March 8, 2022, there is no progress indicator.
- Run `ungoliant pipeline <src> <dst> --lid-path <path_to_lid> --blocklist-path <path_to_blocklist_folder>`. Note: the blocklist folder to pass is not `adult`, but rather `blacklists`.
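For example, with the files laid out as in the previous steps (all folder names here are illustrative):

```bash
# <src> is the folder holding the downloaded WET shards,
# <dst> the output folder for the generated corpus.
ungoliant pipeline wet_shards/ dst/ \
    --lid-path lid.176.bin \
    --blocklist-path blacklists/
```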
Now you should have a corpus generated in the `dst` folder! The corpus itself should be comprised of `.jsonl` files of various sizes, from a few kilobytes to 2-3 TB for the biggest languages. However, sharing these files "as-is" is complicated (no compression, no splitting).
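To take a quick look at a record, something like the following works (`en.jsonl` is an assumed filename; use whichever language file your run produced, and note that `jq` must be installed):

```bash
# Pretty-print the first document of one language file.
head -n 1 dst/en.jsonl | jq .
```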
The `oscar-tools` crate contains splitting and compression operations that should help with distributing the corpus.
Note: as of now, `oscar-tools` is not yet ready for production and is not published on crates.io.
- Get the tool: `cargo install --git https://github.com/oscar-corpus/oscar-tools.git` (since the crate is not on crates.io, it has to be installed from the Git repository). The full split/compress/checksum sequence is sketched after this list.
- Split the corpus into smaller, easily shareable splits: `oscar-tools v2 split path_to_corpus/ destination/`. The default split size is `500MB`; use `oscar-tools v2 split --help` to see the available options. Be aware that this copies the entire corpus, so it takes the same amount of disk space again. It should take at most a few hours.
- Compress the generated chunks: `oscar-tools v2 compress path_to_splitted/ destination/`. Use `--del-src` to delete source files as they are compressed. You might need to specify `-J` manually to limit the number of threads, since compression uses a lot of RAM.
- Generate checksum files: `oscar-tools v2 checksum path_to_compressed`. The same warning about `-J` applies.
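Put end to end, the distribution steps look like the sketch below. Folder names are illustrative, `-J 4` is an assumed thread count to tune against your available RAM, and the exact flag placement may differ; check each subcommand's `--help`.

```bash
# Install oscar-tools from its Git repository (not on crates.io yet).
cargo install --git https://github.com/oscar-corpus/oscar-tools.git

# 1. Split into chunks (default 500MB); copies the corpus, so it
#    needs as much free disk space as the corpus itself.
oscar-tools v2 split dst/ splits/

# 2. Compress the chunks, deleting sources as they are compressed;
#    limit threads to bound RAM usage.
oscar-tools v2 compress --del-src -J 4 splits/ compressed/

# 3. Write per-language checksum files for the compressed splits.
oscar-tools v2 checksum -J 4 compressed/
```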
You should get a folder containing one subfolder per language. Each language folder should contain the compressed splits along with a `lang_sha256.sum` file that is used to check split integrity. Use it with `sha256sum -c lang_sha256.sum`.
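To verify every language at once, a small loop like this one works (it assumes the layout described above, with one `*_sha256.sum` file per language folder):

```bash
# Check the integrity of the splits in every language folder.
for lang_dir in compressed/*/; do
    (cd "$lang_dir" && sha256sum -c ./*_sha256.sum)
done
```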