
Generating a corpus


Gather a CommonCrawl dump

The first step is to get a CommonCrawl dump that will be used as a source for the corpus generation pipeline.

  1. Choose a CommonCrawl release from the CommonCrawl website; announcement posts are usually titled *MONTH YEAR crawl archive now available*.

  2. On the release post, download the `wet.paths.gz` file. It contains the paths to the shards; Ungoliant ships with the base URL internally and prepends it at runtime.

  3. Decompress it: `gzip -d wet.paths.gz`, which yields `wet.paths`.

  4. Run `ungoliant download wet.paths`. Depending on your connection, this can take a while (days). Be sure to have the required disk space (~8 TB). The whole sequence is sketched after this list.

    Note: If you need a smaller CC dump, keep only some lines of the `wet.paths` file, e.g. `head -n 200 wet.paths > wet_tiny.paths`.
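
Putting the steps together, a minimal sketch. The crawl ID `CC-MAIN-2022-05` is only an example; substitute the ID of the release you chose.

```bash
# Fetch the shard list for the chosen release (example crawl ID).
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-05/wet.paths.gz
gzip -d wet.paths.gz                     # yields wet.paths

# Optional: shrink the dump for a test run.
head -n 200 wet.paths > wet_tiny.paths

# Download the shards (can take days; ~8 TB of disk needed).
ungoliant download wet.paths
```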

Gather auxiliary resources

Auxiliary resources include the model for language identification (mandatory) and the UT1 blocklist for adult content annotation.

  1. Fetch fastText's language identification model, named `lid.176.bin`, from the fastText website.
  2. (Optional) Fetch a copy of the UT1 blocklist and decompress it. A sketch of both downloads follows this list.
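
A minimal sketch; both URLs point at where these resources were published when this page was written, so double-check them before relying on the commands.

```bash
# Language identification model (mandatory).
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

# UT1 blocklist (optional, for adult content annotation).
wget https://dsi.ut-capitole.fr/blacklists/download/blacklists.tar.gz
tar -xzf blacklists.tar.gz               # yields a blacklists/ folder
```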

Run the pipeline

Running the pipeline on 64,000 shards takes between 2 and 3 days on a 64-thread, 180 GB RAM HPC node. As of March 8, 2022, there is no progress indicator.

  1. Run `ungoliant pipeline <src> <dst> --lid-path <path_to_lid> --blocklist-path <path_to_blocklist_folder>` (a concrete example follows this list).

    Note: the blocklist path must point to the decompressed `blacklists` folder itself, not to a category subfolder such as `adult`.
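
For instance, with the resources gathered above (the directory names are illustrative):

```bash
# wet_shards/ holds the downloaded shards, corpus/ is the output directory.
ungoliant pipeline wet_shards/ corpus/ \
    --lid-path lid.176.bin \
    --blocklist-path blacklists/
```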

You should now have a corpus generated in the `dst` folder!

Prepare for distribution

The corpus itself consists of `.jsonl` files of various sizes, from a few kilobytes up to 2–3 TB for the biggest languages. However, sharing these files as-is is impractical: they are neither compressed nor split.

The oscar-tools utility provides the splitting and compression operations needed to distribute the corpus. Note: as of now, oscar-tools is not yet production-ready and is not published on crates.io.

  1. Get the tool: `cargo install --git https://github.com/oscar-corpus/oscar-tools.git`.
  2. Split the corpus into smaller, easily shareable splits: `oscar-tools v2 split path_to_corpus/ destination/`. The default split size is 500 MB; use `oscar-tools v2 split --help` to see the available options. Be aware that this copies the entire corpus, so it takes the same amount of disk space again. It should finish within a few hours.
  3. Compress the generated splits: `oscar-tools v2 compress path_to_splits/ destination/`. Use `--del-src` to delete source files as they are compressed. You might need to set `-J` manually to limit the number of threads, since compression uses a lot of RAM.
  4. Generate checksum files: `oscar-tools v2 checksum path_to_compressed`. The same warning about `-J` applies. The whole sequence is sketched after this list.
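
A possible end-to-end run. Directory names are illustrative, and the flag placement and thread-count argument to `-J` are assumptions; check `--help` first.

```bash
# Split, compress, and checksum in sequence; -J 8 caps threads at 8.
oscar-tools v2 split corpus/ splits/
oscar-tools v2 compress --del-src -J 8 splits/ compressed/
oscar-tools v2 checksum -J 8 compressed/
```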

You should end up with one folder per language. Each language folder should contain the compressed splits along with a `lang_sha256.sum` file used to check the integrity of the splits: `sha256sum -c lang_sha256.sum`. A sketch for verifying every language at once follows.
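
A minimal verification loop, assuming one `*_sha256.sum` file per language folder as described above:

```bash
# Run sha256sum -c inside each language folder.
for f in compressed/*/*_sha256.sum; do
    (cd "$(dirname "$f")" && sha256sum -c "$(basename "$f")")
done
```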
