hplt-project/cc-download

Helper scripts for downloading a list of Common Crawl crawls

Installation

Install aria2 and make sure the aria2c executable is in your PATH.
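A quick way to confirm the PATH requirement is a small check like the one below. This is only a sketch; the helper name have_cmd is made up for illustration and is not part of this repository.

```shell
# have_cmd NAME: succeed if NAME resolves to a command on PATH.
# (have_cmd is a hypothetical helper, not shipped with these scripts.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd aria2c; then
    echo "aria2c found: $(command -v aria2c)"
else
    echo "aria2c not found in PATH; install aria2 first" >&2
fi
```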

Usage

Download files containing lists of URLs for all crawls from cc_list.txt:

./get_url_lists.sh

Select the file lists you want to download files from. For instance, we sampled 10 random CC crawls to download:

ls warcpaths/ | shuf -n 10 > cc_paths_shuffled_first10.txt
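Before generating download tasks, it can be worth checking that the selection file looks right. The sketch below assumes each line of the selection file is a file name inside warcpaths/, which is what the ls command above produces. The function name check_selection is hypothetical.

```shell
# check_selection LIST DIR: verify that every name in LIST exists in DIR,
# then print how many entries were checked. Returns 1 on the first miss.
# (check_selection is a hypothetical helper, not part of this repository.)
check_selection() {
    list=$1
    dir=$2
    count=0
    while IFS= read -r name; do
        [ -n "$name" ] || continue          # skip blank lines
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            return 1
        fi
        count=$((count + 1))
    done < "$list"
    echo "$count"
}

# Example call, once both the list and the directory exist:
#   check_selection cc_paths_shuffled_first10.txt warcpaths
```

For the sample above, the call should print 10 if all selected lists are present.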

Generate download tasks for Aria2:

./generate_aria_input.sh cc_paths_shuffled_first10.txt >first10_warcs.lst
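For orientation, an Aria2 input file lists one URI per line, optionally followed by indented option lines such as dir= and out=. The sketch below shows how such entries could be produced from relative WARC paths. It is not the contents of generate_aria_input.sh: the function name paths_to_aria_input, the plain-text path input, and the use of https://data.commoncrawl.org as the download host are all assumptions made for illustration.

```shell
# Assumed download host for Common Crawl data; adjust if the scripts use another.
BASE_URL="https://data.commoncrawl.org"

# paths_to_aria_input: read relative WARC paths on stdin and print
# Aria2 input-file entries (URI line, then indented option lines).
# (Hypothetical helper; the repository's own script may differ.)
paths_to_aria_input() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue                    # skip blank lines
        printf '%s/%s\n' "$BASE_URL" "$path"          # URI line
        printf '  dir=%s\n' "$(dirname "$path")"      # keep the crawl's directory layout
        printf '  out=%s\n' "$(basename "$path")"     # output file name
    done
}
```

Keeping dir= and out= per entry means files from different crawls do not collide in one directory.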

Run Aria2:

./run_aria.sh first10_warcs.lst

When it finishes or is interrupted, Aria2 saves any unfinished downloads back to first10_warcs.lst. Rerun the last command to resume downloading the unfinished files.
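To see how much work is left before rerunning, the remaining entries in the list can be counted. This sketch assumes the saved list follows Aria2's input-file layout, where URI lines start in the first column, option lines are indented, and lines starting with # are comments. The function name count_pending is hypothetical.

```shell
# count_pending FILE: print the number of URI lines in an Aria2 input file.
# Indented option lines, comment lines, and blank lines are ignored.
# (count_pending is a hypothetical helper, not part of this repository.)
count_pending() {
    awk '!/^[ \t]/ && !/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Example call, once the list exists:
#   count_pending first10_warcs.lst
```

A result of 0 would suggest there is nothing left to download, under the layout assumption above.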
