This repository has been archived by the owner on May 4, 2021. It is now read-only.

Script candidates2corpus.py needs days to run for large language pairs #6

Open
achimr opened this issue Dec 1, 2016 · 4 comments

@achimr
Contributor

achimr commented Dec 1, 2016

For large language pairs with about 1.2 million candidate pairs this script takes days to run. While in this case 2.4 million web pages get downloaded and processed, it would still be useful to determine where the bottleneck lies among the following steps (a rough sketch of the pipeline follows the list):

  1. the downloading
  2. the extraction of the candidate text from HTML
  3. the text processing (including the external text processor)
  4. the saving of the text in Base64 encoding
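
For reference, a minimal sketch of the four steps for a single candidate page - not the repository's actual code; the plain-HTTP download, the crude HTML extraction and the splitter command are assumptions for illustration:

```python
import base64
import subprocess
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Very crude HTML-to-text extraction, standing in for the real extractor."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def process_candidate(url, splitter_cmd):
    # 1. downloading (the real script fetches the pages from CommonCrawl WARC files)
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # 2. extraction of the candidate text from the HTML
    extractor = _TextExtractor()
    extractor.feed(html)
    text = "\n".join(extractor.chunks)
    # 3. text processing with the external splitter, e.g. split-sentences.perl
    split = subprocess.run(splitter_cmd, input=text.encode("utf-8"),
                           stdout=subprocess.PIPE, check=True)
    # 4. saving the text in Base64 encoding (one line per page)
    return base64.b64encode(split.stdout).decode("ascii")
```

Here splitter_cmd would be something like ["/scripts/ems/support/split-sentences.perl", "-l", "en", "-b", "-q"], matching the example command line below.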

Example command line:

nohup cat candidates.en-es.locations | ~/DataCollection/baseline/candidates2corpus.py -source_splitter='/scripts/ems/support/split-sentences.perl -l en -b -q' -target_splitter='/scripts/ems/support/split-sentences.perl -l es -b -q'  2> candidates2corpus.log > en-es.down &

Profile the code with tens to hundreds of candidate pairs.
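
One way to do this, assuming the script was run once on a small sample under cProfile with the output written to a file (e.g. via python -m cProfile -o candidates2corpus.prof ...; the file name is only an example), is to inspect the dump with pstats:

```python
# Inspect a cProfile dump of a small profiling run.
import pstats

stats = pstats.Stats("candidates2corpus.prof")
# Sort by cumulative time and print the 100 most expensive routines,
# which is enough to see how the steps above split the total runtime.
stats.sort_stats("cumulative").print_stats(100)
```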

@achimr achimr self-assigned this Dec 1, 2016
@achimr
Contributor Author

achimr commented Dec 21, 2016

Ran the Python profiler cProfile on the first 100 candidates from the 2015_32 en_es data collection. These are the percentages of the cumulative time for the steps above:

  1. 54% for downloading
  2. 33% for extracting the candidate text from HTML
  3. 10% for text processing/tokenization
  4. <1% for saving the text in Base64 encoding - this doesn't register in the top 100 routines sorted by time

This was run in the AWS us-east-1 region where the CommonCrawl data is located as well.

So downloading the content does take the majority of the time; however, about 44% is spent extracting the text from HTML and tokenizing it. Some avenues to investigate:

  • running the code in parallel (e.g. with GNU parallel) - determining at what point the network connections get saturated
  • separating the downloading from the extraction/processing - this would also offer the flexibility to change these two parts independently
  • downloading the HTML from the meta-data service instead of the CommonCrawl WARC files

@achimr
Contributor Author

achimr commented Jan 5, 2017

After some investigation: running the code in parallel with GNU parallel doesn't work because each input record (page pair) is spread over two lines of the input file - GNU parallel can split these lines across different input blocks, and the separated records can then no longer be downloaded/aligned. So separating the downloading from the extraction/processing and making both parallelizable seems to be the best avenue. It would also separate the network-bound downloads from the CPU-bound processing, which could then be optimized separately.
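
For illustration, a small sketch that works around the block-splitting problem by joining each two-line record onto one tab-separated line before parallelization; it assumes the two-line record format described above, and the file names are placeholders:

```python
# Join every two consecutive lines of the candidates file into one
# tab-separated line, so block-splitting tools see whole records.
with open("candidates.en-es.locations") as src, \
        open("candidates.en-es.joined", "w") as dst:
    while True:
        first = src.readline()
        second = src.readline()
        if not first or not second:   # end of file (or a dangling odd line)
            break
        dst.write(first.rstrip("\n") + "\t" + second)
```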

BTW - the last avenue described above, downloading the HTML from the meta-data service, is not advisable, as it would make the downloading of parallel corpora dependent on the availability of the meta-data service.

@achimr
Contributor Author

achimr commented Aug 18, 2017

It also seems unnecessary to extract text from the HTML in the WARC files, as the plain text is already available in the WET files: http://commoncrawl.org/the-data/get-started/
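
For illustration, reading the pre-extracted plain text from a locally downloaded WET segment could look like the sketch below; warcio is not a dependency of this repository and the file name is only an example:

```python
# Read plain-text records from a gzipped CommonCrawl WET file.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-2015-32-example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":   # WET text records have type 'conversion'
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        # 'text' is the plain text that would otherwise have to be
        # extracted from the corresponding HTML record in the WARC file.
```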

@achimr
Contributor Author

achimr commented Nov 28, 2017

Investigated options to enable parallel downloading:

  1. Separate the downloading from the extraction/processing as described above and then parallelize with GNU parallel
  2. Use the aiohttp Python module to enable parallel downloading with concurrent asynchronous requests (a rough sketch follows this list); the actual downloading is embedded deep in ccdownloader.py, so side effects need to be checked
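
As a rough illustration of option 2 - not how ccdownloader.py currently works; the URLs and the concurrency limit are placeholders - concurrent downloads with aiohttp could look like this:

```python
# Download a list of URLs concurrently, limited by a semaphore.
import asyncio
import aiohttp


async def fetch(session, semaphore, url):
    async with semaphore:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return url, await resp.read()


async def fetch_all(urls, limit=20):
    semaphore = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)


# Example usage:
# results = asyncio.run(fetch_all(["http://example.com/a", "http://example.com/b"]))
```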
