This repository has been archived by the owner on May 4, 2021. It is now read-only.

Script candidates2corpus.py needs days to run for large language pairs #6

Open
achimr opened this issue Dec 1, 2016 · 4 comments

@achimr
Contributor

achimr commented Dec 1, 2016

For large language pairs with about 1.2 million candidate pairs this script takes days to run. While in this case 2.4 million web pages get downloaded and processed, it would still be useful to determine where the bottleneck lies among the following steps (a rough sketch of the pipeline follows the list):

  1. the downloading
  2. the extraction of the candidate text from HTML
  3. the text processing (including the external text processor)
  4. the saving of the text in Base64 encoding
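
For reference, a minimal sketch of the four steps for a single candidate page - not the repository's actual code; the plain-HTTP download, the crude HTML extraction and the splitter command are assumptions for illustration:

```python
import base64
import subprocess
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Very crude HTML-to-text extraction, standing in for the real extractor."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def process_candidate(url, splitter_cmd):
    # 1. downloading (the real script fetches the pages from CommonCrawl WARC files)
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # 2. extraction of the candidate text from the HTML
    extractor = _TextExtractor()
    extractor.feed(html)
    text = "\n".join(extractor.chunks)
    # 3. text processing with the external splitter, e.g. split-sentences.perl
    split = subprocess.run(splitter_cmd, input=text.encode("utf-8"),
                           stdout=subprocess.PIPE, check=True)
    # 4. saving the text in Base64 encoding (one line per page)
    return base64.b64encode(split.stdout).decode("ascii")
```

Here splitter_cmd would be something like ["/scripts/ems/support/split-sentences.perl", "-l", "en", "-b", "-q"], matching the example command line below.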

Example command line:

nohup cat candidates.en-es.locations | ~/DataCollection/baseline/candidates2corpus.py -source_splitter='/scripts/ems/support/split-sentences.perl -l en -b -q' -target_splitter='/scripts/ems/support/split-sentences.perl -l es -b -q'  2> candidates2corpus.log > en-es.down &

Profile the code with tens to hundreds of candidate pairs.
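
One way to do this, assuming the script was run once on a small sample under cProfile with the output written to a file (e.g. via python -m cProfile -o candidates2corpus.prof ...; the file name is only an example), is to inspect the dump with pstats:

```python
# Inspect a cProfile dump of a small profiling run.
import pstats

stats = pstats.Stats("candidates2corpus.prof")
# Sort by cumulative time and print the 100 most expensive routines,
# which is enough to see how the steps above split the total runtime.
stats.sort_stats("cumulative").print_stats(100)
```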

@achimr achimr self-assigned this Dec 1, 2016
@achimr
Contributor Author

achimr commented Dec 21, 2016

Ran the Python profiler cProfile on the first 100 candidates from the 2015_32 en_es data collection. These are the percentages of the cumulative time for the steps above:

  1. 54% for downloading
  2. 33% for extracting the candidate text from HTML
  3. 10% for text processing/tokenization
  4. <1% for saving the text in Base64 encoding - this doesn't register in the top 100 routines sorted by time

This was run in the AWS us-east-1 region where the CommonCrawl data is located as well.

So downloading the content does take the majority of the time; however, about 44% is spent extracting the text from HTML and tokenizing it. Some avenues to investigate:

  • running the code in parallel (e.g. with GNU parallel) - determining at what point the network connections get saturated
  • separating the downloading from the extraction/processing - this would also offer the flexibility to change these two parts independently
  • downloading the HTML from the meta-data service instead of the CommonCrawl WARC files

@achimr
Contributor Author

achimr commented Jan 5, 2017

After some investigation: running the code in parallel with GNU parallel doesn't work because each input record (page pair) is spread over two lines of the input file - GNU parallel can split these lines across different input blocks, and the separated records can then no longer be downloaded/aligned. So separating the downloading from the extraction/processing and making both parallelizable seems to be the best avenue. It would also separate the network-bound downloads from the CPU-bound processing, which could then be optimized separately.
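
For illustration, a small sketch that works around the block-splitting problem by joining each two-line record onto one tab-separated line before parallelization; it assumes the two-line record format described above, and the file names are placeholders:

```python
# Join every two consecutive lines of the candidates file into one
# tab-separated line, so block-splitting tools see whole records.
with open("candidates.en-es.locations") as src, \
        open("candidates.en-es.joined", "w") as dst:
    while True:
        first = src.readline()
        second = src.readline()
        if not first or not second:   # end of file (or a dangling odd line)
            break
        dst.write(first.rstrip("\n") + "\t" + second)
```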

BTW - the last avenue described above, downloading the HTML from the meta-data service, is not advisable, as it would make the downloading of parallel corpora dependent on the availability of the meta-data service.

@achimr
Contributor Author

achimr commented Aug 18, 2017

It also seems unnecessary to extract text from the HTML in the WARC files, as the plain text is already available in the WET files: http://commoncrawl.org/the-data/get-started/
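
For illustration, reading the pre-extracted plain text from a locally downloaded WET segment could look like the sketch below; warcio is not a dependency of this repository and the file name is only an example:

```python
# Read plain-text records from a gzipped CommonCrawl WET file.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-2015-32-example.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":   # WET text records have type 'conversion'
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        # 'text' is the plain text that would otherwise have to be
        # extracted from the corresponding HTML record in the WARC file.
```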

@achimr
Contributor Author

achimr commented Nov 28, 2017

Investigated options to enable parallel downloading:

  1. Separate the downloading from the extraction/processing as described above and then parallelize with GNU parallel
  2. Use the aiohttp Python module to enable parallel downloading with concurrent asynchronous requests (a rough sketch follows this list); the actual downloading is embedded deep in ccdownloader.py, so side effects need to be checked
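
As a rough illustration of option 2 - not how ccdownloader.py currently works; the URLs and the concurrency limit are placeholders - concurrent downloads with aiohttp could look like this:

```python
# Download a list of URLs concurrently, limited by a semaphore.
import asyncio
import aiohttp


async def fetch(session, semaphore, url):
    async with semaphore:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return url, await resp.read()


async def fetch_all(urls, limit=20):
    semaphore = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)


# Example usage:
# results = asyncio.run(fetch_all(["http://example.com/a", "http://example.com/b"]))
```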
