For large language pairs with about 1.2 million candidate pairs, this script takes days to run. Although roughly 2.4 million web pages get downloaded and processed in this case, it would still be useful to determine where the bottleneck lies (a timing sketch follows the list below):
the downloading
the extraction of the candidate text from the HTML
the text processing (including the external text processor)
the saving of the text in Base64 encoding
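As a quick first check before full profiling, each of these steps could be timed with plain wall-clock instrumentation. This is only a sketch; the stage functions named below are hypothetical stand-ins for the corresponding code in candidates2corpus.py.

```python
# Rough per-stage wall-clock timing; download_pair, extract_text,
# process_text and encode_base64 are hypothetical stand-ins for the
# corresponding steps inside candidates2corpus.py.
import time
from collections import defaultdict

stage_totals = defaultdict(float)

def timed(stage, func, *args, **kwargs):
    """Run one pipeline stage and accumulate its wall-clock time."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    stage_totals[stage] += time.perf_counter() - start
    return result

# Inside the per-candidate loop (all names are placeholders):
#   html   = timed("download", download_pair, location)
#   text   = timed("extract", extract_text, html)
#   tokens = timed("process", process_text, text)
#   record = timed("encode", encode_base64, tokens)
#
# After the loop, report each stage's share of the total:
#   total = sum(stage_totals.values())
#   for name, secs in sorted(stage_totals.items(), key=lambda kv: -kv[1]):
#       print(f"{name}: {secs:.1f}s ({100 * secs / total:.0f}%)")
```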
Example command line:
nohup cat candidates.en-es.locations | ~/DataCollection/baseline/candidates2corpus.py -source_splitter='/scripts/ems/support/split-sentences.perl -l en -b -q' -target_splitter='/scripts/ems/support/split-sentences.perl -l es -b -q' 2> candidates2corpus.log > en-es.down &
Profile the code with 10s to 100s of candidate pairs.
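One way to do this, as a sketch: run the script under cProfile on a small sample and inspect the dump with pstats. The file names below are placeholders, the splitter arguments are the same as in the command line above, and the dump would be produced by running candidates2corpus.py under `python -m cProfile -o profile.out`.

```python
# Minimal sketch: inspect a cProfile dump from a small run. The dump could be
# produced by running candidates2corpus.py under "python -m cProfile -o profile.out"
# with the same splitter arguments as above, fed with the first ~200 lines of the
# locations file (about 100 candidate pairs, since each record spans two lines).
# "profile.out" is a placeholder name.
import pstats

stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(100)  # top 100 routines by cumulative time
```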
Ran the Python profiler cProfile on the first 100 candidates from the 2015_32 en_es data collection. These are the percentages of cumulative time for the steps above:
54% for downloading
33% extraction of candidate text from HTML
10% text processing/tokenization
<1% saving the text in Base64 encoding - it doesn't register in the top 100 routines sorted by time
This was run in the AWS us-east-1 region where the CommonCrawl data is located as well.
So downloading the content does take the majority of the time; however, about 44% is spent extracting the text from the HTML and tokenizing it. Some avenues to investigate:
running the code in parallel (e.g. with GNU parallel) - and checking at what point the network connections get saturated
separating the downloading from the extraction/processing - this would also offer the flexibility to change these two parts independently
downloading the HTML from the meta-data service instead of the CommonCrawl WARC files
After some investigation: running the code in parallel with GNU parallel doesn't work as-is, because the input file stores each record (page pair) on two lines; GNU parallel can split these records across different input blocks, and a torn record can no longer be downloaded and aligned. So separating the downloading from the extraction/processing and making both parallelizable seems to be the best avenue (see the sketch below for splitting the input without tearing records). This also separates the network-bound load from the CPU-bound processing, so each can be optimized independently.
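A minimal sketch of that splitting step, keeping each two-line record intact so the chunks can be handed to separate downloader/processor jobs; the chunk size and file naming below are placeholders.

```python
# Minimal sketch of splitting a locations file into chunks while keeping the
# two-line candidate records intact, so each chunk can be fed to a separate
# downloader/processor; chunk size and file names are placeholders.
import sys

def split_locations(path, records_per_chunk=10000):
    """Write chunk files that always contain whole two-line records."""
    with open(path) as infile:
        chunk_index, written = 0, 0
        outfile = open(f"{path}.chunk{chunk_index:03d}", "w")
        while True:
            first = infile.readline()
            second = infile.readline()
            if not first or not second:
                break  # end of input (a trailing unpaired line is dropped)
            if written == records_per_chunk:
                outfile.close()
                chunk_index += 1
                written = 0
                outfile = open(f"{path}.chunk{chunk_index:03d}", "w")
            outfile.write(first)
            outfile.write(second)
            written += 1
        outfile.close()

if __name__ == "__main__":
    split_locations(sys.argv[1])
```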
BTW - the last avenue described above to download HTML from the meta-data service is not advisable as it would make the downloading of parallel corpora dependent on the availability of the meta-data service.
Investigated options to enable parallel downloading:
Separate downloading from extraction/processing as described above and then parallelize with GNU parallel
Use the aiohttp Python module to run many downloads concurrently (see the sketch below); the actual downloading is embedded deep in ccdownloader.py, so side effects need to be checked
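For the second option, a rough sketch of concurrent downloading with aiohttp, assuming a flat list of URLs and ignoring how ccdownloader.py organizes this internally; the concurrency limit is a guess to be tuned against network saturation.

```python
# Rough sketch of concurrent downloads with aiohttp and a concurrency limit;
# the URL list and the concurrency limit are placeholders, and this does not
# reflect how ccdownloader.py is structured internally.
import asyncio
import aiohttp

MAX_CONCURRENT = 20  # assumption: tune until the network connection saturates

async def fetch(session, semaphore, url):
    """Download one page, limited by the shared semaphore."""
    async with semaphore:
        async with session.get(url) as response:
            return url, await response.read()

async def download_all(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)

# Example use:
#   pages = asyncio.run(download_all(["https://example.com/a", "https://example.com/b"]))
```

Note that aiohttp schedules the requests on a single asyncio event loop rather than in multiple threads, which is usually sufficient for an I/O-bound step like this one.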