I tried to extract the aligned sentence pairs from CCMatrix, which I had previously downloaded using opus_express. The command I used was:
opus_read --source en --target fi --directory CCMatrix --preprocess xml --leave_non_alignments_out --write_mode moses --write CCMatrix.raw.en CCMatrix.raw.fi --write_ids CCMatrix.raw.ids
The command ran for several days at 100% CPU without producing any output. Perhaps expat is choking on some error in the data. To rule out package corruption during download, I let opus_read download the corpus again; it hung in the same way.
Traceback when killed:
  File "/home/stiggronroos/venvs/opustools/bin/opus_read", line 135, in <module>
    OpusRead(**vars(args)).printPairs()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/opus_read.py", line 214, in printPairs
    self.alignmentParser.collect_links()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/alignment_parser.py", line 107, in collect_links
    blocks = self.bp.get_complete_blocks()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/block_parser.py", line 98, in get_complete_blocks
    self.parse_line(line)
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/block_parser.py", line 82, in parse_line
    self.p.Parse(line)
KeyboardInterrupt
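The traceback ends inside expat's `Parse`, so one way to test the "expat is choking" guess independently of opus_read might be to feed the alignment file to expat directly and watch whether it makes progress. The following is only a rough sketch, not part of OpusTools: the helper name `count_links` and its parameters are made up for illustration, and it assumes the alignment file uses `<link>` elements for sentence alignments.

```python
import io
import xml.parsers.expat


def count_links(stream, chunk_size=1 << 20, report_every=1_000_000):
    """Feed a binary XML stream to expat in chunks and count <link> elements.

    Hypothetical diagnostic helper, not part of OpusTools. Malformed XML
    raises xml.parsers.expat.ExpatError, which carries the line and offset.
    """
    parser = xml.parsers.expat.ParserCreate()
    counts = {"link": 0}

    def start_element(name, attrs):
        # Assumption: each sentence alignment is a <link> element.
        if name == "link":
            counts["link"] += 1
            if counts["link"] % report_every == 0:
                # Periodic progress output shows whether parsing advances.
                print(f"{counts['link']} links parsed", flush=True)

    parser.StartElementHandler = start_element
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)
    parser.Parse(b"", True)  # signal end of input so expat checks well-formedness
    return counts["link"]


if __name__ == "__main__":
    # Tiny in-memory sample standing in for the real alignment file.
    sample = (
        b'<cesAlign><linkGrp fromDoc="a" toDoc="b">'
        b'<link xtargets="1;1"/><link xtargets="2;2"/>'
        b"</linkGrp></cesAlign>"
    )
    print(count_links(io.BytesIO(sample)))
```

For the real data, the stream would be the downloaded alignment file opened in binary mode (through `gzip.open` if it is gzip-compressed). If the counter keeps advancing, expat itself is probably not stuck and the time is going somewhere else in the pipeline; if an `ExpatError` is raised, its message points at the offending position.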
Workaround: (re)download the corpus directly in moses format from https://opus.nlpl.eu/CCMatrix.php