opus_read fails to extract CCMatrix #32

@Waino

I tried to extract the aligned sentence pairs from CCMatrix, which I had previously downloaded using opus_express. The command I used was:

opus_read --source en --target fi --directory CCMatrix --preprocess xml --leave_non_alignments_out --write_mode moses --write CCMatrix.raw.en CCMatrix.raw.fi --write_ids CCMatrix.raw.ids

The command runs for several days at 100% CPU, without producing any output. Perhaps expat is choking on some error in the data. To rule out package corruption after download, I allowed opus_read to download it again, with the same hanging result.

Traceback when killed:
  File "/home/stiggronroos/venvs/opustools/bin/opus_read", line 135, in <module>
    OpusRead(**vars(args)).printPairs()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/opus_read.py", line 214, in printPairs
    self.alignmentParser.collect_links()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/alignment_parser.py", line 107, in collect_links
    blocks = self.bp.get_complete_blocks()
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/block_parser.py", line 98, in get_complete_blocks
    self.parse_line(line)
  File "/home/stiggronroos/venvs/opustools/lib/python3.6/site-packages/opustools/parse/block_parser.py", line 82, in parse_line
    self.p.Parse(line)
KeyboardInterrupt
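
The traceback ends in an incremental expat `Parse(line)` call, so one way to check whether expat is really stuck (rather than just very slow on a huge file) might be to feed the alignment file to expat line by line outside opustools and watch the progress. The following is only a rough sketch: `count_links` and `report_every` are names made up for this example, and it assumes the alignment file uses `<link>` elements for sentence links.

```python
import time
import xml.parsers.expat


def count_links(lines, report_every=1000000):
    """Feed lines to expat one at a time and count <link> start tags.

    Hypothetical helper for diagnosis only; not part of opustools.
    Returns (number of lines fed, number of <link> elements seen).
    """
    parser = xml.parsers.expat.ParserCreate()
    counts = {"link": 0}

    def start_element(name, attrs):
        # Assumption: sentence alignments are stored as <link ... /> elements.
        if name == "link":
            counts["link"] += 1

    parser.StartElementHandler = start_element
    start = time.time()
    n = 0
    for n, line in enumerate(lines, 1):
        # Incremental parse, same call style as in the traceback above.
        parser.Parse(line)
        if n % report_every == 0:
            elapsed = time.time() - start
            print(f"{n} lines, {counts['link']} links, {elapsed:.1f}s", flush=True)
    return n, counts["link"]
```

It could be run on the alignment file by passing an open text-mode file handle (for a gzipped file, something like `gzip.open(path, "rt", encoding="utf-8")`, where `path` is wherever the file was downloaded). If the progress lines keep appearing, expat itself is probably not hanging and the time is being spent elsewhere.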

Workaround: (re)download the corpus directly in moses format from https://opus.nlpl.eu/CCMatrix.php
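
For anyone using the workaround: the Moses-format download is two plain-text files with one sentence per line, aligned by line number, so the pairs can be read without any XML parsing. A minimal sketch, assuming the unpacked files are UTF-8; `read_moses_pairs` is a hypothetical helper, and the file names in the usage comment are assumptions about what the archive contains.

```python
import itertools


def read_moses_pairs(src_path, tgt_path, limit=None):
    """Yield (source, target) sentence pairs from two line-aligned files.

    Hypothetical helper; assumes both files are UTF-8 and have one
    sentence per line, with line i in one file aligned to line i in the other.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        pairs = zip(src, tgt)
        if limit is not None:
            # Only look at the first `limit` pairs, e.g. for a quick sanity check.
            pairs = itertools.islice(pairs, limit)
        for s, t in pairs:
            yield s.rstrip("\n"), t.rstrip("\n")


# Usage (file names are assumed, adjust to the unpacked archive):
# for en, fi in read_moses_pairs("CCMatrix.en-fi.en", "CCMatrix.en-fi.fi", limit=5):
#     print(en, "|||", fi)
```

Note that `zip` stops at the shorter file, so if the two files have different line counts the mismatch will not raise an error; comparing line counts first is a cheap sanity check.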
