Skip to content

Latest commit

 

History

History
52 lines (37 loc) · 1.39 KB

README.org

File metadata and controls

52 lines (37 loc) · 1.39 KB

cg-mwesplit

https://travis-ci.org/unhammer/cg-mwesplit.svg

Description

This program reads input in Constraint Grammar format, and splits special “multiword cohorts” into separate cohorts, leaving other cohorts and intervening blanks as they were.

For examples of input/output, see the files in test/, e.g. test/input.simple.cg.

Prerequisites

A C++ compiler that goes all the way to 11.

Tested with gcc-5.2.0, gcc-5.3.1 and clang-703.0.29.

(Should work all the way down to gcc-4.9, but will fail with e.g. gcc-4.8.4 or clang-3.5.0.)

Building

./autogen.sh
./configure  # optionally with argument --prefix=$HOME/my/prefix
make
make install # with sudo if you didn't specify a prefix

Usage

Takes no options, just stdin and stdout:

cg-mwesplit < infile > outfile

More typically, it’ll be in a pipeline after hfst-tokenise and some step that disambiguates multiwords using vislcg3:

echo words go here | hfst-tokenise --gtd tokeniser.pmhfst | vislcg3 -g mwe-dis.cg3 | cg-mwesplit

Troubleshooting

If you get

terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error

then your C++ compiler is too old. See Prerequisites.