A command-line tool for exporting the European Union’s DGT-Translation Memory (a multilingual corpus of the EU’s legislative documents) from a collection of TMX documents into a single SQLite database.
The translation memory is distributed as a collection of ZIP files. Each ZIP file contains a set of TMX (Translation Memory eXchange) files, and each TMX file corresponds to one EUR-Lex document. Translation units contain parallel texts in up to 24 languages.
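To make the structure of a translation unit concrete, here is a minimal Python sketch that reads a TMX fragment into one dictionary per unit. The sample document is hypothetical and heavily trimmed, the language codes (EN-GB, PL-01) are inferred from the column names used later in this README, and the code is an illustration only, not how dgt_parser itself parses files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed TMX fragment; real files carry more metadata
# and usually many more languages per translation unit.
SAMPLE_TMX = """<tmx version="1.4">
  <body>
    <tu>
      <tuv lang="EN-GB"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv lang="PL-01"><seg>Niniejsze rozporządzenie wchodzi w życie.</seg></tuv>
    </tu>
    <tu>
      <tuv lang="EN-GB"><seg>Done at Brussels.</seg></tuv>
    </tu>
  </body>
</tmx>"""

# The language may be given as xml:lang or as a plain lang attribute,
# so both are checked (an assumption about the input files).
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_units(tmx_text):
    """Return one {language_code: segment_text} dict per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[lang] = "".join(seg.itertext())
        units.append(unit)
    return units


units = parse_units(SAMPLE_TMX)
print(units[0]["PL-01"])
print(sorted(units[1]))
```

Note that the second unit has no Polish segment: a translation unit only contains the languages the source document was actually translated into.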
With the Rust toolchain installed, build the binary from source:
git clone git@github.com:malinowskip/dgt_parser.git
cd dgt_parser
cargo build --release
The generated binary will be located at the following path: ./target/release/dgt_parser.
The following command will create a db.sqlite file and populate it with the translation units extracted from the zipped TMX files located in the input directory:
dgt_parser -i <INPUT_DIR> sqlite --output db.sqlite
To display the schema of the generated database, you can run the following command (assuming that SQLite is installed on your system):
sqlite3 db.sqlite ".schema"
The database will contain two tables: translation_units and documents. The latter is a list of source EU documents. Each translation unit belongs to a document, and the translation_units table uses the document_id column as a foreign key referencing the id of the corresponding document.
For convenience, each translation unit is assigned a sequential_number, which is its position within the document it belongs to.
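As an illustration of how document_id and sequential_number fit together, the following Python sketch rebuilds the text of one document in its original order. It runs against an in-memory database with an assumed schema: only the table and column names come from this README, while the column types, the sample rows, and the document_text helper are hypothetical. Run sqlite3 db.sqlite ".schema" to see the real definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: names are from the README, types are guesses.
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE translation_units (
    document_id INTEGER REFERENCES documents (id),
    sequential_number INTEGER,
    en_gb TEXT,
    pl_01 TEXT
);
INSERT INTO documents (id, name) VALUES (1, 'doc-a');
INSERT INTO translation_units VALUES
    (1, 2, 'Second segment', 'Drugi segment'),
    (1, 1, 'First segment', 'Pierwszy segment');
""")


def document_text(conn, document_name, column="en_gb"):
    """Return the segments of one document in their original order."""
    # Column names cannot be bound as parameters, so restrict them explicitly.
    if column not in {"en_gb", "pl_01"}:
        raise ValueError(f"unexpected column: {column}")
    rows = conn.execute(
        f"""
        SELECT tu.{column}
        FROM translation_units tu
        JOIN documents d ON d.id = tu.document_id
        WHERE d.name = ?
        ORDER BY tu.sequential_number
        """,
        (document_name,),
    )
    return [text for (text,) in rows]


print(document_text(conn, "doc-a"))
print(document_text(conn, "doc-a", column="pl_01"))
```

Ordering by sequential_number matters because SQLite does not guarantee any row order without an ORDER BY clause.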
Using the generated SQLite database:
--- Simple query ---
SELECT en_gb, pl_01 FROM translation_units WHERE pl_01 LIKE '%jednakowoż%' LIMIT 5;
--- Joining translation units with documents ---
SELECT
tu.en_gb,
d.name
FROM translation_units tu
JOIN documents d ON tu.document_id = d.id
LIMIT 5;
Note: creating a full-text search index requires the SQLite FTS5 extension.
Note: in the following example, the search index will only contain entries in Polish and English.
CREATE VIRTUAL TABLE translation_units_fts USING fts5 (
document_id,
en_gb,
pl_01,
content=translation_units
);
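--- Copy rowid explicitly so that index entries stay aligned with the content table ---
INSERT INTO
translation_units_fts (rowid, document_id, en_gb, pl_01)
SELECT
rowid,
document_id,
en_gb,
pl_01
FROM
translation_units;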
Querying the full-text search index:
SELECT * FROM translation_units_fts WHERE en_gb MATCH 'tamper evident' LIMIT 5;
--- Include the name/id of the source document ---
SELECT
d.name,
tu.en_gb,
tu.pl_01
FROM
translation_units_fts tu
JOIN documents d ON d.id = tu.document_id
WHERE en_gb MATCH 'heretofore'
LIMIT 5;
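Two FTS5 details are worth knowing when working with an index like the one above. First, a query such as 'tamper evident' matches rows containing both terms anywhere in the column; wrapping the terms in double quotes turns it into a phrase query. Second, for an index declared with content=..., FTS5 provides a 'rebuild' command that repopulates the index from the content table. The Python sketch below shows both on a tiny in-memory table; the sample rows and the search helper are hypothetical, and it assumes that the SQLite library bundled with your Python build includes FTS5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE translation_units (document_id INTEGER, en_gb TEXT, pl_01 TEXT);
-- Polish texts are left empty to keep the sample short.
INSERT INTO translation_units VALUES
    (1, 'The seal must be tamper evident.', NULL),
    (1, 'Evident signs that someone tried to tamper with it.', NULL),
    (2, 'Unrelated sentence.', NULL);
CREATE VIRTUAL TABLE translation_units_fts USING fts5 (
    document_id, en_gb, pl_01, content=translation_units
);
-- FTS5 'rebuild' command: re-reads every row of the content table.
INSERT INTO translation_units_fts (translation_units_fts) VALUES ('rebuild');
""")


def search(conn, query):
    """Return rowids of translation units whose en_gb text matches query."""
    rows = conn.execute(
        "SELECT rowid FROM translation_units_fts "
        "WHERE en_gb MATCH ? ORDER BY rowid",
        (query,),
    )
    return [rowid for (rowid,) in rows]


all_terms = search(conn, "tamper evident")       # both terms, any position
exact_phrase = search(conn, '"tamper evident"')  # adjacent, in this order
print(all_terms, exact_phrase)
```

Running 'rebuild' again after translation_units changes brings the index back in sync, since an external-content index is not updated automatically.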
Parse all ZIP files in ./input_dir and save all translation units in an SQLite database:
dgt_parser -i ./input_dir sqlite -o db.sqlite
Same as above, but only save phrases in Polish and in English, ignoring other languages. Additional language codes can be added by repeating the -l <LANG_CODE> option.
dgt_parser -l pl -l en -i ./input_dir sqlite -o db.sqlite
Same as above, but only include the translation units that contain texts in all of the specified languages.
dgt_parser --require-each-lang -l pl -l en -i ./input_dir sqlite -o db.sqlite
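The difference between the two filtering modes can be sketched in Python as follows. This is an illustration of the behaviour described above, not dgt_parser's actual implementation: the filter_unit function is hypothetical, and it assumes that a short code such as en selects a full TMX language code such as EN-GB, and that units left with no requested language are skipped.

```python
def filter_unit(unit, langs, require_each=False):
    """Apply a language filter to one translation unit.

    unit maps language codes (e.g. "EN-GB") to texts; langs lists the
    requested short codes (e.g. "en"). Returns the filtered dict, or
    None when the unit should be skipped.
    """
    wanted = {code.lower() for code in langs}
    # Match on the part before the hyphen, so "en" selects "EN-GB"
    # (an assumption about how codes are compared).
    kept = {
        code: text
        for code, text in unit.items()
        if code.split("-")[0].lower() in wanted
    }
    present = {code.split("-")[0].lower() for code in kept}
    if require_each and present != wanted:
        return None  # at least one requested language is missing
    return kept or None  # skip units with none of the requested languages


full = {"EN-GB": "a", "PL-01": "b", "DE-DE": "c"}
english_only = {"EN-GB": "a"}

print(filter_unit(full, ["pl", "en"]))
print(filter_unit(english_only, ["pl", "en"]))
print(filter_unit(english_only, ["pl", "en"], require_each=True))
```

With --require-each-lang, every row in the resulting database has a value in each selected language column, which is convenient when the data is used as a strictly parallel corpus.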
TMX files stored in this repository have been downloaded from the official DGT-Translation Memory website and are the exclusive property of the European Commission.