bitext-aligner (Parallel Corpus Creation)

In the contemporary era of data-driven Natural Language Processing (NLP), parallel corpora have become a key resource for meeting the needs of our multilingual society. Our project was motivated by the goal of creating a parallel corpus that provides an accessible mapping for two language pairs: Russian-English and German-English. The approach can, however, be extended to other languages.

Data forms the backbone of our corpus, and collecting it was the first task, which required us to experiment with several file formats. We chose the FictionBook 2.0 (FB2) file format because it is XML-based, and therefore easy to parse, and because it is compatible with the standard XSD schema format. Furthermore, FB2 is designed for fictional literature, which suits the nature of our data.
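As a rough illustration of why the XML basis of FB2 makes collection convenient, the sketch below reads a book title and its paragraphs using only the Python standard library. The sample document and the function name `extract_fb2` are made up for this example and are not taken from the project's `fb2_parser` module; the namespace URI is the one FB2 2.0 files declare.

```python
import xml.etree.ElementTree as ET

# Namespace declared by FictionBook 2.0 documents.
FB2_NS = {"fb": "http://www.gribuser.ru/xml/fictionbook/2.0"}

# Tiny hand-written FB2 document, used only for illustration.
SAMPLE_FB2 = """<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
  <description>
    <title-info><book-title>Sample Book</book-title></title-info>
  </description>
  <body>
    <section>
      <p>First paragraph.</p>
      <p>Second <emphasis>styled</emphasis> paragraph.</p>
    </section>
  </body>
</FictionBook>"""


def extract_fb2(xml_text):
    """Return (title, paragraphs) from an FB2 document given as a string."""
    root = ET.fromstring(xml_text)
    title = root.findtext(
        "fb:description/fb:title-info/fb:book-title",
        default="",
        namespaces=FB2_NS,
    )
    # itertext() flattens inline markup such as <emphasis> into plain text.
    paragraphs = [
        "".join(p.itertext()).strip()
        for p in root.iterfind("fb:body//fb:p", FB2_NS)
    ]
    return title, [p for p in paragraphs if p]


if __name__ == "__main__":
    book_title, book_paragraphs = extract_fb2(SAMPLE_FB2)
    print(book_title)
    for paragraph in book_paragraphs:
        print("-", paragraph)
```

Real FB2 files carry more metadata (authors, language, annotations) under `description`, which could be read the same way.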

To create an efficiently mapped corpus, we developed an aligner that identifies and matches the corresponding units of the input texts. In our project, these units are sentences. Tokenizing these sentences is the first step of the overall alignment algorithm. The sentences are then translated using Google’s NMT API. At the core of the aligner is the Levenshtein distance, which we use to measure the similarity between the original and the translated sentences. Through an iterative process, we then find the best matches within a constrained window and perform the alignment.
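The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual aligner: it assumes the sentences have already been split and machine-translated (no API call is made here), and the window size, the similarity threshold, the greedy one-to-one matching, and the function names are all assumptions made for the example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalise the distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def align(translated, target, window=2, threshold=0.5):
    """Greedily pair each translated sentence with its best target sentence.

    Only target positions within `window` of the current position are
    considered, and a pair is kept only if its score exceeds `threshold`.
    Returns a list of (translated_index, target_index) tuples.
    """
    pairs = []
    used = set()
    for i, sentence in enumerate(translated):
        start = max(0, i - window)
        stop = min(len(target), i + window + 1)
        best_j, best_score = None, threshold
        for j in range(start, stop):
            if j in used:
                continue
            score = similarity(sentence.lower(), target[j].lower())
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


if __name__ == "__main__":
    machine_translated = [
        "The cat sat on the mat.",
        "It was raining outside.",
        "She opened the door.",
    ]
    published_translation = [
        "The cat sat on a mat.",
        "An extra sentence with no counterpart here.",
        "It was raining outdoors.",
        "She opened the door.",
    ]
    for i, j in align(machine_translated, published_translation):
        print(machine_translated[i], "<->", published_translation[j])
```

Restricting the search to a window keeps the number of comparisons roughly linear in the length of the book and prevents a sentence from being matched to a similar-looking one in a distant chapter. The trade-off is that the window must be wide enough to absorb sentences that one of the texts adds or drops.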

The output of the aligner is then saved to multiple XML files. Recording the file paths in JSON simplified managing the several processing stages, including the addition of the data to a database. We present our corpus as a web page, so we used XSLT to transform the data from multiple XML files into an HTML file. While the XML and HTML formats make the corpus easier to read, they cannot answer queries for the information a user needs. Therefore, we used an SQL database (MySQL) to support querying.
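To make the XML-then-database flow more concrete, here is a small self-contained sketch. It is only an approximation under stated assumptions: the element names (`book`, `sentence`, `text`), the table layout, and the function names are invented for illustration, and Python's built-in `sqlite3` stands in for MySQL so that the example runs without a database server.

```python
import sqlite3
import xml.etree.ElementTree as ET


def pairs_to_xml(pairs, source_lang, target_lang):
    """Serialise aligned sentence pairs to an XML string (hypothetical layout)."""
    root = ET.Element("book")
    for idx, (source, target) in enumerate(pairs, start=1):
        unit = ET.SubElement(root, "sentence", id=str(idx))
        ET.SubElement(unit, "text", lang=source_lang).text = source
        ET.SubElement(unit, "text", lang=target_lang).text = target
    return ET.tostring(root, encoding="unicode")


def load_into_db(conn, xml_text):
    """Copy every sentence from the XML into a single table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentence "
        "(sentence_id INTEGER, lang TEXT, content TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (int(unit.get("id")), text.get("lang"), text.text or "")
        for unit in root.iter("sentence")
        for text in unit.iter("text")
    ]
    conn.executemany("INSERT INTO sentence VALUES (?, ?, ?)", rows)
    conn.commit()


def find_counterparts(conn, word, query_lang, answer_lang):
    """Return (sentence, aligned sentence) pairs whose sentence contains `word`."""
    sql = (
        "SELECT q.content, a.content "
        "FROM sentence AS q "
        "JOIN sentence AS a ON a.sentence_id = q.sentence_id "
        "WHERE q.lang = ? AND a.lang = ? AND q.content LIKE ? "
        "ORDER BY q.sentence_id"
    )
    return conn.execute(sql, (query_lang, answer_lang, f"%{word}%")).fetchall()


if __name__ == "__main__":
    aligned = [
        ("Der Hund schläft.", "The dog sleeps."),
        ("Die Katze trinkt Milch.", "The cat drinks milk."),
    ]
    connection = sqlite3.connect(":memory:")
    load_into_db(connection, pairs_to_xml(aligned, "de", "en"))
    for german, english in find_counterparts(connection, "Katze", "de", "en"):
        print(german, "->", english)
```

The self-join is what the XML and HTML views cannot offer: it looks up a word in one language and returns the aligned sentence in the other. With a MySQL driver the connection setup and the parameter placeholder style would differ from this SQLite stand-in.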

There are several possible enhancements to our project. It could be improved considerably by extending the corpus, aligning efficiently at the word level, including a local NMT engine, and reducing noise. Furthermore, the analysis of translation styles, language learning, and paremiology are some of the areas in which our project could be used.

pavan245/bitext-aligner

Folders and files

Latest commit

History

Repository files navigation

bitext-aligner (Parallel Corpus Creation)

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages