Python implementation of Supervised Morphological Segmentation in a Low-Resource Learning Setting using Conditional Random Fields (Ruokolainen et al., 2013)
First homework of the Natural Language Processing course taught by Prof. Roberto Navigli.
University project • 2017 - Natural Language Processing - MSc in Computer Science, 1st year
The source code is accompanied by a report (report.pdf); I strongly suggest you read it to understand the solution code. The original article can be found here.
We discuss data-driven morphological segmentation, in which word forms are segmented into morphs, the surface forms of morphemes. Our focus is on a low-resource learning setting, in which only a small number of annotated word forms are available for model training, while unannotated word forms are available in abundance. The current state-of-the-art methods 1) exploit both the annotated and unannotated data in a semi-supervised manner, and 2) learn morph lexicons and subsequently uncover segmentations by generating the most likely morph sequences. In contrast, we discuss 1) employing only the annotated data in a supervised manner, while entirely ignoring the unannotated data, and 2) directly learning to predict morph boundaries given their local sub-string contexts instead of learning the morph lexicons. Specifically, we employ conditional random fields, a popular discriminative log-linear model for segmentation. We present experiments on two data sets comprising five diverse languages. We show that the fully supervised boundary prediction approach outperforms the state-of-the-art semi-supervised morph lexicon approaches on all languages when using the same annotated data sets.
The language chosen is Python 2.7; the libraries used are sklearn_crfsuite, to implement the CRF, and pickle, to export the trained model.
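As a minimal sketch of this setup (the hyperparameter values below are illustrative assumptions, not necessarily the ones used in main.py), the model can be instantiated and exported as follows:

```python
import pickle

import sklearn_crfsuite

# CRF trained with L-BFGS and elastic-net regularization;
# c1, c2 and max_iterations are illustrative values.
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
)

# ... after training with crf.fit(...), the model can be serialized:
with open('model.pickle', 'wb') as f:
    pickle.dump(crf, f)

# ... and loaded back later:
with open('model.pickle', 'rb') as f:
    crf = pickle.load(f)
```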
Following the model, the classification problem is well defined once each character of the dataset is represented by a binary feature vector and labeled with the appropriate class among the six possible ones. The feature vector is encoded as a dictionary; in theory it would be very sparse, with few 1s and a large number of 0s, but sklearn_crfsuite accepts as input only the features that are actually present, avoiding the construction and handling of huge dictionaries.
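For illustration, a per-character feature function in the spirit of the paper (left and right substring contexts) could look like the sketch below; the exact feature set used in the project is described in the report, and `max_len` is an illustrative choice:

```python
def char_features(word, i, max_len=4):
    """Feature dictionary for the i-th character of `word`.

    Only the features that are present are emitted, which is
    exactly the sparse input format sklearn_crfsuite expects.
    """
    feats = {'bias': 1.0, 'char': word[i]}
    for n in range(1, max_len + 1):
        # Substrings of length up to n ending before / starting at position i.
        left = word[max(0, i - n):i]
        right = word[i:i + n]
        if left:
            feats['left:' + left] = 1.0
        if right:
            feats['right:' + right] = 1.0
    return feats
```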
The feature dictionaries of the characters of a word are collected in a list, and all these per-word lists are in turn collected in another list representing the desired set (training, dev or test). The same structure is used for the labels, with a label string in place of each dictionary. This data structure coincides with the input format accepted by the CRF class of sklearn_crfsuite.
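Concretely, and reusing the `char_features` sketch above, the layout for a single word would be as follows (the B/M labels here are only illustrative and are not the project's six-class tag set):

```python
# X: list of words, each word being a list of per-character feature dicts.
# y: same nesting, with one label string per character.
word = 'autoissa'  # Finnish: auto + i + ssa
X = [[char_features(word, i) for i in range(len(word))]]
y = [['B', 'M', 'M', 'M', 'B', 'B', 'M', 'M']]

crf.fit(X, y)
print(crf.predict(X))
```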
Referring to the main.py file, the acquisition of the datasets from the given files is performed in the first section of the code, `#COLLECT DATA AND LABELLING`. The construction of the feature dictionaries and their organization into the correct data structure is accomplished by the `prepare_data` function, defined in the `#COMPUTE FEATURES` section and used in the `#DATA PREPARATION AND FIT` section. In the final `#EVALUATION` section, the Precision, Recall and F1 scores are computed and the results are printed to the console.
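One possible way to obtain such scores, assuming a test set prepared in the structure described above (the `X_test`/`y_test` names are hypothetical), is through the flat per-character metrics shipped with sklearn_crfsuite; the project's own evaluation may be computed differently:

```python
from sklearn_crfsuite import metrics

# X_test / y_test are assumed to follow the list-of-lists structure above.
y_pred = crf.predict(X_test)

print('Precision:', metrics.flat_precision_score(y_test, y_pred, average='weighted'))
print('Recall:   ', metrics.flat_recall_score(y_test, y_pred, average='weighted'))
print('F1:       ', metrics.flat_f1_score(y_test, y_pred, average='weighted'))
```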
Results are in general satisfactory, with an F1 score near 0.80. The results obtained with the extra features are slightly better, especially with few training samples, where the improvement in F1 score reaches up to 3%.
See the report for a detailed discussion of the results, including all scores and graphs.