This repository is for the paper Translation-based Lexicalization Generation and Lexical Gap Detection: Application to Kinship Terms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6891–6900, Bangkok, Thailand. Association for Computational Linguistics.
Constructing lexicons with explicitly identified lexical gaps is a vital part of building multilingual lexical resources. Prior work has leveraged bilingual dictionaries and linguistic typologies for semi-automatic identification of lexical gaps. Instead, we propose a generally-applicable algorithmic method to automatically generate concept lexicalizations, which is based on machine translation and hypernymy relations between concepts. The absence of a lexicalization implies a lexical gap. We apply our method to kinship terms, which make a suitable case study because of their explicit definitions and regular structure. Empirical evaluations demonstrate that our approach yields higher accuracy than BabelNet and ChatGPT. Our error analysis indicates that enhancing the quality of translations can further improve the accuracy of our method.
Microsoft Windows 10, Version 10.0.19045.
Note: this OS version was used for development and evaluation, but is not necessarily required for running
Python 3.9.12
Note: this Python version was used for development and evaluation, but is not necessarily required for running
The .csv file "Integrated Resource.csv" is a comprehensive mapping resource with one kinship concept per row. The columns are: Subdomain, Concept, Abbreviation, Hypernyms, All words, Dev words, Test words, Category, Auto gloss, Seed language, Seed word, Actual gloss, WN synset ID, WN synset, BN synset ID, BN synset, and WN/BN gloss.
This folder contains the implementation of the method described in the paper and a file to evaluate the results. The files in this folder allow you to replicate the experiment in Section 5 of our paper, to generate the results we reported in Table 3.
This file contains the implementation of the method described in section 4 of the paper, in which the used translator is Google Translate. The output of this code is a ".txt" file in the "Results" folder. This code requires one additional package other than the default packages, googletrans==3.1.0a0 This file takes 3 arguments.
-l takes a string input that indicates the target language of this run.
-t takes a boolean input that indicates whether to use the prepared set of translations in the folder "Translation".
-b takes a boolean input that indicates whether to use the prepared set of results of back translations in the folder "Translation".
The default target language is English, and the prepared translation and back-translation results will are not used by default.
Sample run:
python goolge_trans_method_cleaned.py -l English -t True -b True
To replicate the results on the 13 languages we reported on the paper, other than the command in the sample run, 12 more commands are needed, one for each remaining language. The remaining commands can be made by replacing "English" with the target language. We reported our results on English, Spanish, Russian, French, German, Mandarin, Persian, Polish, Arabic, Italian, Mongolian, Hungarian and Hindi.
This file contains the code to evaluate the results outputted by "goolge_trans_method_cleaned.py". The output of this code is the Acc of LexGen and F1 Score of LexGap. This file takes 1 argument.
-l takes a string input that indicates the target language.
The default target language is English.
Sample run:
python evaluate_cleaned_version.py -l English
To evaluate the results on other languages, replace "English" in the sample run with the target language you want.
This folder contains a Python file to generate all potential speaker's gender-unspecific concepts. The files in this folder allow you to generate the results we reported in Table 2.
The Python file to generate all potential speaker's gender-unspecific concepts, and the 2 outputted files are: "new_concepts" and "new_relations". This file can be run directly.
Sample run:
python generating_all_concepts_edges.py
The outputted file "new_concepts" contains 3 columns: The subdomain of the concept, the newly generated concept name for the concept, and the abbr name of the concept. The outputted file "new_relations" contains 3 columns: The subdomain of this edge, the abbr name of the hyponym concept, and the abbr name of the hypernym concept.
- Senyu Li - senyu@ualberta.ca
- Bradley Hauer - bmhauer@ualberta.ca
- Ning Shi - ning.shi@ualberta.ca
- Grzegorz Kondrak - gkondrak@ualberta.ca
@inproceedings{li-etal-2024-translation,
title = "Translation-based Lexicalization Generation and Lexical Gap Detection: Application to Kinship Terms",
author = "Li, Senyu and
Hauer, Bradley and
Shi, Ning and
Kondrak, Grzegorz",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.372",
pages = "6891--6900",
abstract = "Constructing lexicons with explicitly identified lexical gaps is a vital part of building multilingual lexical resources. Prior work has leveraged bilingual dictionaries and linguistic typologies for semi-automatic identification of lexical gaps. Instead, we propose a generally-applicable algorithmic method to automatically generate concept lexicalizations, which is based on machine translation and hypernymy relations between concepts. The absence of a lexicalization implies a lexical gap. We apply our method to kinship terms, which make a suitable case study because of their explicit definitions and regular structure. Empirical evaluations demonstrate that our approach yields higher accuracy than BabelNet and ChatGPT. Our error analysis indicates that enhancing the quality of translations can further improve the accuracy of our method.",
}