
Reformatting the lexicon files from usas to tsv #1

Merged
merged 59 commits into UCREL:master on Nov 15, 2021

Conversation

apmoore1
Member

The following comment contains the contents of the Changelog.md file.

Major changes

MWE = Multi Word Expression

  1. Changed the file extension of all semantic lexicon files from `.usas` to `.tsv` and added the relevant header names, e.g. `lemma` and `semantic_tags` for single word lexicons and `mwe_template` and `semantic_tags` for MWE lexicons, in accordance with the Lexicon File Format section within the README.md.
  2. Added a License file which contains the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This file was added so that on GitHub, under the About section in the right hand corner, there is a View License link that directs users to this License file.
  3. Reformatted the README.md so that it contains more structure. New content has been added on the file formats of single word and MWE lexicon files.
  4. Within the MWE lexicon for Chinese, lines 119 and 101 were removed as they contained an MWE template but no USAS tags. The MWE templates were `平顶_noun 女帽_noun` and `一_num 法郎_msr N3.1/I1` respectively.
  5. Within the single word lexicon for Dutch, a tab has been added on line 218 so that the POS entry is now blank/None, as no POS information existed; without this extra tab the TSV file would not be valid. The change means that line 218 went from `alstublieft` followed by a single tab and the semantic tags `E4.2+ E2+ X7+ Z4` to `alstublieft` followed by two tabs (an empty POS column) and the same semantic tags.
  6. For the Malay single word lexicon, I used the test_token_is_equal_to_lemma.py Python script to test whether the first column was a repeat of the second column, and it was: the first column, token, and the second column, lemma, both represented the lemma field. In addition, the Malay single word lexicon also contained a POS column whose only content was the word POS, so this column was also removed.
  7. Created a Scripts section within the README.md; it explains what the new Python scripts do and how to use them.
  8. The Russian MWE semantic lexicon file contained many tab separation errors, e.g. line 89, `без_* {всякой/лишней/излишней/особой/особенной} суеты_* N3.8-`, contained an extra tab between `без_* {всякой/лишней/излишней/особой/особенной}` and `суеты_*` when it should have been a single space, so that the whole MWE template `без_* {всякой/лишней/излишней/особой/особенной} суеты_*` sits in one column followed by the semantic tag `N3.8-` in the next. All of these tab separation errors have been corrected.
  9. In the Russian single word semantic lexicon file, the lemma and token columns are identical; this was tested using the test_token_is_equal_to_lemma.py Python script (a sketch of this check is shown after this list).
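For reference, the sketch below illustrates the kind of check that test_token_is_equal_to_lemma.py performs for items 6 and 9 above. It is only an illustration: the command line interface, the function name, and the assumption that the first two tab separated columns are token and lemma (with no header row, as in the original .usas files) are assumptions of this sketch rather than the real script's behaviour.

```python
import argparse
import csv


def token_equals_lemma(lexicon_path: str) -> bool:
    """Return True if the first (token) column always equals the second (lemma) column.

    Assumes a tab separated lexicon file whose first two columns are token and
    lemma and which has no header row; this mirrors the check described above,
    not the real script's interface.
    """
    with open(lexicon_path, encoding="utf-8", newline="") as lexicon_file:
        for row in csv.reader(lexicon_file, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            if row[0] != row[1]:
                return False
    return True


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Check whether the token column is a repeat of the lemma column."
    )
    parser.add_argument("lexicon_path", help="Path to a single word lexicon file.")
    args = parser.parse_args()
    print(token_equals_lemma(args.lexicon_path))
```

Running a check like this against a single word lexicon file prints True when the first column is an exact repeat of the second, which is the evidence used above for removing the token column.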

@apmoore1
Member Author

I think this is ready to be merged @perayson. Here is an updated list of the major changes that have been made in this pull request:

Major changes

MWE = Multi Word Expression

  1. Changed the file extension of all semantic lexicon files from `.usas` to `.tsv` and added the relevant header names, e.g. `lemma` and `semantic_tags` for single word lexicons and `mwe_template` and `semantic_tags` for MWE lexicons, in accordance with the Lexicon File Format section within the README.md.
  2. Added a License file which contains the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This file was added so that on GitHub, under the About section in the right hand corner, there is a View License link that directs users to this License file.
  3. Reformatted the README.md so that it contains more structure. New content has been added on the file formats of single word and MWE lexicon files.
  4. Within the MWE lexicon for Chinese, line 119 was removed as it contained an MWE template but no USAS tags. The MWE template on line 119 was `平顶_noun 女帽_noun`.
  5. For the Malay single word lexicon, I used the test_token_is_equal_to_lemma.py Python script to test whether the first column was a repeat of the second column, and it was: the first column, token, and the second column, lemma, both represented the lemma field, therefore the token field/column has been removed. In addition, the Malay single word lexicon also contained a POS column whose only content was the word POS, so this column was also removed.
  6. Created a Scripts section within the README.md; it explains what the new Python scripts do and how to use them.
  7. The Russian MWE semantic lexicon file contained many tab separation errors, e.g. line 89, `без_* {всякой/лишней/излишней/особой/особенной} суеты_* N3.8-`, contained an extra tab between `без_* {всякой/лишней/излишней/особой/особенной}` and `суеты_*` when it should have been a single space, so that the whole MWE template `без_* {всякой/лишней/излишней/особой/особенной} суеты_*` sits in one column followed by the semantic tag `N3.8-` in the next. All of these tab separation errors have been corrected.
  8. In the Russian single word semantic lexicon file, the lemma and token columns are identical (tested using the test_token_is_equal_to_lemma.py Python script), therefore the token column has been removed, leaving only the lemma column.
  9. The Portuguese single word semantic lexicon file contained a few tab separation errors, e.g. line 1070, `ansiar verb X7+`, had a misplaced tab and is now correctly tab separated into the columns `ansiar`, `verb` and `X7+`; if this had been left as it was, it would have suggested that the semantic tags are `verb` instead of `X7+`.
  10. Added the language_resources.json file, a JSON file that contains metadata on what each lexicon resource file in this repository contains, per language. This file is explained in the USAS Lexicon Meta Data section of the README.md.
  11. Removed the ID column from the Urdu semantic lexicon file, as the ID only represented the line number and nothing else.
  12. Added CONTRIBUTING guidelines for contributing a lexicon resource.
  13. Added a GitHub Action that converts lexicon resources created in text file format, following the CONTRIBUTING guidelines, to TSV format. After conversion it checks that the TSV files are formatted correctly and, if so, adds and commits the TSV files to the repository (a rough sketch of such a formatting check is shown below).
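The actual conversion and validation logic lives in the GitHub Action added by this pull request; the sketch below only illustrates the kind of TSV formatting check described in item 13. The header names come from the Lexicon File Format section referenced in item 1, while the function name and the rule that every non-blank row must have exactly as many fields as the header are assumptions of this sketch.

```python
import csv
import sys

# Header names described in the Lexicon File Format section of the README.md.
SINGLE_WORD_HEADERS = {"lemma", "semantic_tags"}
MWE_HEADERS = {"mwe_template", "semantic_tags"}


def tsv_is_well_formed(tsv_path: str) -> bool:
    """Check that a lexicon TSV file has a recognised header and consistent rows.

    This is an illustrative sketch, not the repository's actual validation code:
    it only checks that the required header names are present and that every
    non-blank row has the same number of tab separated fields as the header.
    """
    with open(tsv_path, encoding="utf-8", newline="") as tsv_file:
        reader = csv.reader(tsv_file, delimiter="\t")
        header = next(reader, None)
        if header is None:
            return False  # an empty file is not a valid lexicon
        if not (SINGLE_WORD_HEADERS <= set(header) or MWE_HEADERS <= set(header)):
            return False  # unrecognised header names
        expected_fields = len(header)
        return all(len(row) == expected_fields for row in reader if row)


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {'OK' if tsv_is_well_formed(path) else 'NOT VALID'}")
```

In a workflow, a check along these lines would run over each converted .tsv file before anything is committed back to the repository.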

@apmoore1
Member Author

The last two commits should fix #7

@perayson merged commit b0740fc into UCREL:master on Nov 15, 2021
@perayson
Member

All merged, huge thanks!
