
Reformatting the lexicon files from usas to tsv #1

Merged
merged 59 commits into UCREL:master on Nov 15, 2021

Conversation

apmoore1
Member

The following comment contains the contents of the Changelog.md file.

Major changes

MWE = Multi Word Expression

  1. Changed the file extension of all semantic lexicon files from `.usas` to `.tsv` and added the relevant header names, e.g. `lemma` and `semantic_tags` for single word lexicons and `mwe_template` and `semantic_tags` for MWE lexicons, in accordance with the Lexicon File Format section within the README.md.
  2. Added a License file which contains the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This file was added so that on GitHub, under the About section in the right hand corner, there is a View License link that directs users to this License file.
  3. Reformatted the README.md so that it contains more structure. New content has been added on the file formats of single word and MWE lexicon files.
  4. Within the MWE lexicon for Chinese, lines 119 and 101 were removed as they contained an MWE template but no USAS tags. The MWE templates were `平顶_noun 女帽_noun` and `一_num 法郎_msr N3.1/I1` respectively.
  5. Within the single word lexicon for Dutch, a tab has been added on line 218 so that the POS entry is now blank/None, as no POS information existed; without this extra tab the TSV file would not be valid. The change means that line 218 went from `alstublieft` followed by a single tab and the semantic tags `E4.2+ E2+ X7+ Z4` to `alstublieft` followed by two tabs (an empty POS column) and the same semantic tags.
  6. For the Malay single word lexicon, I used the test_token_is_equal_to_lemma.py Python script to test whether the first column was a repeat of the second column, and it was: the first column, token, and the second column, lemma, both represented the lemma field. In addition, the Malay single word lexicon also contained a POS column whose only content was the word POS, so this column was also removed.
  7. Created a Scripts section within the README.md; it explains what the new Python scripts do and how to use them.
  8. The Russian MWE semantic lexicon file contained many tab separation errors, e.g. line 89, `без_* {всякой/лишней/излишней/особой/особенной} суеты_* N3.8-`, contained an extra tab between `без_* {всякой/лишней/излишней/особой/особенной}` and `суеты_*` when it should have been a single space, so that the whole MWE template `без_* {всякой/лишней/излишней/особой/особенной} суеты_*` sits in one column followed by the semantic tag `N3.8-` in the next. All of these tab separation errors have been corrected.
  9. In the Russian single word semantic lexicon file, the lemma and token columns are identical; this was tested using the test_token_is_equal_to_lemma.py Python script (a sketch of this check is shown after this list).
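For reference, the sketch below illustrates the kind of check that test_token_is_equal_to_lemma.py performs for items 6 and 9 above. It is only an illustration: the command line interface, the function name, and the assumption that the first two tab separated columns are token and lemma (with no header row, as in the original .usas files) are assumptions of this sketch rather than the real script's behaviour.

```python
import argparse
import csv


def token_equals_lemma(lexicon_path: str) -> bool:
    """Return True if the first (token) column always equals the second (lemma) column.

    Assumes a tab separated lexicon file whose first two columns are token and
    lemma and which has no header row; this mirrors the check described above,
    not the real script's interface.
    """
    with open(lexicon_path, encoding="utf-8", newline="") as lexicon_file:
        for row in csv.reader(lexicon_file, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            if row[0] != row[1]:
                return False
    return True


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Check whether the token column is a repeat of the lemma column."
    )
    parser.add_argument("lexicon_path", help="Path to a single word lexicon file.")
    args = parser.parse_args()
    print(token_equals_lemma(args.lexicon_path))
```

Running a check like this against a single word lexicon file prints True when the first column is an exact repeat of the second, which is the evidence used above for removing the token column.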

@apmoore1
Member Author

I think this is ready to be merged @perayson. Here is an updated list of the major changes that have been made in this pull request:

Major changes

MWE = Multi Word Expression

  1. Changed the file extension of all semantic lexicon files from `.usas` to `.tsv` and added the relevant header names, e.g. `lemma` and `semantic_tags` for single word lexicons and `mwe_template` and `semantic_tags` for MWE lexicons, in accordance with the Lexicon File Format section within the README.md.
  2. Added a License file which contains the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This file was added so that on GitHub, under the About section in the right hand corner, there is a View License link that directs users to this License file.
  3. Reformatted the README.md so that it contains more structure. New content has been added on the file formats of single word and MWE lexicon files.
  4. Within the MWE lexicon for Chinese, line 119 was removed as it contained an MWE template but no USAS tags. The MWE template on line 119 was `平顶_noun 女帽_noun`.
  5. For the Malay single word lexicon, I used the test_token_is_equal_to_lemma.py Python script to test whether the first column was a repeat of the second column, and it was: the first column, token, and the second column, lemma, both represented the lemma field, therefore the token field/column has been removed. In addition, the Malay single word lexicon also contained a POS column whose only content was the word POS, so this column was also removed.
  6. Created a Scripts section within the README.md; it explains what the new Python scripts do and how to use them.
  7. The Russian MWE semantic lexicon file contained many tab separation errors, e.g. line 89, `без_* {всякой/лишней/излишней/особой/особенной} суеты_* N3.8-`, contained an extra tab between `без_* {всякой/лишней/излишней/особой/особенной}` and `суеты_*` when it should have been a single space, so that the whole MWE template `без_* {всякой/лишней/излишней/особой/особенной} суеты_*` sits in one column followed by the semantic tag `N3.8-` in the next. All of these tab separation errors have been corrected.
  8. In the Russian single word semantic lexicon file, the lemma and token columns are identical (tested using the test_token_is_equal_to_lemma.py Python script), therefore the token column has been removed, leaving only the lemma column.
  9. The Portuguese single word semantic lexicon file contained a few tab separation errors, e.g. line 1070, `ansiar verb X7+`, had a misplaced tab and is now correctly tab separated into the columns `ansiar`, `verb` and `X7+`; if this had been left as it was, it would have suggested that the semantic tags are `verb` instead of `X7+`.
  10. Added the language_resources.json file, a JSON file that contains metadata on what each lexicon resource file in this repository contains, per language. This file is explained in the USAS Lexicon Meta Data section of the README.md.
  11. Removed the ID column from the Urdu semantic lexicon file, as the ID only represented the line number and nothing else.
  12. Added CONTRIBUTING guidelines for contributing a lexicon resource.
  13. Added a GitHub Action that converts lexicon resources created in text file format, following the CONTRIBUTING guidelines, to TSV format. After conversion it checks that the TSV files are formatted correctly and, if so, adds and commits the TSV files to the repository (a rough sketch of such a formatting check is shown below).
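The actual conversion and validation logic lives in the GitHub Action added by this pull request; the sketch below only illustrates the kind of TSV formatting check described in item 13. The header names come from the Lexicon File Format section referenced in item 1, while the function name and the rule that every non-blank row must have exactly as many fields as the header are assumptions of this sketch.

```python
import csv
import sys

# Header names described in the Lexicon File Format section of the README.md.
SINGLE_WORD_HEADERS = {"lemma", "semantic_tags"}
MWE_HEADERS = {"mwe_template", "semantic_tags"}


def tsv_is_well_formed(tsv_path: str) -> bool:
    """Check that a lexicon TSV file has a recognised header and consistent rows.

    This is an illustrative sketch, not the repository's actual validation code:
    it only checks that the required header names are present and that every
    non-blank row has the same number of tab separated fields as the header.
    """
    with open(tsv_path, encoding="utf-8", newline="") as tsv_file:
        reader = csv.reader(tsv_file, delimiter="\t")
        header = next(reader, None)
        if header is None:
            return False  # an empty file is not a valid lexicon
        if not (SINGLE_WORD_HEADERS <= set(header) or MWE_HEADERS <= set(header)):
            return False  # unrecognised header names
        expected_fields = len(header)
        return all(len(row) == expected_fields for row in reader if row)


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: {'OK' if tsv_is_well_formed(path) else 'NOT VALID'}")
```

In a workflow, a check along these lines would run over each converted .tsv file before anything is committed back to the repository.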

@apmoore1
Member Author

The last two commits should fix #7

@perayson merged commit b0740fc into UCREL:master on Nov 15, 2021
@perayson
Member

All merged, huge thanks!
