Reformatting the lexicon files from usas to tsv #1

Merged
merged 59 commits into from
Nov 15, 2021
2aaa4fc
License and some text around file format
apmoore1 Oct 20, 2021
51b46a9
Added more headers
apmoore1 Oct 20, 2021
a82465c
Added formatting
apmoore1 Oct 20, 2021
d2b3d28
Single Word Lexicon file format
apmoore1 Oct 20, 2021
26c00a1
MWE file format
apmoore1 Oct 20, 2021
158e589
Welsh changed to TSV and Python script for testing lexicon file format
apmoore1 Oct 20, 2021
52f4afa
Changed , to .
apmoore1 Oct 20, 2021
84ff9db
Chinese MWE and single semantic lexicon in new format
apmoore1 Oct 20, 2021
ce89235
French single semantic lexicon in new format
apmoore1 Oct 20, 2021
9c199bd
Stated the changes made so far
apmoore1 Oct 20, 2021
57ad3e4
Arabic single semantic lexicon in new format
apmoore1 Oct 20, 2021
0d288fc
Czech single semantic lexicon in new format
apmoore1 Oct 20, 2021
10145d7
Dutch single semantic lexicon in new format
apmoore1 Oct 20, 2021
032b247
Finish single semantic lexicon in new format
apmoore1 Oct 21, 2021
0d1e146
Italian single and MWE semantic lexicon in new format
apmoore1 Oct 21, 2021
ecfe827
Malay single semantic lexicon in new format
apmoore1 Oct 21, 2021
863f698
Portuguese single and MWE semantic lexicon in new format
apmoore1 Oct 21, 2021
f0e07a1
Changed MWE file names
apmoore1 Oct 21, 2021
3e0fb1a
Russain single and MWE semantic lexicon in new format
apmoore1 Oct 21, 2021
700ed26
Spainsh single and MWE semantic lexicon in new format
apmoore1 Oct 21, 2021
f551d4c
Swedish single and MWE semantic lexicon in new format
apmoore1 Oct 21, 2021
8bb98f5
Urdu single semantic lexicon in new format
apmoore1 Oct 21, 2021
1dcb3cb
Swedish single and MWE semantic lexicon in new format
apmoore1 Oct 21, 2021
bd9232c
Added header names
apmoore1 Oct 21, 2021
4b86c36
Correct line removed and original line kept
apmoore1 Nov 5, 2021
146dece
Updated changelog
apmoore1 Nov 5, 2021
cdfb438
Made the test_collection.py script stricter
apmoore1 Nov 6, 2021
6133b74
Resolved tab space issue
apmoore1 Nov 6, 2021
1b4e53c
Added APA citation format
apmoore1 Nov 9, 2021
d1eaf07
Script that tests the format of all of the lexicon files
apmoore1 Nov 9, 2021
fdae9b6
Whether spelling mistake corrected
apmoore1 Nov 9, 2021
3efaa9b
Added POS tag intj
apmoore1 Nov 9, 2021
01b0be6
Lexicon resource statistics
apmoore1 Nov 10, 2021
9c32a0f
Language resources reformatted and contain BCP 47 language codes
apmoore1 Nov 10, 2021
e782604
New meta data format explained
apmoore1 Nov 11, 2021
06a9521
Updated with new lexicon resource meta data
apmoore1 Nov 11, 2021
2cc04c1
Added BCP 47 code to header of meta data statistics table
apmoore1 Nov 11, 2021
9746657
The changlog line removed was fixed with commit 3efaa9b
apmoore1 Nov 11, 2021
3ca5b56
Removed the token column as it was a duplicate of the lemma column
apmoore1 Nov 11, 2021
65aaf9b
Removed the token column as it was a duplicate of the lemma column
apmoore1 Nov 11, 2021
9322a48
Removed id column
apmoore1 Nov 11, 2021
e16fd12
Renamed column from unknown to feature
apmoore1 Nov 11, 2021
7b35d95
Contributing guidelines
apmoore1 Nov 12, 2021
d4381dd
Arabic text file
apmoore1 Nov 12, 2021
56f573d
All text versions of the lexicons
apmoore1 Nov 12, 2021
b8f3e83
Removed tabs that are not used in the header row
apmoore1 Nov 15, 2021
b55ffd8
Text to TSV file conversion
apmoore1 Nov 15, 2021
7e2fccd
Environment name issue resolved
apmoore1 Nov 15, 2021
f5206d7
Converted TXT to TSV
github-actions[bot] Nov 15, 2021
f788e49
Corrected removal mistake
apmoore1 Nov 15, 2021
3ef2f05
Converted TXT to TSV
github-actions[bot] Nov 15, 2021
91ff639
Formatting
apmoore1 Nov 15, 2021
c9e3ab0
All major changes
apmoore1 Nov 15, 2021
21ac563
Merge branch 'master' of github.com:apmoore1/Multilingual-USAS
apmoore1 Nov 15, 2021
b5ad6b9
stop it from failing when nothing to commit
apmoore1 Nov 15, 2021
307e2f4
Corrected if statement syntax
apmoore1 Nov 15, 2021
8dc65f6
Add CI badge
apmoore1 Nov 15, 2021
85b8579
Automatic lexicon statistics
apmoore1 Nov 15, 2021
daf017b
Updated lexicon statistics table
github-actions[bot] Nov 15, 2021
Contributing guidelines
apmoore1 committed Nov 12, 2021
commit 7b35d951944401caeebbbbd2cb00ef7c67ed722f
26 changes: 26 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,26 @@
# Contributing Guidelines

Thank you for considering contributing to the USAS lexicon resources. These guidelines will help you understand how the repository is organised and how you can add to it.

## Contributing a Lexicon Resource

All lexicon resources are organised by language name, for example all of the `Welsh` lexicons are within the [../Welsh](../Welsh) folder. We also have a meta data file, [../lexicon_resources.json](../lexicon_resources.json), which describes these resources in more detail and in a structured format.

The steps to take when wanting to contribute a lexicon resource:

1. When constructing the resource, write it in the format that matches the type of resource:
    1. For single word lexicon files, follow the [single word lexicon file format described in the README](../README.md#single-word-lexicon-file-format).
    2. For Multi Word Expression (MWE) lexicon files, follow the [MWE lexicon file format described in the README](../README.md#multi-word-expression-mwe-lexicon-file-format).

    In both cases the README describes these files as TSV files, **however** at this stage they should be **text files**. They should follow the TSV format in having header names and tabs separating the columns. The main difference is that these text files can also include comments, using a `#` symbol at the start of each comment line, **but** a comment line cannot include a tab character.

2. Once you have constructed your text file version of the lexicon file, **fork** this GitHub repository so that you can make a **pull request** in step 5.
3. Within your forked version of this repository, save your text version of the lexicon file to the relevant language name folder. If that folder does not exist, create it.
4. Add your resource to the meta data file, [../lexicon_resources.json](../lexicon_resources.json). The [USAS Lexicon Meta Data section of the README](../README.md#usas-lexicon-meta-data) should be a good guide on how to do this. **NOTE** that the file path to your lexicon file should have the `.tsv` file extension rather than the expected `.txt`, as in a later step our code will automatically create a `TSV` file from your text file version of the lexicon.
5. Commit your changes to your forked version of the repository, and submit your pull request.
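
Before opening a pull request, it can help to check the text file locally against the rules described in step 1. The following is only a minimal sketch and not the repository's own validation script: the function name `validate_lexicon_text` is hypothetical, the column names `lemma` and `semantic_tags` in the example are illustrative, and the rule that every row has the same number of columns as the header is an assumption.

```python
from typing import List


def validate_lexicon_text(text: str) -> List[str]:
    """Return a list of format problems found in a text-file lexicon."""
    problems: List[str] = []
    rows = []  # (line number, line) pairs for every non-comment line
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("#"):
            # Comments are allowed, but a comment line must not contain a tab.
            if "\t" in line:
                problems.append(f"line {number}: comment line contains a tab")
            continue
        rows.append((number, line))

    if not rows:
        problems.append("no header row found")
        return problems

    # The first non-comment line is treated as the header row.
    header = rows[0][1].split("\t")
    for number, line in rows[1:]:
        columns = line.split("\t")
        # Assumption: every row has as many columns as the header.
        if len(columns) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} columns, "
                f"found {len(columns)}"
            )
    return problems


# Illustrative input only; the column names are not taken from a real lexicon.
example = "# Hypothetical example\nlemma\tsemantic_tags\nansiar\tX7+\n"
print(validate_lexicon_text(example))  # prints []
```

An empty list means the sketch found nothing to report. The validation checks that run on the pull request remain the authoritative test.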

Once your pull request has been submitted, GitHub Actions will perform some validation checks on your lexicon file and then convert it from a **text file** to a **TSV file**. The conversion removes all comments, so only the text file will have comments; the TSV file will be comment free and represent a standard TSV file. If any of the validation checks fail, we will work with you on the pull request so that your lexicon passes all of them.
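
To illustrate what the text to TSV conversion does, here is a rough sketch under the assumption that a comment line is any line starting with `#`. The function name `text_to_tsv` is hypothetical and this is not the code that GitHub Actions runs.

```python
def text_to_tsv(text: str) -> str:
    """Drop comment lines so that only the header and data rows remain."""
    kept = [line for line in text.splitlines() if not line.startswith("#")]
    if not kept:
        return ""
    # Re-join the remaining lines, ending the file with a single newline.
    return "\n".join(kept) + "\n"


# Illustrative input only; the column names are not taken from a real lexicon.
text_version = (
    "# Hypothetical comment at the top of the text file\n"
    "lemma\tsemantic_tags\n"
    "# Another comment between rows\n"
    "ansiar\tX7+\n"
)
tsv_version = text_to_tsv(text_version)
```

Here `tsv_version` holds only the header row and the `ansiar` row, while the text version keeps its comments.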

## Any problems, contact us

If you have any problems let us know at: `ucrel@lancaster.ac.uk`
3 changes: 2 additions & 1 deletion Changelog.md
@@ -12,4 +12,5 @@ MWE = Multi Word Expression
8. In the [Russian single word semantic lexicon file](./Russian/semantic_lexicon_rus.tsv), the `lemma` and `token` columns are identical, as tested using the [test_token_is_equal_to_lemma.py python script](./test_token_is_equal_to_lemma.py). Therefore the `token` column has been removed, leaving only the `lemma` column.
9. The [Portuguese single semantic lexicon file](./Portuguese/semantic_lexicon_pt.tsv) contained a few tab separation errors e.g. line 1070 was `ansiar verb X7+` and is now `ansiar verb X7+`. Had this been left unchanged, it would suggest that the semantic tags are `verb` instead of `X7+`.
10. The `language_resources.json` file has been added, which is a JSON file that contains meta data on what each lexicon resource file contains in this repository per language. This file is explained in the `USAS Lexicon Meta Data` section of the `README.md`
-11. Removed `ID` column in the [Urdu semantic lexicon file](./Urdu/Urdu_Semantic_Lexicon.tsv), as the ID only represented line number and nothing else.
+11. Removed `ID` column in the [Urdu semantic lexicon file](./Urdu/Urdu_Semantic_Lexicon.tsv), as the ID only represented line number and nothing else.
+12. Added [CONTRIBUTING guidelines](./CONTRIBUTING.md) for contributing a lexicon resource.