Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #4 | Extend & Merge carryover ud dataloaders #247

Merged
merged 24 commits into from
May 31, 2024

Conversation

ijindal
Copy link
Contributor

@ijindal ijindal commented Dec 28, 2023

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Closes #4

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

@ijindal ijindal changed the title ud loaders for POS tagging Closes #4 | ud loaders for POS tagging Dec 28, 2023
@sabilmakbar
Copy link
Collaborator

Hi @ijindal, thanks for the contribution in constructing a dataloader for the SEACrowd Project!

At a glance, the implementation looks okay. However, it's not aligned with SEACrowd Dataloaders. In SEACrowd, one dataloader script is intended for one dataset, not a subset of a dataset. Hence, the expected code is a Dataset Loading Script where all subsets mentioned in the issue ticket are implemented. In that case, would you like to revise the implementation? Many thanks!

Copy link
Collaborator

@gentaiscool gentaiscool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the contribution. I suggest to separate the change for the global util in a new PR. @SamuelCahyawijaya @holylovenia fyi

seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud.py Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud_id_csui.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud_id_gsd.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud_vi_vtb.py Outdated Show resolved Hide resolved
seacrowd/utils/common_parser.py Outdated Show resolved Hide resolved
@gentaiscool gentaiscool self-assigned this Jan 3, 2024
@gentaiscool
Copy link
Collaborator

Hi @ijindal, thanks for the contribution in constructing a dataloader for the SEACrowd Project!

At a glance, the implementation looks okay. However, it's not aligned with SEACrowd Dataloaders. In SEACrowd, one dataloader script is intended for one dataset, not a subset of a dataset. Hence, the expected code is a Dataset Loading Script where all subsets mentioned in the issue ticket are implemented. In that case, would you like to revise the implementation? Many thanks!

I agree with @sabilmakbar's suggestion. Let's make them separate.

@ijindal
Copy link
Contributor Author

ijindal commented Jan 4, 2024

Hi, I am working on fixing these issue.

However, I have one comment on UD parse. Do we really want separate folder for each dataset such as ud_vi_vtb? If we go by this it will blow up the number of folders in the data repo (Assuming more addition of SEA languages in UD)

@ijindal
Copy link
Contributor Author

ijindal commented May 22, 2024

Hi @ijindal, I'm sorry for the super late clarification. Since HF Dataset cannot detect two or more dataset classes on the same script, we have to create an if-else branching for each subset we're implementing for this dataloader.

I'd suggest to refer this implementation (https://huggingface.co/datasets/google/xtreme_s/blob/main/xtreme_s.py#L276) with adjustments: instead of using self.config.dataset_name as used in xTremeS, we can use self.config.subset_name instead.

Thank for this pointer. I have updated the reader as per the pointer.

@ijindal
Copy link
Contributor Author

ijindal commented May 22, 2024

@sabilmakbar please review it.

Copy link
Collaborator

@sabilmakbar sabilmakbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviews after trying to run the scripts

idk why the tagalog subsets can't be loaded (prob due to different .conll format). I'll take a look on it.

seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved



_SUPPORTED_TASKS = [Tasks.POS_TAGGING]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously we have DEPENDENCY_PARSING and MACHINE_TRANSLATION tasks for this dataloader, which u can refer to the following dataloaders:

  1. https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/ud_jv_csui/ud_jv_csui.py
  2. https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/ud_id_csui/ud_id_csui.py

while the first two have all 3 tasks, the following dataloaders are also included in this dataset, but only DEPENDENCY_PARSING has been implemented previously. do you have any take whether it shd be extended to 2 other tasks, @holylovenia?
3. https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/indolem_ud_id_gsd/indolem_ud_id_gsd.py
4. https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/indolem_ud_id_pud/indolem_ud_id_pud.py

seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved
@sabilmakbar sabilmakbar changed the title Closes #4 | ud loaders for POS tagging Closes #4 | Extend & Merge carryover ud dataloaders May 24, 2024
@sabilmakbar
Copy link
Collaborator

sabilmakbar commented May 26, 2024

Hi @ijindal, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

Hi @ijindal, do you think we will be able to finish it by the end of the month? I can help modify the dataloaders accordingly if you can't (and the point will still be fully attributed to yours regardless). Also, I think @elyanah-aco will be away next week, right? (So we can pass it on to another reviewer if she hasn't provided any approval by then.)

@sabilmakbar
Copy link
Collaborator

changing it to +3 with the consent from @holylovenia after discussing refactor & expanding subset efforts on this

ijindal and others added 6 commits May 28, 2024 19:32
Co-authored-by: Salsabil Maulana Akbar <maulana.1998@yahoo.co.id>
Co-authored-by: Salsabil Maulana Akbar <maulana.1998@yahoo.co.id>
Co-authored-by: Salsabil Maulana Akbar <maulana.1998@yahoo.co.id>
@ijindal
Copy link
Contributor Author

ijindal commented May 28, 2024

@sabilmakbar Yeah, I have provided the fix as per the comments. Let me know what tasks are remaining I can finish by end of this month.

@ijindal ijindal requested a review from sabilmakbar May 28, 2024 15:08
@ijindal
Copy link
Contributor Author

ijindal commented May 29, 2024

Also, I find that #592 issue is very similar to this issue. I do not think that it requires a separate loader. If you feel the same I can add the conll17 loader in this loader as well

@holylovenia
Copy link
Contributor

Hi @ijindal, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @sabilmakbar @elyanah-aco

@sabilmakbar
Copy link
Collaborator

Hi @ijindal, I've created an extended version that works for the other two tasks (but some subsets are yet to be implemented in all three tasks). If you don't mind, I'll code-diff it and push it to this branch (to speed up the process). I'll also add the test output log for each task (with subsets adjusted accordingly due to coverage). Also, for one of the subsets ud_id_gsd, I'm reverting it to the IndoLEM version (available on SEACrowd but on a different dataloader and not taken from the UD v2.13 repo).

Also, I find that #592 issue is very similar to this issue. I do not think that it requires a separate loader. If you feel the same I can add the conll17 loader in this loader as well

Prob we can think abt it later (not on this issue, tho).

extend `ud` dataloader to multiple tasks, adjust data loading methods based on existing dataloaders, and add custom citations per subsets
@sabilmakbar
Copy link
Collaborator

sabilmakbar commented May 31, 2024

my test-run result (in log file):
test_ud.log

how to run ($1 var is ud):

SUBSET_CONFIG=(id_csui id_pud id_gsd vi_vtb tl_trg tl_ugnayan)
schema="SEQ_LABEL"

for val in ${!SUBSET_CONFIG[@]}; do
    subset=${SUBSET_CONFIG[$val]}
    echo "Executing Extractor on iteration no $((val+1)) of total ${#SUBSET_CONFIG[@]} for subset $subset and schema $schema"
    python -m tests.test_seacrowd seacrowd/sea_datasets/$1/$1.py --subset_id "$1_$subset" --schema $schema
done

SUBSET_CONFIG=(id_csui id_gsd vi_vtb tl_trg)
schema="KB"

for val in ${!SUBSET_CONFIG[@]}; do
    subset=${SUBSET_CONFIG[$val]}
    echo "Executing Extractor on iteration no $((val+1)) of total ${#SUBSET_CONFIG[@]} for subset $subset and schema $schema"
    python -m tests.test_seacrowd seacrowd/sea_datasets/$1/$1.py --subset_id "$1_$subset" --schema $schema
done

SUBSET_CONFIG=(id_csui id_pud tl_trg tl_ugnayan)
schema="SEQ_LABEL"

for val in ${!SUBSET_CONFIG[@]}; do
    subset=${SUBSET_CONFIG[$val]}
    echo "Executing Extractor on iteration no $((val+1)) of total ${#SUBSET_CONFIG[@]} for subset $subset and schema $schema"
    python -m tests.test_seacrowd seacrowd/sea_datasets/$1/$1.py --subset_id "$1_$subset" --schema $schema
done

@sabilmakbar
Copy link
Collaborator

sabilmakbar commented May 31, 2024

for anyone wondering why the DEPENDENCY_PARSING isn't being implemented in all subsets, it's because there are some span errors from the data provided (according to the _URLs and subset_name), which for some of them I can't judge the correctness on morph exception due to language barrier.

@sabilmakbar sabilmakbar assigned fhudi and unassigned elyanah-aco May 31, 2024
Copy link
Collaborator

@fhudi fhudi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Each subset has passed the tests.
The log output following previous review is attached: test_ud.log.
Passed all the checklist.


As per discussed previously, this PR won't include Javanese UD (and probably other subsets) yet, and it is left for another PR.

I think this PR is good to merge.
cc: @holylovenia

@sabilmakbar sabilmakbar merged commit 3d566ea into SEACrowd:master May 31, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend dataset loader for Universal Dependencies
8 participants