Closes #4 | Extend & Merge carryover ud dataloaders #247

ijindal · 2023-12-28T07:05:01Z

Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.

Closes #4

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

sabilmakbar · 2024-01-02T12:08:28Z

Hi @ijindal, thanks for the contribution in constructing a dataloader for the SEACrowd Project!

At a glance, the implementation looks okay. However, it's not aligned with SEACrowd Dataloaders. In SEACrowd, one dataloader script is intended for one dataset, not a subset of a dataset. Hence, the expected code is a Dataset Loading Script where all subsets mentioned in the issue ticket are implemented. In that case, would you like to revise the implementation? Many thanks!

gentaiscool

thank you for the contribution. I suggest to separate the change for the global util in a new PR. @SamuelCahyawijaya @holylovenia fyi

seacrowd/sea_datasets/ud/ud.py

seacrowd/sea_datasets/ud/ud_id_csui.py

seacrowd/sea_datasets/ud/ud_id_gsd.py

seacrowd/sea_datasets/ud/ud_vi_vtb.py

seacrowd/utils/common_parser.py

gentaiscool · 2024-01-03T04:01:25Z

Hi @ijindal, thanks for the contribution in constructing a dataloader for the SEACrowd Project!

At a glance, the implementation looks okay. However, it's not aligned with SEACrowd Dataloaders. In SEACrowd, one dataloader script is intended for one dataset, not a subset of a dataset. Hence, the expected code is a Dataset Loading Script where all subsets mentioned in the issue ticket are implemented. In that case, would you like to revise the implementation? Many thanks!

I agree with @sabilmakbar's suggestion. Let's make them separate.

ijindal · 2024-01-04T04:44:45Z

Hi, I am working on fixing these issue.

However, I have one comment on UD parse. Do we really want separate folder for each dataset such as ud_vi_vtb? If we go by this it will blow up the number of folders in the data repo (Assuming more addition of SEA languages in UD)

ijindal · 2024-05-22T17:34:24Z

Hi @ijindal, I'm sorry for the super late clarification. Since HF Dataset cannot detect two or more dataset classes on the same script, we have to create an if-else branching for each subset we're implementing for this dataloader.

I'd suggest to refer this implementation (https://huggingface.co/datasets/google/xtreme_s/blob/main/xtreme_s.py#L276) with adjustments: instead of using self.config.dataset_name as used in xTremeS, we can use self.config.subset_name instead.

Thank for this pointer. I have updated the reader as per the pointer.

ijindal · 2024-05-22T17:35:00Z

@sabilmakbar please review it.

sabilmakbar

reviews after trying to run the scripts

idk why the tagalog subsets can't be loaded (prob due to different .conll format). I'll take a look on it.

seacrowd/sea_datasets/ud/ud.py

sabilmakbar · 2024-05-24T07:57:57Z

seacrowd/sea_datasets/ud/ud.py

+
+
+
+_SUPPORTED_TASKS = [Tasks.POS_TAGGING]


previously we have DEPENDENCY_PARSING and MACHINE_TRANSLATION tasks for this dataloader, which u can refer to the following dataloaders:

https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/ud_jv_csui/ud_jv_csui.py

https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/ud_id_csui/ud_id_csui.py

while the first two have all 3 tasks, the following dataloaders are also included in this dataset, but only DEPENDENCY_PARSING has been implemented previously. do you have any take whether it shd be extended to 2 other tasks, @holylovenia?
3. https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/indolem_ud_id_gsd/indolem_ud_id_gsd.py
4. https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/sea_datasets/indolem_ud_id_pud/indolem_ud_id_pud.py

seacrowd/sea_datasets/ud/ud.py

sabilmakbar · 2024-05-26T10:38:10Z

Hi @ijindal, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

Hi @ijindal, do you think we will be able to finish it by the end of the month? I can help modify the dataloaders accordingly if you can't (and the point will still be fully attributed to yours regardless). Also, I think @elyanah-aco will be away next week, right? (So we can pass it on to another reviewer if she hasn't provided any approval by then.)

sabilmakbar · 2024-05-26T10:40:02Z

changing it to +3 with the consent from @holylovenia after discussing refactor & expanding subset efforts on this

Co-authored-by: Salsabil Maulana Akbar <maulana.1998@yahoo.co.id>

ijindal · 2024-05-28T15:08:37Z

@sabilmakbar Yeah, I have provided the fix as per the comments. Let me know what tasks are remaining I can finish by end of this month.

ijindal · 2024-05-29T01:54:36Z

Also, I find that #592 issue is very similar to this issue. I do not think that it requires a separate loader. If you feel the same I can add the conll17 loader in this loader as well

holylovenia · 2024-05-30T04:38:19Z

Hi @ijindal, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @sabilmakbar @elyanah-aco

sabilmakbar · 2024-05-31T02:34:08Z

Hi @ijindal, I've created an extended version that works for the other two tasks (but some subsets are yet to be implemented in all three tasks). If you don't mind, I'll code-diff it and push it to this branch (to speed up the process). I'll also add the test output log for each task (with subsets adjusted accordingly due to coverage). Also, for one of the subsets ud_id_gsd, I'm reverting it to the IndoLEM version (available on SEACrowd but on a different dataloader and not taken from the UD v2.13 repo).

Also, I find that #592 issue is very similar to this issue. I do not think that it requires a separate loader. If you feel the same I can add the conll17 loader in this loader as well

Prob we can think abt it later (not on this issue, tho).

extend `ud` dataloader to multiple tasks, adjust data loading methods based on existing dataloaders, and add custom citations per subsets

sabilmakbar · 2024-05-31T08:39:03Z

my test-run result (in log file):
test_ud.log

how to run ($1 var is ud):

SUBSET_CONFIG=(id_csui id_pud id_gsd vi_vtb tl_trg tl_ugnayan)
schema="SEQ_LABEL"

for val in ${!SUBSET_CONFIG[@]}; do
    subset=${SUBSET_CONFIG[$val]}
    echo "Executing Extractor on iteration no $((val+1)) of total ${#SUBSET_CONFIG[@]} for subset $subset and schema $schema"
    python -m tests.test_seacrowd seacrowd/sea_datasets/$1/$1.py --subset_id "$1_$subset" --schema $schema
done

SUBSET_CONFIG=(id_csui id_gsd vi_vtb tl_trg)
schema="KB"

for val in ${!SUBSET_CONFIG[@]}; do
    subset=${SUBSET_CONFIG[$val]}
    echo "Executing Extractor on iteration no $((val+1)) of total ${#SUBSET_CONFIG[@]} for subset $subset and schema $schema"
    python -m tests.test_seacrowd seacrowd/sea_datasets/$1/$1.py --subset_id "$1_$subset" --schema $schema
done

SUBSET_CONFIG=(id_csui id_pud tl_trg tl_ugnayan)
schema="SEQ_LABEL"

for val in ${!SUBSET_CONFIG[@]}; do
    subset=${SUBSET_CONFIG[$val]}
    echo "Executing Extractor on iteration no $((val+1)) of total ${#SUBSET_CONFIG[@]} for subset $subset and schema $schema"
    python -m tests.test_seacrowd seacrowd/sea_datasets/$1/$1.py --subset_id "$1_$subset" --schema $schema
done

sabilmakbar · 2024-05-31T08:44:34Z

for anyone wondering why the DEPENDENCY_PARSING isn't being implemented in all subsets, it's because there are some span errors from the data provided (according to the _URLs and subset_name), which for some of them I can't judge the correctness on morph exception due to language barrier.

seacrowd/utils/common_parser.py

fhudi

LGTM.
Each subset has passed the tests.
The log output following previous review is attached: test_ud.log.
Passed all the checklist.

As per discussed previously, this PR won't include Javanese UD (and probably other subsets) yet, and it is left for another PR.

I think this PR is good to merge.
cc: @holylovenia

U-CORP\ishan.jindal and others added 3 commits December 28, 2023 12:08

add initiaizer

4f3e704

add ud driver

74239d8

subsets datasets loaders

a40363c

ijindal requested review from holylovenia, SamuelCahyawijaya, sabilmakbar, jamesjaya, yongzx, gentaiscool, ljvmiranda921, RosenZhang and fajri91 as code owners December 28, 2023 07:05

Update common_parser.py for utf8 encoding

c8ce267

ijindal changed the title ~~ud loaders for POS tagging~~ Closes #4 | ud loaders for POS tagging Dec 28, 2023

gentaiscool requested changes Jan 3, 2024

View reviewed changes

gentaiscool self-assigned this Jan 3, 2024

holylovenia assigned ryanignatius Jan 3, 2024

holylovenia requested review from gentaiscool and ryanignatius and removed request for jamesjaya, SamuelCahyawijaya, fajri91, ljvmiranda921, RosenZhang, holylovenia, yongzx and sabilmakbar January 3, 2024 08:08

updated as per SEACrowd#247 (comment)

899f089

sabilmakbar reviewed May 24, 2024

View reviewed changes

sabilmakbar changed the title ~~Closes #4 | ud loaders for POS tagging~~ Closes #4 | Extend & Merge carryover ud dataloaders May 24, 2024

sabilmakbar reviewed May 24, 2024

View reviewed changes

seacrowd/sea_datasets/ud/ud.py Outdated Show resolved Hide resolved

sabilmakbar added bonus +3 and removed bonus +1 labels May 26, 2024

ijindal and others added 6 commits May 28, 2024 19:32

Update seacrowd/sea_datasets/ud/ud.py

8aeab01

Co-authored-by: Salsabil Maulana Akbar <maulana.1998@yahoo.co.id>

Update seacrowd/sea_datasets/ud/ud.py

a4ede1a

Co-authored-by: Salsabil Maulana Akbar <maulana.1998@yahoo.co.id>

Update seacrowd/sea_datasets/ud/ud.py

12b1205

Co-authored-by: Salsabil Maulana Akbar <maulana.1998@yahoo.co.id>

Merge branch 'SEACrowd:master' into master

2571948

fix few of the comments

7bef0ce

fix few of the comments

d03db9c

ijindal requested a review from sabilmakbar May 28, 2024 15:08

Update ud.py

ebc5d66

extend `ud` dataloader to multiple tasks, adjust data loading methods based on existing dataloaders, and add custom citations per subsets

fhudi reviewed May 31, 2024

View reviewed changes

seacrowd/utils/common_parser.py Outdated Show resolved Hide resolved

sabilmakbar assigned fhudi and unassigned elyanah-aco May 31, 2024

Merge branch 'master' into master

874b2b4

fhudi approved these changes May 31, 2024

View reviewed changes

sabilmakbar merged commit 3d566ea into SEACrowd:master May 31, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Closes #4 | Extend & Merge carryover ud dataloaders #247

Closes #4 | Extend & Merge carryover ud dataloaders #247

ijindal commented Dec 28, 2023 •

edited

Loading

sabilmakbar commented Jan 2, 2024

gentaiscool left a comment •

edited

Loading

gentaiscool commented Jan 3, 2024

ijindal commented Jan 4, 2024

ijindal commented May 22, 2024

ijindal commented May 22, 2024

sabilmakbar left a comment

sabilmakbar May 24, 2024

sabilmakbar commented May 26, 2024 •

edited

Loading

sabilmakbar commented May 26, 2024

ijindal commented May 28, 2024

ijindal commented May 29, 2024

holylovenia commented May 30, 2024

sabilmakbar commented May 31, 2024

sabilmakbar commented May 31, 2024 •

edited

Loading

sabilmakbar commented May 31, 2024 •

edited

Loading

fhudi left a comment




		_SUPPORTED_TASKS = [Tasks.POS_TAGGING]

Closes #4 | Extend & Merge carryover ud dataloaders #247

Closes #4 | Extend & Merge carryover ud dataloaders #247

Conversation

ijindal commented Dec 28, 2023 • edited Loading

Checkbox

sabilmakbar commented Jan 2, 2024

gentaiscool left a comment • edited Loading

Choose a reason for hiding this comment

gentaiscool commented Jan 3, 2024

ijindal commented Jan 4, 2024

ijindal commented May 22, 2024

ijindal commented May 22, 2024

sabilmakbar left a comment

Choose a reason for hiding this comment

sabilmakbar May 24, 2024

Choose a reason for hiding this comment

sabilmakbar commented May 26, 2024 • edited Loading

sabilmakbar commented May 26, 2024

ijindal commented May 28, 2024

ijindal commented May 29, 2024

holylovenia commented May 30, 2024

sabilmakbar commented May 31, 2024

sabilmakbar commented May 31, 2024 • edited Loading

sabilmakbar commented May 31, 2024 • edited Loading

fhudi left a comment

Choose a reason for hiding this comment

ijindal commented Dec 28, 2023 •

edited

Loading

gentaiscool left a comment •

edited

Loading

sabilmakbar commented May 26, 2024 •

edited

Loading

sabilmakbar commented May 31, 2024 •

edited

Loading

sabilmakbar commented May 31, 2024 •

edited

Loading