Closes #536 | Add/Update Dataloader Onto4All #635

patrickamadeus · 2024-04-09T07:44:09Z

Closes #536

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[.] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

TESTS

NOTES

Since this is a gated dataset, please log in with huggingface-cli or pass your API key to the token= parameter of load_dataset 😄
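A minimal sketch of the two access options mentioned in the note, assuming a recent `datasets` release where `load_dataset` accepts `token=`. The helper names `resolve_hf_token` and `load_gated`, and the use of the `HF_TOKEN` environment variable, are illustrative assumptions and not part of this PR.

```python
import os


def resolve_hf_token(explicit_token=None):
    """Pick the token to send to the Hub (hypothetical helper).

    Preference order: an explicitly passed token, then the HF_TOKEN
    environment variable. Returns None when neither is set, so that the
    credentials cached by `huggingface-cli login` can be used instead.
    """
    return explicit_token or os.environ.get("HF_TOKEN") or None


def load_gated(dataloader_path, config_name, token=None):
    """Load a gated dataset through a local dataloader script (hypothetical helper)."""
    # Imported lazily so the token helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(dataloader_path, name=config_name, token=resolve_hf_token(token))
```

For this PR the first argument would be `seacrowd/sea_datasets/onto4all/onto4all.py`; the config name depends on the subsets defined in `BUILDER_CONFIGS`.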

SamuelCahyawijaya

Hi @patrickamadeus, thank you for contributing! I found two problems in this dataloader:

1. Somehow the conversation data is empty on both the _source and _seacrowd_* schemas.
2. For the _seacrowd_* schema, this is the first time we use QA for chat data. It seems to fit well with the data; perhaps @holylovenia has some feedback on this?

seacrowd/sea_datasets/onto4all/onto4all.py

patrickamadeus · 2024-05-13T08:15:14Z

> Hi @patrickamadeus, thank you for contributing! I found two problems in this dataloader:
>
> 1. Somehow the conversation data is empty on both the _source and _seacrowd_* schemas.
> 2. For the _seacrowd_* schema, this is the first time we use QA for chat data. It seems to fit well with the data; perhaps @holylovenia has some feedback on this?

Hi @SamuelCahyawijaya ! Thank you for the review 😄 .

Please kindly check the latest commit for the fix.

holylovenia · 2024-05-13T08:24:22Z

> For the _seacrowd_* schema, this is the first time we use QA for chat data. It seems to fit well with the data; perhaps @holylovenia has some feedback on this?

Sorry for the late reply, I missed this mention. Is this supposed to accommodate a multi-turn chat template with the user, assistant, and system roles?

While this qa schema seems to fit the dataset well, I think it's better if this dataloader has a different task (e.g., MULTI_TURN_CONVERSATION) with a new schema (with a messages variable like this) to facilitate similar datasets in the future. It will prevent this dataset from being overlooked as another QA task too.

What do you think, @patrickamadeus @SamuelCahyawijaya @yongzx?

cc: @sabilmakbar

SamuelCahyawijaya · 2024-05-13T09:43:16Z

@holylovenia @patrickamadeus @yongzx @sabilmakbar: I kinda agree with the chat format, as it is more standardized and is also supported by Hugging Face. In this case, should we propose the new schema and adjust the score accordingly?

The schema would basically consist of input, output, and meta:

- input would be a list of dictionaries of the form {"role": "<ROLE>", "content": "<CONTENT>"}
- output would be the expected response of the model; in this case, the last turn of the conversation, from gpt
- meta can be used for storing other information, like type in this case

One question though: should we also normalize the <ROLE>? This dataset uses system, human, and gpt. Should these be standardized into something like system, user, and assistant, or do we keep them as is?
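To make the proposal concrete, here is a minimal sketch of how one conversation could be split into input, output, and meta. It assumes each turn is already a dictionary with "role" and "content" keys; the raw dataset's own field names are not shown in this thread, and `to_chat_example` is a hypothetical helper, not code from the dataloader.

```python
def to_chat_example(example_id, turns, meta=None):
    """Split a conversation into the proposed input/output/meta layout.

    `input` holds every turn except the last one, `output` is the content of
    the last turn (the expected model response), and `meta` carries any extra
    fields such as `type`.
    """
    if not turns:
        raise ValueError("conversation has no turns")
    *context, last = turns
    return {
        "id": str(example_id),
        "input": [{"role": t["role"], "content": t["content"]} for t in context],
        "output": last["content"],
        "meta": dict(meta or {}),
    }
```

Roles are passed through unchanged here; whether to normalize them is the open question raised above.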

holylovenia · 2024-05-13T10:04:56Z

> should we propose the new schema and adjust the score accordingly?

I'm of this opinion.

> The schema would basically consist of input, output, and meta:
>
> - input would be a list of dictionaries of the form {"role": "<ROLE>", "content": "<CONTENT>"}
> - output would be the expected response of the model; in this case, the last turn of the conversation, from gpt
> - meta can be used for storing other information, like type in this case
>
> One question though: should we also normalize the <ROLE>? This dataset uses system, human, and gpt. Should these be standardized into something like system, user, and assistant, or do we keep them as is?

Let's normalize it for the seacrowd schema.
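A minimal sketch of what that normalization could look like, mapping this dataset's system, human, and gpt roles onto system, user, and assistant. The names `ROLE_MAP`, `normalize_role`, and `normalize_turns` are hypothetical, and the "role"/"content" keys are an assumption about the turn layout.

```python
# Source role -> normalized role for the seacrowd schema (illustrative mapping).
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def normalize_role(role):
    """Map a source role name to its normalized name, failing loudly on unknown roles."""
    key = role.strip().lower()
    if key not in ROLE_MAP:
        raise ValueError(f"unmapped role: {role!r}")
    return ROLE_MAP[key]


def normalize_turns(turns):
    """Return a copy of the turns with normalized roles; content is left untouched."""
    return [{"role": normalize_role(t["role"]), "content": t["content"]} for t in turns]
```

Raising on an unknown role, rather than passing it through, is a design choice for the sketch: it surfaces unexpected roles during dataloader tests instead of silently mixing naming schemes.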

SamuelCahyawijaya · 2024-05-13T10:25:31Z

@patrickamadeus : would it be ok for you to create the new schema, and adjust the dataloader accordingly?

patrickamadeus · 2024-05-13T10:47:47Z

With pleasure @SamuelCahyawijaya !

> @holylovenia @patrickamadeus @yongzx @sabilmakbar: I kinda agree with the chat format, as it is more standardized and is also supported by Hugging Face. In this case, should we propose the new schema and adjust the score accordingly?
>
> The schema would basically consist of input, output, and meta:
>
> - input would be a list of dictionaries of the form {"role": "<ROLE>", "content": "<CONTENT>"}
> - output would be the expected response of the model; in this case, the last turn of the conversation, from gpt
> - meta can be used for storing other information, like type in this case
>
> One question though: should we also normalize the <ROLE>? This dataset uses system, human, and gpt. Should these be standardized into something like system, user, and assistant, or do we keep them as is?

I will refer to the schema from here for now.

patrickamadeus · 2024-05-13T11:24:48Z

Hi @SamuelCahyawijaya @holylovenia !

Could you please review the new schema and implementation? I named it chat feature for now, feel free to suggest any change!

holylovenia · 2024-05-13T11:46:27Z

> Hi @SamuelCahyawijaya @holylovenia !
>
> Could you please review the new schema and implementation? I named it chat feature for now, feel free to suggest any change!

The schema looks great to me! Let us know if you've separated the schema and new task so we can approve it.

patrickamadeus · 2024-05-19T08:35:20Z

It's done @SamuelCahyawijaya @holylovenia .

holylovenia · 2024-05-21T08:24:07Z

> It's done @SamuelCahyawijaya @holylovenia .

Could you please link the PR for the new schema and task here, @patrickamadeus?

cc: @sabilmakbar because I'll put more focus on the experiments going forward.

patrickamadeus · 2024-05-21T08:29:25Z

Oops, sorry!

I put them all together here in the last commit 😬 @holylovenia .

Should I create a separate PR for it? Sorry for my ignorance.

holylovenia · 2024-05-21T08:34:41Z

> Oops, sorry!
>
> I put them all together here in the last commit 😬 @holylovenia .
>
> Should I create a separate PR for it? Sorry for my ignorance.

Yes, it'd be great if we could have a separate PR. Thanks in advance, @patrickamadeus!!

patrickamadeus · 2024-05-22T16:09:05Z

Hi! here is the chat schema PR #679 @sabilmakbar @holylovenia

sabilmakbar · 2024-05-23T12:04:42Z

Quick question: What's the difference between using this new chat schema and TOD (since we already have it)? If I remember correctly, TOD is a multi-turn dialogue too. Hence, both should be similar in terms of schema.

holylovenia · 2024-05-28T12:54:40Z

> Quick question: What's the difference between using this new chat schema and TOD (since we already have it)? If I remember correctly, TOD is a multi-turn dialogue too. Hence, both should be similar in terms of schema.

TOD relies on belief state and system act apart from the utterances. In practice, most TOD works are a derivative of or follow the WOZ dataset's style, so it would be better to keep that schema for TOD.

holylovenia · 2024-05-30T04:40:47Z

Hi @patrickamadeus, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @yongzx @SamuelCahyawijaya @sabilmakbar

sabilmakbar · 2024-05-31T10:21:29Z

seacrowd/utils/schemas/chat.py

+    {
+        "id": datasets.Value("string"),
+        "input": datasets.Sequence({
+            "role": datasets.ClassLabel(names=["system", "user", "assistant"]),


Hi @SamuelCahyawijaya @yongzx, just letting you know that the schema changes have been merged to master, but with this role field changed to a string (datasets.Value("string")). There are two reasons: additional/custom roles are possible, and HF would return the index of the label for each example had the field been set as ClassLabel (which is less intuitive than a string).
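The trade-off can be illustrated without the `datasets` library. The sketch below is a plain-Python stand-in, not the library's implementation: it assumes, as described above, that a ClassLabel-style field stores a role as its index in a fixed list of names, while a string field stores the role as is. Both function names are hypothetical.

```python
# Fixed label set, as in the originally proposed ClassLabel(names=[...]).
ROLE_NAMES = ["system", "user", "assistant"]


def encode_role_as_class_label(role):
    """Stand-in for a class-label encoding: store the role's index in ROLE_NAMES."""
    if role not in ROLE_NAMES:
        # A closed label set cannot represent additional/custom roles.
        raise ValueError(f"invalid class label: {role!r}")
    return ROLE_NAMES.index(role)


def encode_role_as_string(role):
    """Stand-in for a plain string field: the role is stored unchanged."""
    return role
```

With the label-style encoding, a reader of an example sees `1` and has to look up that it means user, and a custom role such as tool is rejected outright; with the string encoding both cases stay readable.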

holylovenia · 2024-07-08T06:06:49Z

Hi @patrickamadeus, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @yongzx @SamuelCahyawijaya @sabilmakbar

feat: add Onto4All dataloader

9e6036d

patrickamadeus requested review from holylovenia, SamuelCahyawijaya, sabilmakbar, jamesjaya, yongzx, gentaiscool, ljvmiranda921, jensan-1, danjohnvelasco, MJonibek and tellarin as code owners April 9, 2024 07:44

holylovenia removed request for tellarin, gentaiscool, jamesjaya, SamuelCahyawijaya, ljvmiranda921, holylovenia, MJonibek, danjohnvelasco and sabilmakbar April 27, 2024 15:06

holylovenia assigned yongzx and jensan-1 Apr 27, 2024

holylovenia requested review from SamuelCahyawijaya and removed request for jensan-1 April 27, 2024 15:08

holylovenia assigned SamuelCahyawijaya and unassigned jensan-1 Apr 27, 2024

SamuelCahyawijaya requested changes May 2, 2024

Merge branch 'master' into onto4all

0b97c0b

yongzx reviewed May 5, 2024

seacrowd/sea_datasets/onto4all/onto4all.py

patrickamadeus added 3 commits May 13, 2024 15:11

fix: torubleshooting seacrowd_qa

97c34d6

Merge remote-tracking branch 'upstream/master' into onto4all

d59b656

nitpick

30a7a69

SamuelCahyawijaya added the bonus +2 label May 13, 2024

feat: chat agent schema

efa8d51

patrickamadeus mentioned this pull request May 22, 2024

New schema: Add chat schema #679

Merged

SamuelCahyawijaya added bonus +3 and removed bonus +2 labels May 26, 2024

sabilmakbar reviewed May 31, 2024

github-actions bot added the need-fu-pr label Jun 15, 2024

github-actions bot removed the need-fu-pr label Jul 9, 2024

github-actions bot added the need-fu-pr label Jul 23, 2024
