Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #536 | Add/Update Dataloader Onto4All #635

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

patrickamadeus
Copy link
Collaborator

Closes #536

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
  • [.] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

TESTS

image

NOTES

Please use huggingface-cli or insert your API_KEY to token= parameter in load_dataset method since this is a gated dataset 😄

Copy link
Collaborator

@SamuelCahyawijaya SamuelCahyawijaya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @patrickamadeus, thank you for contributing! I found two problems in this dataloader:

  1. Somehow the conversation data is empty on both _source and _seacrowd_* schemas
Screenshot 2024-05-02 at 8 45 56 AM
  1. For the _seacrowd_* schema, this is the first time we use QA for chat data. It seems to fit well with the data, perhaps @holylovenia have any feedback on this?

@patrickamadeus
Copy link
Collaborator Author

Hi @patrickamadeus, thank you for contributing! I found two problems in this dataloader:

  1. Somehow the conversation data is empty on both _source and _seacrowd_* schemas
Screenshot 2024-05-02 at 8 45 56 AM 2. For the `_seacrowd_*` schema, this is the first time we use QA for chat data. It seems to fit well with the data, perhaps @holylovenia have any feedback on this?

Hi @SamuelCahyawijaya ! Thank you for the review 😄 .

Please kindly check the latest commit for the fix.

@holylovenia
Copy link
Contributor

  1. For the _seacrowd_* schema, this is the first time we use QA for chat data. It seems to fit well with the data, perhaps @holylovenia have any feedback on this?

Sorry for the late reply, I missed this mention. Is this supposedly to accommodate a multi-turn chat template with the user, assistant, and system roles?

While this qa schema seems to fit the dataset well, I think it's better if this dataloader has a different task (e.g., MULTI_TURN_CONVERSATION) with a new schema (with a messages variable like this) to facilitate similar datasets in the future. It will prevent this dataset from being overlooked as another QA task too.

What do you think, @patrickamadeus @SamuelCahyawijaya @yongzx?

cc: @sabilmakbar

@SamuelCahyawijaya
Copy link
Collaborator

SamuelCahyawijaya commented May 13, 2024

@holylovenia @patrickamadeus @yongzx @sabilmakbar @patrickamadeus : I kinda agree with the chat format as it is more standardized and also supported in the HuggingFace. In this case, should we propose the new schema and adjust the score accordingly?

the schema would be basically consists of input, output, and meta.

  • input would be in a form of list of dictionary {"role": "<ROLE>", "content": "<CONTENT>" }
  • output would be the expected response of the model, in this case it would be the last turn of conversation from gpt
  • meta can be used for storing other information, like type in this case.

One question though, should we also normalize the <ROLE>? Like in this dataset, it use system, human, and gpt. Should it be standardized into something like system, user, and assistant or we keep it as is?

@holylovenia
Copy link
Contributor

should we propose the new schema and adjust the score accordingly?

I'm of this opinion.

the schema would be basically consists of input, output, and meta.

  • input would be in a form of list of dictionary {"role": "<ROLE>", "content": "<CONTENT>" }
  • output would be the expected response of the model, in this case it would be the last turn of conversation from gpt
  • meta can be used for storing other information, like type in this case.

One question though, should we also normalize the <ROLE>? Like in this dataset, it use system, human, and gpt. Should it be standardized into something like system, user, and assistant or we keep it as is?

Let's normalize it for the seacrowd schema.

@SamuelCahyawijaya
Copy link
Collaborator

@patrickamadeus : would it be ok for you to create the new schema, and adjust the dataloader accordingly?

@patrickamadeus
Copy link
Collaborator Author

With pleasure @SamuelCahyawijaya !

@holylovenia @patrickamadeus @yongzx @sabilmakbar @patrickamadeus : I kinda agree with the chat format as it is more standardized and also supported in the HuggingFace. In this case, should we propose the new schema and adjust the score accordingly?

the schema would be basically consists of input, output, and meta.

  • input would be in a form of list of dictionary {"role": "<ROLE>", "content": "<CONTENT>" }
  • output would be the expected response of the model, in this case it would be the last turn of conversation from gpt
  • meta can be used for storing other information, like type in this case.

One question though, should we also normalize the <ROLE>? Like in this dataset, it use system, human, and gpt. Should it be standardized into something like system, user, and assistant or we keep it as is?

I will refer the schema from here for now.

@patrickamadeus
Copy link
Collaborator Author

Hi @SamuelCahyawijaya @holylovenia !

Could you please review the new schema and implementation? I named it chat feature for now, feel free to suggest any change!

@holylovenia
Copy link
Contributor

Hi @SamuelCahyawijaya @holylovenia !

Could you please review the new schema and implementation? I named it chat feature for now, feel free to suggest any change!

The schema looks great to me! Let us know if you've separated the schema and new task so we can approve it.

@patrickamadeus
Copy link
Collaborator Author

It's done @SamuelCahyawijaya @holylovenia .

@holylovenia
Copy link
Contributor

It's done @SamuelCahyawijaya @holylovenia .

Could you please link the PR for the new schema and task here, @patrickamadeus?

cc: @sabilmakbar because I'll put more focus on the experiments going forward.

@patrickamadeus
Copy link
Collaborator Author

Oops, sorry!

I put them altogether here in the last commit 😬 @holylovenia .

Should I create a separate PR for it? Sorry for my ignorance.

@holylovenia
Copy link
Contributor

Oops, sorry!

I put them altogether here in the last commit 😬 @holylovenia .

Should I create a separate PR for it? Sorry for my ignorance.

Yes, it'd be great if we could have a separate PR. Thanks in advance, @patrickamadeus!!

@patrickamadeus
Copy link
Collaborator Author

Hi! here is the chat schema PR #679 @sabilmakbar @holylovenia

@sabilmakbar
Copy link
Collaborator

Quick question: What's the difference between using this new chat schema and TOD (since we already have it)? If I remember correctly, TOD is a multi-turn dialogue too. Hence, both should be similar in terms of schema.

@holylovenia
Copy link
Contributor

Quick question: What's the difference between using this new chat schema and TOD (since we already have it)? If I remember correctly, TOD is a multi-turn dialogue too. Hence, both should be similar in terms of schema.

TOD relies on belief state and system act apart from the utterances. In practice, most TOD works are a derivative of or follow the WOZ dataset's style, so it would be better to keep that schema for TOD.

@holylovenia
Copy link
Contributor

Hi @patrickamadeus, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @yongzx @SamuelCahyawijaya @sabilmakbar

{
"id": datasets.Value("string"),
"input": datasets.Sequence({
"role": datasets.ClassLabel(names=["system", "user", "assistant"]),
Copy link
Collaborator

@sabilmakbar sabilmakbar May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @SamuelCahyawijaya @yongzx just letting you know the changes on schema has been merged to master, but with this role field being changed to string (datasets.Value("string")) due to possibilities of additional/custom roles and HF mechanics that return an indices of the label for their examples had it been set as ClassLabel (which is less intuitive than string)

@holylovenia
Copy link
Contributor

Hi @patrickamadeus, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @yongzx @SamuelCahyawijaya @sabilmakbar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create dataset loader for Onto4All
6 participants