Closes #526 | Add dataloader for WikiHow-GOSC #674
"title": datasets.Value("string"),
"category": datasets.Value("string"),
"sections": datasets.Sequence({"section": datasets.Value("string"), "steps": datasets.Sequence(datasets.Value("string")), "ordered": datasets.Value("int32")}),
"ordered": datasets.Value("int32"),
Hi @elyanah-aco, I checked this data, and some of the examples don't have a `sections` key (probably single-section data).
This is an example of the key list of the data (note: the index is one-based since I enumerate it from 1).
Do you mind adjusting `_generate_examples` to cater to such cases? I'm thinking of transforming such data into a `sections` field holding a single-element list (let me know if you have a better workaround). Thx!
@sabilmakbar I'm thinking like this:
"sections": [{"section": "", "steps": ["Mulailah membuat situs web.", "Gunakan Twitter."], "ordered": 1}],
"ordered": 1
where `ordered` inside and outside `sections` are the same. `ordered` inside `sections` indicates whether the steps inside that section are ordered, and `ordered` outside indicates whether the sections themselves are ordered. They should be the same if there's just one section.
I think the outer `ordered` should be set to 1 even if the inner section's `ordered` value is 0 (the one coming from the actual data), because `sections` is a single-element list anyway, so the outer value just says that the contents of `sections` are ordered. The `ordered` in the inner section (the singleton `sections` entry) will follow the original `ordered` value from the actual example. Wdyt?
If we take the example from the previous screenshot (but with an `ordered` value of 0 instead of 1), we can construct it like this:
# the `ordered` value in the sections list indicates the actual `ordered` state of the `steps`
"sections": [{"section": "", "steps": ["Mulailah membuat situs web.", "Gunakan Twitter."], "ordered": 0}],
"ordered": 1  # indicates that the `sections` list itself is ordered (always 1 for a singleton list)
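A minimal sketch of that normalization, to make the proposal concrete. It assumes that single-section examples carry `steps` and `ordered` at the top level (the raw keys aren't shown in this thread), and `normalize_sections` is a hypothetical helper name, not code from this PR:

```python
from typing import Any, Dict


def normalize_sections(example: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `example` that always has a `sections` list.

    Assumption: an example without `sections` stores its `steps` and
    `ordered` values at the top level.
    """
    if "sections" in example:
        # Multi-section example: nothing to restructure.
        return dict(example)

    # Keep every field except the two that move into the single section.
    normalized = {k: v for k, v in example.items() if k not in ("steps", "ordered")}
    normalized["sections"] = [
        {
            "section": "",  # a single-section example has no section title
            "steps": list(example.get("steps", [])),
            # inner `ordered` keeps the value from the actual data
            "ordered": int(example.get("ordered", 0)),
        }
    ]
    # outer `ordered` is always 1 for a singleton `sections` list
    normalized["ordered"] = 1
    return normalized
```

Something like this could be called on each raw record inside `_generate_examples` before yielding, so both kinds of examples fit the same `sections` feature.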
Yup you're correct. Will work on this tomorrow
lgtm!
LGTM! will merge in a few :)
Closes #526.
Notes:
- `gdown` is used to download from GDrive; see "Closes #206 | Add Dataloader SleukRith Set" (#556) and "Closes #531 | Add Dataloader Multilingual-ALPACA" (#566) for similar implementations.
- The `source` schema was implemented after discussing with Holy.

Checkbox
- `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within `{my_dataset}` folder.
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- `datasets.load_dataset` function.
- `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.