sbmaruf/project instruct data using psrc #1
base: main
@AmrMKayid Here is the complete PR.
Closed it by accident!
data/project_from_psrc.py
Outdated
@@ -0,0 +1,227 @@
import os
I'm curious why the name is `project_from_psrc.py`? 👀
Project from promptsource. If it's not clear, we can rename it to `project_from_promptsource.py`. But I like the `psrc` short form. :D
I think `project_from_promptsource.py` is better, let's change it please.
Done! :)
Co-authored-by: Amr Kayid <amrmkayid@gmail.com>
…:for-ai/instruct-multilingual into sbmaruf/project_instruct_data_using_psrc
data/project_from_psrc.py
Outdated
    executor.map(
        export_dataset_func,
        [
            prompted_sample_gen_io[0]
            for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
        ],  # dataset_output_dir
        [
            prompted_sample_gen_io[1]
            for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
        ],  # dataset_name_or_path
        [
            prompted_sample_gen_io[2]
            for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
        ],  # dataset_config
        [
            prompted_sample_gen_io[3]
            for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
        ],  # psrc_prompt_template_signature
        [
            prompted_sample_gen_io[4]
            for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
        ],  # prompt_template
        [
            prompted_sample_gen_io[5]
            for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
        ],  # dataset
        [
            prompted_sample_gen_io[6]
            for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
        ],  # args.add_source_metadata
        [
            prompted_sample_gen_io[7]
            for prompted_sample_gen_io in prompted_sample_gen_io_tuple_list
        ],  # args.highlight_variables
    ),
    total=len(args.dataset_name_or_paths),
):
can we please change this? 🥺
    executor.map(export_dataset_func, zip(*prompted_sample_gen_io_tuple_list)),
    total=len(args.dataset_name_or_paths),
):
I didn't want to write it with `map` because it was difficult to debug.
I think it should be `executor.map(export_dataset_func, *zip(*prompted_sample_gen_io_tuple_list))`.
Can you recheck?
Already updated that in the code.
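To illustrate the point about unpacking, here is a minimal, self-contained sketch (not the PR's code). `export_row` and `rows` are hypothetical stand-ins for `export_dataset_func` and `prompted_sample_gen_io_tuple_list`. `executor.map` expects one iterable per positional argument, so the list of argument tuples has to be transposed with `zip(*rows)` and then unpacked with a leading `*`.

```python
from concurrent.futures import ThreadPoolExecutor


def export_row(output_dir: str, name: str, config: str) -> str:
    """Hypothetical stand-in for export_dataset_func (takes 3 positional args)."""
    return f"{output_dir}/{name}_{config}.jsonl"


# Hypothetical stand-in for prompted_sample_gen_io_tuple_list:
# one tuple of arguments per task.
rows = [
    ("out", "glue", "cola"),
    ("out", "glue", "sst2"),
]

with ThreadPoolExecutor(max_workers=2) as executor:
    # zip(*rows) transposes the rows into one column per argument;
    # the leading * passes each column to map() as a separate iterable.
    paths = list(executor.map(export_row, *zip(*rows)))

    # Without the leading *, map() receives a single iterable of columns and
    # calls export_row(column) with one argument, which raises TypeError
    # once the results are consumed.
    try:
        list(executor.map(export_row, zip(*rows)))
        single_iterable_failed = False
    except TypeError:
        single_iterable_failed = True
```

With the leading `*`, each call receives the arguments of exactly one tuple, which matches what the eight explicit list comprehensions in the original code do.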
data/project_from_psrc.py
Outdated
def xp3_export_dataset(
    dataset_output_dir: str,
    dataset_name: str,
    dataset_config: str,
    psrc_prompt_template_signature: str,
    prompt_template: Type[Template],
    dataset: Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset],
    add_source_metadata: bool = False,
    highlight_variables: bool = False,
    lang: str = 'en',
) -> str:
    """
    Given a `hf-dataset` (arg: dataset) and a prompt template (arg: prompt_template),
    project/transform samples from all the splits of the dataset (arg: dataset) into an instruction format and
    write them to disk (arg: dataset_output_dir).

    Args:
        dataset_output_dir (str): Path to the output directory where data will be saved.
        dataset_name (str): Name of the hf-dataset.
        dataset_config (str): Name of the hf-dataset config.
        psrc_prompt_template_signature (str): Name of the dataset & dataset-config for which the prompts are written.
        prompt_template (Type[Template]): Transformation/projection module that will take a sample from arg:dataset and transform it to an instruction.
        dataset (Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]): huggingface dataset that will be transformed into an instruction dataset.
        add_source_metadata (bool = False): If True, all the data columns from arg:dataset will be saved as meta information with the instruction dataset.
        highlight_variables (bool = False): If True, prompt tokens and dataset tokens will be highlighted differently. This metadata will be saved as `highlighted_source` & `highlighted_target`.
        lang (str = 'en'): Language name of the dataset.
    """
@sbmaruf what is the difference between this method and `export_dataset`? I can see that both are very similar.
At first, I wrote `export_dataset`, where I exported all possible metadata while projecting data with templates. `xp3_export_dataset` doesn't export all possible metadata and strictly follows the xP3 format. Please note that the xP3 projection doesn't contain any metadata.
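As a rough illustration of that difference (a sketch, not the PR's implementation): the helper `to_xp3_record` and the record below are hypothetical, and the sketch assumes that an xP3-style record carries only an `inputs` and a `targets` field.

```python
def to_xp3_record(full_record: dict) -> dict:
    """Keep only the prompted input and target; drop every metadata key."""
    # Assumption: xP3-style records consist of "inputs" and "targets" only.
    return {"inputs": full_record["inputs"], "targets": full_record["targets"]}


# Hypothetical record carrying metadata next to the prompted text.
full_record = {
    "id": 0,
    "prompt_name": "editing",
    "inputs": "Is this sentence acceptable? The book was read by me.",
    "targets": "yes",
    "src_meta_label": 1,
}

xp3_record = to_xp3_record(full_record)
```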
Left a few comments for code quality improvements, otherwise LGTM. Thanks @sbmaruf!
Feature
A sample script to run the code.
Output folder structure
Output folder of the above run.
tree $SRC_DATA_FOLDER/glue/
`train`, `validation`, `test` (huggingface datasets split name).
`glue_cola.editing.jsonl` means the dataset is "glue", the dataset config is "cola", and the prompt name is "editing".
… (`glue_cola.editing.jsonl`) is a prompted sample.
Output Format
A sample `json` data in the `jsonl` file,
The definition of each of the keys in the data,
… `jsonl` file contains `json` data which has a unique id within the `jsonl` file. (datatype: string/int)
… `psrc_prompt_template_signature` there could be many prompt templates. `prompt_name` refers to each of those prompt templates. (datatype: string)
… `sentence`. We save those data here. (datatype: from huggingface data source)
… `label`. We save those data here. (datatype: from huggingface data source)
Note: Different datasets may have a different number of "src_meta_*" keys. It depends on the original huggingface dataset columns.
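The naming scheme and the `src_meta_*` note can be sketched as follows. This is only an illustration under stated assumptions, not the script from the PR: the helpers `output_file_name` and `with_source_metadata` are hypothetical, the example row is made up, and the sketch assumes that each source column is stored under its name prefixed with `src_meta_`.

```python
import json
import tempfile
from pathlib import Path


def output_file_name(dataset_name: str, dataset_config: str, prompt_name: str) -> str:
    """Build a name like glue_cola.editing.jsonl from its three parts."""
    return f"{dataset_name}_{dataset_config}.{prompt_name}.jsonl"


def with_source_metadata(record: dict, source_row: dict) -> dict:
    """Copy every source column into the record under a src_meta_ prefix."""
    merged = dict(record)
    for column, value in source_row.items():
        merged[f"src_meta_{column}"] = value  # assumed key naming
    return merged


# Made-up source row with the two columns mentioned above.
source_row = {"sentence": "The book was read by me.", "label": 1}
record = with_source_metadata({"id": 0, "prompt_name": "editing"}, source_row)

# Write one JSON object per line into a split folder, then read it back.
out_dir = Path(tempfile.mkdtemp()) / "train"
out_dir.mkdir(parents=True)
out_file = out_dir / output_file_name("glue", "cola", "editing")
out_file.write_text(json.dumps(record) + "\n", encoding="utf-8")

loaded = [
    json.loads(line)
    for line in out_file.read_text(encoding="utf-8").splitlines()
]
```

Because the number of `src_meta_*` keys follows the source columns, a reader of these files should not rely on a fixed set of keys across datasets.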