Datasets #2

Open · rbroc opened this issue May 17, 2023 · 2 comments

rbroc commented May 17, 2023

(All these tasks will probably require model-specific prompt engineering. Consider evaluating outputs, either through external metrics or human validation.)

Number of examples per dataset: cap at 5000 (expand if possible)

Paraphrasing:
MRPC: paraphrases, 5,801 examples

Summarization:
DailyMail / CNN: 300,000 examples (sample 5 for iteration)
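
A minimal sketch of the capping and pilot-sampling step, assuming the HuggingFace `datasets` library and the `cnn_dailymail` dataset on the Hub (variable names are illustrative):

```python
from datasets import load_dataset

cnn = load_dataset("cnn_dailymail", "3.0.0", split="train")

# Cap at 5,000 examples; set 5 aside for quick prompt iteration.
capped = cnn.shuffle(seed=42).select(range(5000))
pilot = capped.select(range(5))

for ex in pilot:
    print(ex["article"][:100], "->", ex["highlights"][:100])
```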

Dialogue:
DailyDialog: multi-turn dialogues (15k).
Approach:

  • Randomly sample the number of turns fed to the model as context.
  • Iteratively pass examples with an increasing number of turns as context (see the sketch below).
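
A minimal sketch of this turn-sampling approach, assuming DailyDialog is loaded via the HuggingFace `datasets` library (the `dialog` field name follows the Hub dataset card; loading details may vary across `datasets` versions):

```python
import random
from datasets import load_dataset

# Newer versions of `datasets` may require trust_remote_code=True here.
dd = load_dataset("daily_dialog", split="train")

def make_context(dialog, rng):
    # Randomly choose how many leading turns to feed as context;
    # the following turn is the human reference the model should produce.
    # Assumes the dialogue has at least two turns.
    turns = dialog["dialog"]
    n_context = rng.randint(1, len(turns) - 1)  # leave at least one turn to predict
    return turns[:n_context], turns[n_context]

rng = random.Random(42)
context, reference = make_context(dd[0], rng)
```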

Socratic Questions (context - smart question (HG) - AI generated (HG))
https://aclanthology.org/2023.eacl-main.12.pdf

Story Generation
GitHub: the fairseq stories example for story generation: https://github.com/facebookresearch/fairseq/blob/main/examples/stories/README.md
Kaggle: Writing Prompts
https://www.kaggle.com/datasets/ratthachat/writing-prompts
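
A minimal sketch for reading the prompt/story pairs, assuming the file layout from the fairseq WritingPrompts release (`train.wp_source` for prompts, `train.wp_target` for stories); paths are illustrative:

```python
def load_writing_prompts(source_path, target_path, n=5000):
    # One prompt/story pair per line, aligned across the two files.
    with open(source_path, encoding="utf-8") as src, open(target_path, encoding="utf-8") as tgt:
        pairs = list(zip(src, tgt))[:n]
    return [(prompt.strip(), story.strip()) for prompt, story in pairs]

pairs = load_writing_prompts("train.wp_source", "train.wp_target")
```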

Additional datasets
GEM: https://aclanthology.org/2021.gem-1.10.pdf

rbroc changed the title from Source datasets for NLG to Datasets on Oct 14, 2024
rbroc self-assigned this on Oct 14, 2024

rbroc commented Oct 23, 2024

I've taken a look at additional datasets. For paraphrase, dialogue generation, and story generation, I think we can run with the datasets we have. In theory, if we wanted to make bigger claims about differences between LLMs and humans on each task, we would need multiple datasets per task. For summarization this is a possibility: we could consider adding https://huggingface.co/datasets/EdinburghNLP/xsum or https://huggingface.co/datasets/Samsung/samsum, because they come with detailed instructions for humans that can also be provided to models.
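
For reference, a minimal sketch of loading both candidates, assuming the HuggingFace `datasets` library (field names per the Hub dataset cards):

```python
from datasets import load_dataset

xsum = load_dataset("EdinburghNLP/xsum", split="train")    # fields: document, summary
samsum = load_dataset("Samsung/samsum", split="train")     # fields: dialogue, summary
```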

Yet, I think we should try to wrap up the project and write it up with what we have at the moment. I am leaving this open for now just in case, but the only action needed might be to use a better prompt for summarization, to simulate the highlight-like behavior. We could also consider adding a few examples from the dataset to illustrate things.


rbroc commented Oct 23, 2024

Regarding prompts, here are a few suggestions for improvements in the final version. For summarization, I would go for something along the lines of: "Summarise the following news article. Provide your summary in the form of highlights." If models do not comply, we can provide an example, adding:

```
Here is an example:

Text: {text_1}
Summary: {summary_1}

Text: {target_text}
Summary:
```
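
A minimal sketch of assembling this prompt, falling back to zero-shot when no example is supplied (`text_1`, `summary_1`, and `target_text` are the placeholders above; the helper name is illustrative):

```python
INSTRUCTION = ("Summarise the following news article. "
               "Provide your summary in the form of highlights.")

def build_prompt(target_text, text_1=None, summary_1=None):
    prompt = INSTRUCTION + "\n\n"
    if text_1 is not None:  # one-shot variant, used only if models do not comply
        prompt += f"Here is an example:\n\nText: {text_1}\nSummary: {summary_1}\n\n"
    return prompt + f"Text: {target_text}\nSummary:"
```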

For stories, I would go for something like: "Write a short story based on this writing prompt."

For paraphrase and dialogue, I think we are good.

rbroc mentioned this issue on Oct 23, 2024