Datasets #2

Open · rbroc opened this issue May 17, 2023 · 2 comments

rbroc commented May 17, 2023

(All these tasks will probably require model-specific prompt engineering. Consider evaluating outputs, either through external metrics or human validation.)

Number of examples per dataset: cap at 5000 (expand if possible)

Paraphrasing:
MRPC: paraphrases, 5,801 examples

Summarization:
DailyMail / CNN: 300,000 examples (sample 5 for iteration)
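
A minimal sketch of the capping and pilot-sampling step, assuming the HuggingFace `datasets` library and the `cnn_dailymail` dataset on the Hub (variable names are illustrative):

```python
from datasets import load_dataset

cnn = load_dataset("cnn_dailymail", "3.0.0", split="train")

# Cap at 5,000 examples; set 5 aside for quick prompt iteration.
capped = cnn.shuffle(seed=42).select(range(5000))
pilot = capped.select(range(5))

for ex in pilot:
    print(ex["article"][:100], "->", ex["highlights"][:100])
```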

Dialogue:
DailyDialog: multi-turn dialogues (15k).
Approach:

  • Randomly sample the number of turns fed to the model as context.
  • Iteratively pass examples with an increasing number of turns as context (see the sketch below).
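
A minimal sketch of this turn-sampling approach, assuming DailyDialog is loaded via the HuggingFace `datasets` library (the `dialog` field name follows the Hub dataset card; loading details may vary across `datasets` versions):

```python
import random
from datasets import load_dataset

# Newer versions of `datasets` may require trust_remote_code=True here.
dd = load_dataset("daily_dialog", split="train")

def make_context(dialog, rng):
    # Randomly choose how many leading turns to feed as context;
    # the following turn is the human reference the model should produce.
    # Assumes the dialogue has at least two turns.
    turns = dialog["dialog"]
    n_context = rng.randint(1, len(turns) - 1)  # leave at least one turn to predict
    return turns[:n_context], turns[n_context]

rng = random.Random(42)
context, reference = make_context(dd[0], rng)
```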

Socratic Questions (context - smart question (HG) - AI generated (HG))
https://aclanthology.org/2023.eacl-main.12.pdf

Story Generation
GitHub: the fairseq stories example for story generation: https://github.com/facebookresearch/fairseq/blob/main/examples/stories/README.md
Kaggle: Writing Prompts
https://www.kaggle.com/datasets/ratthachat/writing-prompts
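
A minimal sketch for reading the prompt/story pairs, assuming the file layout from the fairseq WritingPrompts release (`train.wp_source` for prompts, `train.wp_target` for stories); paths are illustrative:

```python
def load_writing_prompts(source_path, target_path, n=5000):
    # One prompt/story pair per line, aligned across the two files.
    with open(source_path, encoding="utf-8") as src, open(target_path, encoding="utf-8") as tgt:
        pairs = list(zip(src, tgt))[:n]
    return [(prompt.strip(), story.strip()) for prompt, story in pairs]

pairs = load_writing_prompts("train.wp_source", "train.wp_target")
```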

Additional datasets
GEM: https://aclanthology.org/2021.gem-1.10.pdf

rbroc changed the title from Source datasets for NLG to Datasets on Oct 14, 2024
rbroc self-assigned this on Oct 14, 2024

rbroc commented Oct 23, 2024

I've taken a look at additional datasets. For paraphrase, dialogue generation, and story generation, I think we can run with the datasets we have. In theory, if we wanted to make bigger claims about differences between LLMs and humans on each task, we would need multiple datasets per task. For summarization this is a possibility: we could consider adding https://huggingface.co/datasets/EdinburghNLP/xsum or https://huggingface.co/datasets/Samsung/samsum, because they come with detailed instructions for humans that can also be provided to models.
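
For reference, a minimal sketch of loading both candidates, assuming the HuggingFace `datasets` library (field names per the Hub dataset cards):

```python
from datasets import load_dataset

xsum = load_dataset("EdinburghNLP/xsum", split="train")    # fields: document, summary
samsum = load_dataset("Samsung/samsum", split="train")     # fields: dialogue, summary
```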

Yet, I think we should try to wrap up the project and write it up with what we have at the moment. I am leaving this open for now just in case, but the only action needed might be to use a better prompt for summarization, to simulate the highlight-like behavior. We could also consider adding a few examples from the dataset to illustrate things.


rbroc commented Oct 23, 2024

Regarding prompts, here are a few suggestions for improvements in the final version. For summarization, I would go for something along the lines of: "Summarise the following news article. Provide your summary in the form of highlights." If models do not comply, we can provide an example, adding:

```
Here is an example:

Text: {text_1}
Summary: {summary_1}

Text: {target_text}
Summary:
```
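
A minimal sketch of assembling this prompt, falling back to zero-shot when no example is supplied (`text_1`, `summary_1`, and `target_text` are the placeholders above; the helper name is illustrative):

```python
INSTRUCTION = ("Summarise the following news article. "
               "Provide your summary in the form of highlights.")

def build_prompt(target_text, text_1=None, summary_1=None):
    prompt = INSTRUCTION + "\n\n"
    if text_1 is not None:  # one-shot variant, used only if models do not comply
        prompt += f"Here is an example:\n\nText: {text_1}\nSummary: {summary_1}\n\n"
    return prompt + f"Text: {target_text}\nSummary:"
```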

For stories, I would go for something like: "Write a short story based on this writing prompt."

For paraphrase and dialogue, I think we are good.

rbroc mentioned this issue on Oct 23, 2024