Commit 3b420bf

Authored by Eren Chenyang Zhao, zhaochen20, neubig, and viswavi
Add colab demo (#290)
* add demo.ipynb
* add avatar
* new readme
* new readme
* new link
* Revisions to the colab demo (#293)
* Update instructions for retrieval (#295)
* add instruction to create .env
* fix event loop already running in asyncio
* Add process dataset list (#298)
* add process_generated_and_retrieved_datasets
* change parameters
* change notebook
* add variable names
* lint

---------

Co-authored-by: zhaochen20 <zhaochenyang20@gmail.com>

* Fix some wording and asyncio (#297)
* Fix some wording and asyncio
* Update notebook_demo.ipynb

Co-authored-by: Eren Chenyang Zhao <chenyan3@andrew.cmu.edu>

---------

Co-authored-by: Eren Chenyang Zhao <chenyan3@andrew.cmu.edu>

* Add wrap input (#300)
* add wrap_input
* add wrap_input
* Update notebook_demo.ipynb

Co-authored-by: Graham Neubig <neubig@gmail.com>

---------

Co-authored-by: zhaochen20 <zhaochenyang20@gmail.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>

* Wording modifications to notebook demo (#299)
* Made some modifications to wording
* Revert batch size
* Small modifications
* Rename demo files to prompt2model_demo (#307)
* Make a directory for the dataset retriever
* Squash a few bugs
* Mention A100 GPUs
* Increase executor batch size
* Fix typo
* Fix bug in try it out
* Update prompt2model_demo.ipynb
* Update tests/dataset_processor_test.py

Co-authored-by: Vijay Viswanathan <vijayv@andrew.cmu.edu>

---------

Co-authored-by: zhaochen20 <zhaochenyang20@gmail.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
Co-authored-by: Vijay Viswanathan <vijayv@andrew.cmu.edu>
1 parent dcac753 commit 3b420bf

11 files changed: +914, -56 lines changed

.gitignore

Lines changed: 6 additions & 0 deletions
```diff
@@ -2,6 +2,7 @@ __pycache__
 build
 dist
 prompt2model.egg-info
+.env
 .vscode
 .mypy_cache
 .pytest_cache
@@ -17,6 +18,11 @@ tests/wandb
 cached_generated_dataset/
 generated_dataset/
 huggingface_data/huggingface_datasets/dataset_index.json
+huggingface_data/huggingface_datasets/huggingface_datasets_datafinder_index
 huggingface_data/huggingface_models/
 retrieved_dataset_dict/
 status.yaml
+
+# Outputs generated by the colab demo
+trained_model/
+trained_tokenizer/
```

README.md

Lines changed: 16 additions & 5 deletions
````diff
@@ -1,9 +1,10 @@
-# prompt2model - Generate Deployable Models from Instructions
+# Prompt2Model - Generate Deployable Models from Instructions

 [![PyPI version](https://badge.fury.io/py/prompt2model.svg)](https://badge.fury.io/py/prompt2model)
 ![Github Actions CI tests](https://github.com/neulab/prompt2model/actions/workflows/ci.yml/badge.svg)
 [![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://lbesson.mit-license.org/)
 [![Discord](https://img.shields.io/discord/1144245269001678959)](https://discord.gg/UCy9csEmFc)
+[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neulab/prompt2model/blob/main/prompt2model_demo.ipynb)

 `Prompt2Model` is a system that takes a natural
 language task description (like the prompts used for
@@ -14,11 +15,22 @@ special-purpose model that is conducive for deployment.

 ## Quick Start

+### Notebook
+
+You can run our demo of `Prompt2Model` through a notebook:
+
+- [Open Locally](./prompt2model_demo.ipynb)
+- [Open in Colab](https://colab.research.google.com/github/neulab/prompt2model/blob/main/prompt2model_demo.ipynb)
+
+### Command Line
+
+You can also run through the command line.
+
 ```bash
 pip install prompt2model
 ```

-Our current `prompt2model` implementation uses
+Our current `Prompt2Model` implementation uses
 the OpenAI API. Accordingly, you need to:

 - Sign up on the OpenAI website and obtain an
@@ -36,11 +48,10 @@ export OPENAI_API_KEY=<your key>
 You can then run

 ```bash
-python cli_demo.py
+python prompt2model_demo.py
 ```

-to
-create a small model from a prompt, as shown in
+to create a small model from a prompt, as shown in
 the demo video below. This script must be run on a
 device with an internet connection to access the OpenAI
 API. For best results, run
````

prompt2model/dataset_generator/openai_gpt.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -11,6 +11,7 @@
 from dataclasses import dataclass
 from pathlib import Path

+import nest_asyncio
 import openai
 from datasets import Dataset
 from tqdm import tqdm
@@ -26,6 +27,7 @@
     handle_openai_error,
 )

+nest_asyncio.apply()
 logger = get_formatted_logger("DatasetGenerator")
```
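This patch targets the "fix event loop already running in asyncio" item from the commit message: `asyncio.run()` raises a `RuntimeError` when an event loop is already running, as it is inside Jupyter or Colab. A minimal sketch of the failure mode and the fix; `generate_batch` is an invented stand-in for the generator's async OpenAI calls:

```python
import asyncio

import nest_asyncio


async def generate_batch() -> list[str]:
    """Invented stand-in for the dataset generator's async API calls."""
    await asyncio.sleep(0)
    return ["generated example"]


# In Jupyter/Colab an event loop is already running, so a bare
# asyncio.run() fails with "RuntimeError: asyncio.run() cannot be
# called from a running event loop". nest_asyncio patches the loop
# to permit such nested calls; in a plain script it is harmless.
nest_asyncio.apply()
print(asyncio.run(generate_batch()))
```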

prompt2model/dataset_processor/base.py

Lines changed: 110 additions & 3 deletions
```diff
@@ -26,7 +26,7 @@ def __init__(self, has_encoder: bool, eos_token: str | None = None) -> None:

     @staticmethod
     @abstractmethod
-    def post_process_example(
+    def _post_process_example(
         example: dict,
         instruction: str,
         task_id: int,
@@ -83,13 +83,13 @@ def filter_empty_strings(example: dict) -> bool:
                 "input_col" in example and "output_col" in example
             ), "Example dictionary must have 'input_col' and 'output_col' keys."
             # Check if 'input_col' and 'output_col' are both non-empty strings
-            return bool(example["input_col"]) and bool(example["output_col"])
+            return bool(str(example["input_col"])) and bool(str(example["output_col"]))

         for task_id, dataset_dict in enumerate(dataset_dicts):
             modified_dataset_dict = {}
             for dataset_split in list(dataset_dict.keys()):
                 mapping_function = partial(
-                    self.post_process_example,
+                    self._post_process_example,
                     instruction=instruction,
                     task_id=task_id,
                     has_encoder=self.has_encoder,
@@ -104,3 +104,110 @@ def filter_empty_strings(example: dict) -> bool:
             modified_dataset_dict = datasets.DatasetDict(modified_dataset_dict)
             modified_dataset_dicts.append(modified_dataset_dict)
         return modified_dataset_dicts
+
+    @staticmethod
+    def _split_dataset_into_dataset_dict(
+        dataset,
+        train_proportion: float = 0.8,
+        val_proportion: float = 0.1,
+        maximum_example_num: int | None = None,
+    ) -> datasets.DatasetDict:
+        """Split a given dataset into `train`, `val`, and `test` splits.
+
+        This function takes a dataset and splits it based on specified
+        proportions for train, val and test. It respects a maximum
+        number of examples to be included in each set, if specified.
+
+        Args:
+            dataset: The original dataset to be split.
+            train_proportion: Proportion of examples for the `train` set.
+            val_proportion: Proportion of examples for the `val` set.
+            maximum_example_num: Maximum number of examples
+                to include in each set.
+
+        Returns:
+            datasets.DatasetDict: A dictionary containing the `train`,
+                `val`, and `test` datasets.
+        """
+        num_of_examples = len(dataset)
+        train_num = int(train_proportion * num_of_examples)
+        val_num = int(val_proportion * num_of_examples)
+        test_num = num_of_examples - train_num - val_num
+
+        if maximum_example_num is not None:
+            train_num = min(train_num, maximum_example_num)
+            val_num = min(val_num, maximum_example_num)
+            test_num = min(test_num, maximum_example_num)
+
+        train_dataset = datasets.Dataset.from_dict(dataset[:train_num])
+        val_dataset = datasets.Dataset.from_dict(
+            dataset[train_num : train_num + val_num]
+        )
+        test_dataset = datasets.Dataset.from_dict(
+            dataset[train_num + val_num : train_num + val_num + test_num]
+        )
+
+        dataset_dict = datasets.DatasetDict(
+            {"train": train_dataset, "val": val_dataset, "test": test_dataset}
+        )
+        return dataset_dict
+
+    @staticmethod
+    def wrap_single_input(instruction: str, input: str):
+        """Wrap an input string into text2text fashion to be the model's input.
+
+        Args:
+            instruction: The instruction used as a prefix to explain the task.
+            input: An input string to be wrapped.
+
+        Returns:
+            A wrapped input string.
+        """
+        return f"<task 0>{instruction}\nExample:\n{input}\nLabel:\n"
+
+    def process_dataset_lists(
+        self,
+        instruction: str,
+        dataset_list: list[datasets.Dataset],
+        train_proportion: float = 0.8,
+        val_proportion: float = 0.1,
+        maximum_example_num: int | None = None,
+    ) -> list[datasets.DatasetDict]:
+        """Post-process both the generated and retrieved datasets.
+
+        This function takes in datasets generated by `DatasetGenerator`
+        and retrieved by `DatasetRetriever`. It modifies these datasets
+        based on a given instruction, converting all examples into a
+        text-to-text format.
+
+        Args:
+            instruction: The instruction used as a prefix to explain the task.
+            dataset_list: A list of datasets. It can be either generated by
+                the DatasetGenerator or retrieved by the DatasetRetriever.
+            train_proportion: The proportion of examples used for `train`.
+            val_proportion: The proportion of examples used for `val`.
+            maximum_example_num: The maximum number of examples to
+                be used for `train`, `val` and `test`.
+
+        Returns:
+            list[datasets.DatasetDict]: A list of DatasetDicts, in which all
+                examples are converted into text2text fashion.
+
+        Note:
+            The DatasetRetriever returns a DatasetDict with multiple splits.
+            Any of these splits can be passed into this function.
+            The remaining proportion after allocating to `train` and
+            `val` will be used for the `test` set.
+        """
+        if train_proportion + val_proportion >= 1:
+            raise ValueError(
+                f"train_proportion {train_proportion} + val_proportion {val_proportion} must be less than 1."  # noqa E501
+            )
+
+        dataset_dicts = [
+            self._split_dataset_into_dataset_dict(
+                each, train_proportion, val_proportion, maximum_example_num
+            )
+            for each in dataset_list
+        ]
+        return self.process_dataset_dict(instruction, dataset_dicts)
```
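For orientation, here is a hedged usage sketch of the new `process_dataset_lists` API on a toy dataset; the instruction and column values are invented, and the `input_col`/`output_col` names follow the convention asserted in `filter_empty_strings` above:

```python
import datasets

from prompt2model.dataset_processor import TextualizeProcessor

# Toy dataset using the "input_col"/"output_col" schema from this diff.
raw = datasets.Dataset.from_dict(
    {
        "input_col": [f"question {i}" for i in range(10)],
        "output_col": [f"answer {i}" for i in range(10)],
    }
)

processor = TextualizeProcessor(has_encoder=True)
# With the defaults (train_proportion=0.8, val_proportion=0.1), the ten
# examples split 8/1/1; whatever remains after train and val becomes test.
dataset_dicts = processor.process_dataset_lists(
    instruction="Answer the question.",
    dataset_list=[raw],
)
print(dataset_dicts[0])  # A DatasetDict with train/val/test splits
```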

prompt2model/dataset_processor/mock.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -24,7 +24,7 @@ def process_dataset_dict(
         return dataset_dicts

     @staticmethod
-    def post_process_example(
+    def _post_process_example(
         example: dict,
         instruction: str,
         task_id: int,
```

prompt2model/dataset_processor/textualize.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -41,7 +41,7 @@ def __init__(self, has_encoder: bool, eos_token: str | None = None) -> None:
         )

     @staticmethod
-    def post_process_example(
+    def _post_process_example(
         example: dict,
         instruction: str,
         task_id: int,
```

prompt2model/demo_creator/create.py

Lines changed: 7 additions & 1 deletion
```diff
@@ -3,6 +3,7 @@
 import gradio as gr
 import mdtex2html

+from prompt2model.dataset_processor import TextualizeProcessor
 from prompt2model.model_executor import GenerationModelExecutor
 from prompt2model.prompt_parser import OpenAIInstructionParser

@@ -35,7 +36,12 @@ def postprocess(self, y):

     gr.Chatbot.postprocess = postprocess

-    def response(message):
+    def response(message: str):
+        if not message.startswith("<task 0>"):
+            dataset_processor = TextualizeProcessor(has_encoder=True)
+            message = dataset_processor.wrap_single_input(
+                prompt_parser.instruction, message
+            )
         response = model_executor.make_single_prediction(message)
         prediction = response.prediction
         return prediction
```
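The `<task 0>` check in `response` relies on the wrapping format that `wrap_single_input` (added in `base.py` above) produces. A small illustration, with an invented instruction and user input:

```python
from prompt2model.dataset_processor import TextualizeProcessor

processor = TextualizeProcessor(has_encoder=True)
wrapped = processor.wrap_single_input(
    "Answer the question.",  # invented instruction
    "What is the capital of France?",  # invented user input
)
print(wrapped)
# <task 0>Answer the question.
# Example:
# What is the capital of France?
# Label:
```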
