[FEATURE] basic use of pipeline to generate SFT dataset from documents #1076
for more information, see https://pre-commit.ci
Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1076/
CodSpeed Performance Report: merging #1076 will improve performance by ×6.
@burtenshaw can we get rid of the

    from datasets import Dataset
    import wikipedia
    from distilabel.pipeline import InstructionResponsePipeline

    pipeline = InstructionResponsePipeline(num_instructions=5)
    distiset = pipeline.pipeline.run(
        use_cache=False,
        dataset=Dataset.from_list(
            [
                {
                    "input": wikipedia.page(title="Transfer_learning").content,
                }
            ]
        ),
    )
…github.com/argilla-io/distilabel into feat/dataset-instruction-response-pipeline
@burtenshaw I think it would be worth having a dedicated section on this somewhere in the docs. After that and resolving the tests, we should be able to merge. I would add it to the quickstart and perhaps to the components gallery under "pipelines", or something more explicit like "ready-to-go pipelines".
Thanks. I agree with those suggestions. I'll work on this next week.
@burtenshaw perhaps you can add the documentation in this PR?
Also, perhaps I would like some more explicit naming like
This is a continuation of this: #1059
It implements a pipeline abstraction template that runs a SelfInstruct step followed by text generation on a dataset of documents. This should help bootstrap basic users who want to build SFT datasets.
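To make the two-stage flow concrete, here is a minimal sketch in plain Python. It is not the distilabel implementation: `build_sft_rows`, `fake_instructions`, and `fake_response` are hypothetical names, the stand-in generators replace real LLM calls, and the `instruction` and `response` column names are assumptions (only `input` appears in the example above).

```python
from typing import Callable, Dict, List


def build_sft_rows(
    documents: List[str],
    generate_instructions: Callable[[str, int], List[str]],
    generate_response: Callable[[str], str],
    num_instructions: int = 5,
) -> List[Dict[str, str]]:
    """Hypothetical helper: turn raw documents into instruction/response rows."""
    rows: List[Dict[str, str]] = []
    for document in documents:
        # Stage 1 (SelfInstruct-like): derive instructions from the document.
        for instruction in generate_instructions(document, num_instructions):
            # Stage 2 (text generation): produce a response for each instruction.
            rows.append(
                {
                    "input": document,
                    "instruction": instruction,
                    "response": generate_response(instruction),
                }
            )
    return rows


# Stand-in generators so the sketch runs without a model.
# In practice these would call an LLM; they are assumptions for illustration.
def fake_instructions(document: str, n: int) -> List[str]:
    return [f"Question {i + 1} about: {document[:20]}" for i in range(n)]


def fake_response(instruction: str) -> str:
    return f"Answer to '{instruction}'"


rows = build_sft_rows(
    ["Transfer learning reuses a trained model."],
    fake_instructions,
    fake_response,
    num_instructions=2,
)
```

Each document yields `num_instructions` rows, so the single document here produces two rows that could be loaded into a dataset for SFT.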