[FEATURE] basic use of pipeline to generate SFT dataset from documents #1076

Open
wants to merge 11 commits into
base: develop

burtenshaw
Contributor

@burtenshaw burtenshaw commented Dec 2, 2024

This is a continuation of this: #1059

It implements a pipeline template that runs a SelfInstruct step followed by text generation over a dataset of documents. This should help new users bootstrap SFT datasets.

from datasets import Dataset
import wikipedia
from distilabel.pipeline import DatasetInstructionResponsePipeline

pipeline = DatasetInstructionResponsePipeline(num_instructions=5)

distiset = pipeline.run(
    use_cache=False,
    dataset=Dataset.from_list(
        [
            {
                "input": wikipedia.page(title="Transfer_learning").content,
            }
        ]
    ),
)
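
To make the data flow concrete, here is a minimal offline sketch of the two stages described above: instructions are generated from each document, then each instruction is answered. It is not the distilabel implementation. `generate_instructions`, `build_sft_rows`, and `fake_llm` are hypothetical names, and the output column names are an assumption.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
LLM = Callable[[str], str]


def generate_instructions(document: str, num_instructions: int, llm: LLM) -> List[str]:
    """Stage 1 (SelfInstruct-like): ask the model for instructions grounded in the document."""
    prompt = (
        f"Write {num_instructions} user instructions, one per line, "
        f"that can be answered from this text:\n{document}"
    )
    lines = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return lines[:num_instructions]


def build_sft_rows(
    documents: List[Dict[str, str]], num_instructions: int, llm: LLM
) -> List[Dict[str, str]]:
    """Stage 2 (text generation): answer every instruction, one row per pair."""
    rows = []
    for doc in documents:
        for instruction in generate_instructions(doc["input"], num_instructions, llm):
            rows.append({"instruction": instruction, "response": llm(instruction)})
    return rows


def fake_llm(prompt: str) -> str:
    """Deterministic stub so the sketch runs offline; swap in a real model client."""
    if prompt.startswith("Write"):
        return "What is transfer learning?\nWhy does it help with small datasets?"
    return f"Answer to: {prompt}"


rows = build_sft_rows(
    [{"input": "Transfer learning reuses a model trained on one task for another."}],
    num_instructions=2,
    llm=fake_llm,
)
```

Each input document yields up to `num_instructions` rows, so the size of the resulting SFT dataset scales with both the number of documents and that parameter.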


github-actions bot commented Dec 2, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1076/


codspeed-hq bot commented Dec 2, 2024

CodSpeed Performance Report

Merging #1076 will improve performance by ×6

Comparing feat/dataset-instruction-response-pipeline (a0c23f6) with develop (a3320ed)

Summary

⚡ 1 improvements

Benchmarks breakdown

| Benchmark | develop | feat/dataset-instruction-response-pipeline | Change |
| --- | --- | --- | --- |
| test_cache_time | 3,354.8 ms | 557.8 ms | ×6 |
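
For reference, the ×6 figure is the ratio of the two timings reported for `test_cache_time`. The short calculation below only reproduces that arithmetic from the numbers in the report; the variable names are illustrative.

```python
# Timings reported by CodSpeed for test_cache_time, in milliseconds.
develop_ms = 3354.8
branch_ms = 557.8

# Speedup factor: how many times faster the branch is than develop.
speedup = develop_ms / branch_ms

# Fraction of the original run time that was saved.
time_saved = 1 - branch_ms / develop_ms

print(f"speedup: x{speedup:.2f}, time saved: {time_saved:.1%}")
```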

@burtenshaw burtenshaw marked this pull request as draft December 2, 2024 14:16
@davidberenstein1957
Member

davidberenstein1957 commented Dec 10, 2024

@burtenshaw can we get rid of the pipeline.pipeline.run call? Also, perhaps we could limit how many different classes we expose, with something like the following. Under the hood it can still use the same implementation; we would just select the behavior through different arguments. WDYT?

from datasets import Dataset
import wikipedia
from distilabel.pipeline import InstructionResponsePipeline

pipeline = InstructionResponsePipeline(num_instructions=5)

distiset = pipeline.pipeline.run(
    use_cache=False,
    dataset=Dataset.from_list(
        [
            {
                "input": wikipedia.page(title="Transfer_learning").content,
            }
        ]
    ),
)
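
One way to read this suggestion is a single public class whose `run` method dispatches on its arguments, so callers never touch an inner `pipeline` attribute. The sketch below is a hypothetical illustration of that pattern in plain Python, not the code in this PR: the class name, the private helpers, and the string return values are all assumptions standing in for real pipeline runs.

```python
from typing import Dict, List, Optional


class InstructionResponsePipelineSketch:
    """Hypothetical facade: one public class, mode chosen by the arguments to run()."""

    def __init__(self, num_instructions: int = 5) -> None:
        self.num_instructions = num_instructions

    def run(
        self,
        dataset: Optional[List[Dict[str, str]]] = None,
        use_cache: bool = True,
    ) -> str:
        # Dispatch on the arguments instead of exposing a second class
        # or requiring callers to reach into `pipeline.pipeline`.
        if dataset is not None:
            return self._run_from_documents(dataset, use_cache)
        return self._run_from_scratch(use_cache)

    def _run_from_documents(self, dataset: List[Dict[str, str]], use_cache: bool) -> str:
        # Placeholder for the document-seeded path (SelfInstruct + text generation).
        return f"from_documents: {len(dataset)} docs x {self.num_instructions} instructions"

    def _run_from_scratch(self, use_cache: bool) -> str:
        # Placeholder for a path that needs no seed documents.
        return f"from_scratch: {self.num_instructions} instructions"


sketch = InstructionResponsePipelineSketch(num_instructions=5)
with_docs = sketch.run(dataset=[{"input": "some document text"}], use_cache=False)
without_docs = sketch.run(use_cache=False)
```

The trade-off is that a single class keeps the public surface small, while separate, explicitly named classes make each mode easier to discover.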

@burtenshaw burtenshaw marked this pull request as ready for review December 16, 2024 12:26
@davidberenstein1957
Member

@burtenshaw I think it would be worth having a dedicated section on this somewhere in the docs. After that, and once the tests are resolved, we should be able to merge.

I would add it to the quickstart and perhaps to the components gallery under "pipelines", or under something more explicit like "ready-to-go pipelines".

@burtenshaw
Contributor Author

Thanks. I agree with those suggestions. I'll work on this next week.

@davidberenstein1957
Member

@burtenshaw perhaps you can add the documentation in this PR?

@davidberenstein1957
Member

Also, I think I would prefer more explicit naming, such as InstructionResponseFromDataPipeline or InstructionResponseFromSeedDataPipeline.

2 participants