Image Language Models and ImageGeneration task (#1060)
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Jan 15, 2025
1 parent e866345 commit 5257600
Showing 164 changed files with 2,398 additions and 632 deletions.
10 changes: 10 additions & 0 deletions docs/api/models/image_generation/image_generation_gallery.md
@@ -0,0 +1,10 @@
# ImageGenerationModel Gallery

This section contains the existing [`ImageGenerationModel`][distilabel.models.image_generation] subclasses implemented in `distilabel`.

::: distilabel.models.image_generation
options:
filters:
- "!^ImageGenerationModel$"
- "!^AsyncImageGenerationModel$"
- "!typing"
7 changes: 7 additions & 0 deletions docs/api/models/image_generation/index.md
@@ -0,0 +1,7 @@
# ImageGenerationModel

This section contains the API reference for the `distilabel` image generation models, both for the [`ImageGenerationModel`][distilabel.models.image_generation.ImageGenerationModel] synchronous implementation, and for the [`AsyncImageGenerationModel`][distilabel.models.image_generation.AsyncImageGenerationModel] asynchronous one.

For more information and examples on how to use existing image generation models or create custom ones, please refer to [Tutorial - ImageGenerationModel](../../../sections/how_to_guides/basic/task/image_task.md).

::: distilabel.models.image_generation.base
3 changes: 0 additions & 3 deletions docs/api/pipeline/typing.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/step/typing.md

This file was deleted.

7 changes: 7 additions & 0 deletions docs/api/task/image_task.md
@@ -0,0 +1,7 @@
# ImageTask

This section contains the API reference for the `distilabel` image generation tasks.

For more information on how the [`ImageTask`][distilabel.steps.tasks.ImageTask] works and to see some examples, check the [Tutorial - Task - ImageTask](../../sections/how_to_guides/basic/task/generator_task.md) page.

::: distilabel.steps.tasks.base.ImageTask
1 change: 1 addition & 0 deletions docs/api/task/task_gallery.md
@@ -8,5 +8,6 @@ This section contains the existing [`Task`][distilabel.steps.tasks.Task] subclas
- "!Task"
- "!_Task"
- "!GeneratorTask"
- "!ImageTask"
- "!ChatType"
- "!typing"
3 changes: 0 additions & 3 deletions docs/api/task/typing.md

This file was deleted.

8 changes: 8 additions & 0 deletions docs/api/typing.md
@@ -0,0 +1,8 @@
# Types

This section contains the different types used across the distilabel codebase.

::: distilabel.typing.base
::: distilabel.typing.steps
::: distilabel.typing.models
::: distilabel.typing.pipeline
27 changes: 27 additions & 0 deletions docs/sections/how_to_guides/advanced/distiset.md
@@ -119,6 +119,33 @@ class MagpieGenerator(GeneratorTask, MagpieBase):

The `Citations` section can include any number of BibTeX references. To define them, add as many elements as needed, as in the example: each citation is a block of the form ` ```@misc{...}``` `. This information is automatically used in the README of your `Distiset` if you call `distiset.push_to_hub`. Alternatively, if no `Citations` section is found but the `References` section contains URLs pointing to `https://arxiv.org/`, we will try to obtain the equivalent BibTeX entry automatically. This way, Hugging Face can automatically track the paper for you, and it is easier to find other datasets citing the same paper or to visit the paper page directly.
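
The fallback described above depends on spotting arXiv links inside the `References` section. The following is a minimal sketch of that detection step only, not the `distilabel` implementation: the `ARXIV_URL` pattern, the `find_arxiv_ids` helper, and the placeholder identifier `0000.00000` are all assumptions made for illustration, and the BibTeX lookup itself is not shown.

```python
import re
from typing import List

# Hypothetical pattern: matches arXiv abstract or PDF links and captures the
# paper identifier, ignoring an optional version suffix such as "v2".
ARXIV_URL = re.compile(r"https://arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?")


def find_arxiv_ids(docstring: str) -> List[str]:
    """Return the unique arXiv identifiers found in a docstring, in order of appearance."""
    seen: List[str] = []
    for match in ARXIV_URL.finditer(docstring):
        paper_id = match.group(1)
        if paper_id not in seen:
            seen.append(paper_id)
    return seen


# Placeholder reference; the identifier is deliberately not a real paper.
example_docstring = """
References:
    - [Some paper title](https://arxiv.org/abs/0000.00000)
"""

arxiv_ids = find_arxiv_ids(example_docstring)
```

Each identifier found this way could then be exchanged for a BibTeX entry; how that exchange happens is left out because the text above does not describe it.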

#### Image Datasets

!!! info "Keep reading if you are interested in Image datasets"

The `Distiset` object has a new method, `transform_columns_to_image`, that converts image columns to `PIL.Image.Image` before the dataset is pushed to the Hugging Face Hub.

Since version `1.5.0`, the [`ImageGeneration`](https://distilabel.argilla.io/dev/components-gallery/task/imagegeneration/) task can generate images from text. By default, the whole process works internally with a string representation of the images, which keeps processing simple. However, if the generated dataset is going to be stored on the Hugging Face Hub, a proper image object is preferable so that, for example, the images can be seen in the dataset viewer. The following pipeline, extracted from `examples/image_generation.py` at the root of the repository, shows how to do it:

```diff
# Assume all the imports are already done, we are only interested
with Pipeline(name="image_generation_pipeline") as pipeline:
    img_generation = ImageGeneration(
        name="flux_schnell",
        image_generation_model=InferenceEndpointsImageGeneration(
            model_id="black-forest-labs/FLUX.1-schnell"
        ),
    )
    ...

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False, dataset=ds)
    # Save the images as `PIL.Image.Image`
+    distiset = distiset.transform_columns_to_image("image")
    distiset.push_to_hub(...)
```

Calling [`transform_columns_to_image`][distilabel.distiset.Distiset.transform_columns_to_image] converts the image columns we may have generated to `PIL.Image.Image` (in this case we only want to transform the `image` column, but a list of columns can also be passed). The transformation applies to every leaf node of the pipeline, so if there are several subsets, the `image` column will be found and transformed in each of them.
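
To make the string-to-image step more concrete, here is a rough standalone sketch. It assumes the internal string representation is a base64-encoded image, which the text above does not state, and it stops at raw bytes instead of building `PIL.Image.Image` objects so that it runs without extra dependencies. The `decode_image_columns` helper is hypothetical and not part of `distilabel`.

```python
import base64
from typing import Any, Dict, List

# 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_image_columns(
    rows: List[Dict[str, Any]], columns: List[str]
) -> List[Dict[str, Any]]:
    """Return copies of the rows with the given base64 string columns decoded to bytes."""
    decoded_rows = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows stay untouched
        for column in columns:
            new_row[column] = base64.b64decode(row[column])
        decoded_rows.append(new_row)
    return decoded_rows


# Stand-in for a generated image: only the PNG signature, not a complete image.
fake_image = base64.b64encode(PNG_SIGNATURE).decode("utf-8")
rows = [{"prompt": "a cat", "image": fake_image}]
decoded = decode_image_columns(rows, ["image"])
```

With Pillow installed, the decoded bytes could be wrapped in `io.BytesIO` and opened with `PIL.Image.open` to obtain the image object.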

### Save and load from disk

Take into account that these methods work as `datasets.load_from_disk` and `datasets.Dataset.save_to_disk` so the arguments are directly passed to those methods. This means you can also make use of `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] in your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card). You can read more in `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).
@@ -9,7 +9,7 @@ from typing import List

from distilabel.steps import Step
from distilabel.steps.base import StepInput
- from distilabel.steps.typing import StepOutput
+ from distilabel.typing import StepOutput
from distilabel.steps import LoadDataFromDicts
from distilabel.utils.requirements import requirements
from distilabel.pipeline import Pipeline
4 changes: 2 additions & 2 deletions docs/sections/how_to_guides/advanced/structured_generation.md
@@ -21,7 +21,7 @@ The [`LLM`][distilabel.models.llms.LLM] has an argument named `structured_output
We will start with a JSON example, where we initially define a `pydantic.BaseModel` schema to guide the generation of the structured output.

!!! NOTE
- Take a look at [`StructuredOutputType`][distilabel.steps.tasks.typing.StructuredOutputType] to see the expected format
+ Take a look at [`StructuredOutputType`][distilabel.typing.models.StructuredOutputType] to see the expected format
of the `structured_output` dict variable.

```python
@@ -139,7 +139,7 @@ For other LLM providers behind APIs, there's no direct way of accessing the inte
```

!!! Note
- Take a look at [`InstructorStructuredOutputType`][distilabel.steps.tasks.typing.InstructorStructuredOutputType] to see the expected format
+ Take a look at [`InstructorStructuredOutputType`][distilabel.typing.models.InstructorStructuredOutputType] to see the expected format
of the `structured_output` dict variable.

The following is the same example you can see with `outlines`'s `JSON` section for comparison purposes.
8 changes: 4 additions & 4 deletions docs/sections/how_to_guides/basic/step/generator_step.md
@@ -9,7 +9,7 @@ from typing_extensions import override
from distilabel.steps import GeneratorStep

if TYPE_CHECKING:
- from distilabel.steps.typing import StepColumns, GeneratorStepOutput
+ from distilabel.typing import StepColumns, GeneratorStepOutput

class MyGeneratorStep(GeneratorStep):
instructions: List[str]
@@ -67,7 +67,7 @@ We can define a custom generator step by creating a new subclass of the [`Genera
The default signature for the `process` method is `process(self, offset: int = 0) -> GeneratorStepOutput`. The argument `offset` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one [`Step`][distilabel.steps.Step] at a time could be connected to the current one.
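
The generator signature described above can be sketched without importing `distilabel`. The sketch assumes that `GeneratorStepOutput` is an iterator of `(batch, is_last_batch)` pairs and that `offset` means "skip this many rows before producing data"; the `generate_batches` function and its `batch_size` parameter are hypothetical names used only for illustration.

```python
from typing import Any, Dict, Iterator, List, Tuple


def generate_batches(
    instructions: List[str], batch_size: int, offset: int = 0
) -> Iterator[Tuple[List[Dict[str, Any]], bool]]:
    """Yield (batch, is_last_batch) pairs, skipping the first `offset` rows."""
    remaining = instructions[offset:]
    for start in range(0, len(remaining), batch_size):
        batch = [
            {"instruction": text} for text in remaining[start : start + batch_size]
        ]
        # The flag tells the consumer that no more batches will follow.
        is_last_batch = start + batch_size >= len(remaining)
        yield batch, is_last_batch


all_batches = list(generate_batches(["a", "b", "c"], batch_size=2))
resumed_batches = list(generate_batches(["a", "b", "c"], batch_size=2, offset=2))
```

Honouring `offset` is what would let a run resume part-way through instead of regenerating rows that were already produced.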

!!! WARNING
- For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.steps.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
+ For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.

=== "Inherit from `GeneratorStep`"

@@ -81,7 +81,7 @@ We can define a custom generator step by creating a new subclass of the [`Genera
from distilabel.steps import GeneratorStep

if TYPE_CHECKING:
- from distilabel.steps.typing import StepColumns, GeneratorStepOutput
+ from distilabel.typing import StepColumns, GeneratorStepOutput

class MyGeneratorStep(GeneratorStep):
instructions: List[str]
@@ -104,7 +104,7 @@ We can define a custom generator step by creating a new subclass of the [`Genera
from distilabel.steps import step

if TYPE_CHECKING:
- from distilabel.steps.typing import GeneratorStepOutput
+ from distilabel.typing import GeneratorStepOutput

@step(outputs=[...], step_type="generator")
def CustomGeneratorStep(offset: int = 0) -> "GeneratorStepOutput":
6 changes: 3 additions & 3 deletions docs/sections/how_to_guides/basic/step/global_step.md
@@ -16,7 +16,7 @@ We can define a custom step by creating a new subclass of the [`GlobalStep`][dis
The default signature for the `process` method is `process(self, *inputs: StepInput) -> StepOutput`. The argument `inputs` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one [`Step`][distilabel.steps.Step] at a time could be connected to the current one.
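
As a standalone illustration of the `*inputs` convention described above, the sketch below uses plain lists of dictionaries in place of `StepInput` and a plain generator in place of `StepOutput`. Both stand-ins are assumptions, and the `text` and `length` columns are hypothetical.

```python
from typing import Any, Dict, Iterator, List

Rows = List[Dict[str, Any]]  # stand-in for one upstream step's rows


def process(*inputs: Rows) -> Iterator[Rows]:
    """Handle the rows from every connected upstream step, one input at a time."""
    for upstream_rows in inputs:
        # Add a derived column while keeping the existing ones.
        yield [{**row, "length": len(row["text"])} for row in upstream_rows]


first = [{"text": "hello"}]
second = [{"text": "hi"}, {"text": "hey"}]
outputs = list(process(first, second))
```

Because the function takes a variable number of positional inputs, the same code works whether one or several upstream steps are connected.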

!!! WARNING
- For the custom [`GlobalStep`][distilabel.steps.GlobalStep] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.steps.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
+ For the custom [`GlobalStep`][distilabel.steps.GlobalStep] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.

=== "Inherit from `GlobalStep`"

@@ -27,7 +27,7 @@ We can define a custom step by creating a new subclass of the [`GlobalStep`][dis
from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
- from distilabel.steps.typing import StepColumns, StepOutput
+ from distilabel.typing import StepColumns, StepOutput

class CustomStep(Step):
@property
@@ -61,7 +61,7 @@ We can define a custom step by creating a new subclass of the [`GlobalStep`][dis
from distilabel.steps import StepInput, step

if TYPE_CHECKING:
- from distilabel.steps.typing import StepOutput
+ from distilabel.typing import StepOutput

@step(inputs=[...], outputs=[...], step_type="global")
def CustomStep(inputs: StepInput) -> "StepOutput":
8 changes: 4 additions & 4 deletions docs/sections/how_to_guides/basic/step/index.md
@@ -11,7 +11,7 @@ from typing import TYPE_CHECKING
from distilabel.steps import Step, StepInput

if TYPE_CHECKING:
- from distilabel.steps.typing import StepColumns, StepOutput
+ from distilabel.typing import StepColumns, StepOutput

class MyStep(Step):
@property
@@ -87,7 +87,7 @@ We can define a custom step by creating a new subclass of the [`Step`][distilabe
The default signature for the `process` method is `process(self, *inputs: StepInput) -> StepOutput`. The argument `inputs` should be respected, no more arguments can be provided, and the type-hints and return type-hints should be respected too because it should be able to receive any number of inputs by default i.e. more than one [`Step`][distilabel.steps.Step] at a time could be connected to the current one.

!!! WARNING
- For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.steps.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
+ For the custom [`Step`][distilabel.steps.Step] subclasses to work properly with `distilabel` and with the validation and serialization performed by default over each [`Step`][distilabel.steps.Step] in the [`Pipeline`][distilabel.pipeline.Pipeline], the type-hint for both [`StepInput`][distilabel.steps.StepInput] and [`StepOutput`][distilabel.typing.StepOutput] should be used and not surrounded with double-quotes or imported under `typing.TYPE_CHECKING`, otherwise, the validation and/or serialization will fail.
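
One plausible reason for the warning above is that runtime introspection has to resolve the annotations. The standard-library sketch below shows how a quoted hint that names something imported only under `typing.TYPE_CHECKING` cannot be resolved at runtime. Whether `distilabel` resolves hints in exactly this way is an assumption, and `Decimal` merely stands in for a type such as `StepOutput`.

```python
from typing import TYPE_CHECKING, Callable, List, get_type_hints

if TYPE_CHECKING:
    # Only visible to static type checkers; never imported at runtime.
    from decimal import Decimal


def quoted_hint(values: "List[Decimal]") -> None:
    """The annotation is a string naming something that is missing at runtime."""


def real_hint(values: List[int]) -> None:
    """The annotation is a real object that can be inspected at runtime."""


def can_resolve(func: Callable) -> bool:
    """Report whether the annotations of `func` can be evaluated at runtime."""
    try:
        get_type_hints(func)
        return True
    except NameError:
        return False


quoted_ok = can_resolve(quoted_hint)
real_ok = can_resolve(real_hint)
```

Any code that inspects signatures at runtime would hit the same failure, which is consistent with the advice to import these types normally and leave the hints unquoted.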

=== "Inherit from `Step`"

@@ -98,7 +98,7 @@ We can define a custom step by creating a new subclass of the [`Step`][distilabe
from distilabel.steps import Step, StepInput

if TYPE_CHECKING:
- from distilabel.steps.typing import StepColumns, StepOutput
+ from distilabel.typing import StepColumns, StepOutput

class CustomStep(Step):
@property
@@ -132,7 +132,7 @@ We can define a custom step by creating a new subclass of the [`Step`][distilabe
from distilabel.steps import StepInput, step

if TYPE_CHECKING:
- from distilabel.steps.typing import StepOutput
+ from distilabel.typing import StepOutput

@step(inputs=[...], outputs=[...])
def CustomStep(inputs: StepInput) -> "StepOutput":
5 changes: 2 additions & 3 deletions docs/sections/how_to_guides/basic/task/generator_task.md
@@ -12,8 +12,7 @@ from typing import Any, Dict, List, Union
from typing_extensions import override

from distilabel.steps.tasks.base import GeneratorTask
- from distilabel.steps.tasks.typing import ChatType
- from distilabel.steps.typing import GeneratorOutput
+ from distilabel.typing import ChatType, GeneratorOutput


class MyCustomTask(GeneratorTask):
@@ -78,7 +77,7 @@ We can define a custom generator task by creating a new subclass of the [`Genera
from typing import Any, Dict, List, Union

from distilabel.steps.tasks.base import GeneratorTask
- from distilabel.steps.tasks.typing import ChatType
+ from distilabel.typing import ChatType


class MyCustomTask(GeneratorTask):