
Conversation

@rudolfix
Collaborator

Description

Now a relation can be passed like any other data to a pipeline and to dlt.resource. Also:

  • I fixed an error in the dlt-ibis backend (limit could be a string)
  • I allowed any data to be passed to dlt.resource in typing (which has been the de facto behavior for a long time)
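
A minimal sketch of what this now allows, assuming a local duckdb destination; the table and resource names here are illustrative only, not taken from the PR:

import dlt

pipeline = dlt.pipeline("ingest", destination="duckdb")
pipeline.run([{"a": 1, "b": 2, "c": 3}], table_name="foo")

# a relation built from the pipeline's dataset...
relation = pipeline.dataset().table("foo").select("a", "b")

# ...can now be passed to run() and to dlt.resource like any other data
pipeline.run(relation, table_name="foo_copy")
pipeline.run(dlt.resource(relation, name="foo_copy_resource"))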

@rudolfix rudolfix requested a review from zilto October 31, 2025 11:16

@rudolfix rudolfix marked this pull request as draft October 31, 2025 13:05
@zilto
Collaborator

zilto commented Oct 31, 2025

For our follow-up discussion, I think this is roughly the algorithm.

This may seem like a lot, but the details will be particularly important if we want to execute the transformation DAG without round-trips between memory and the backend.

flowchart TD
    Start([Pipeline Execution Starts]) --> Execution

    Execution[Execute on-backend] --> CheckDataset{relation.dataset == pipeline.dataset}
    
    CheckDataset -->|No| ReadTable[Read table]
    CheckDataset -->|Yes| CheckExec{Relation method called}
    
    CheckExec -->|.arrow| ReadTable[Read table]
    CheckExec -->|.iter_arrow| ReadBatch[Read batch]
    CheckExec -->|None| Write
    
    ReadBatch --> WriteBatch{Configured batch write}
    ReadTable --> WriteBatch{Configured batch write}
    
    WriteBatch -->|No| WriteTable[Write table]
    WriteBatch -->|Yes| WriteBatches[Write batches]
    
    WriteTable --> Write[Write data to dataset]
    WriteBatches --> Write
    
    Write --> End([Load completed])
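In Python-flavored pseudocode, the same decision logic might look roughly like this; helper names such as same_dataset, read_table, and write_batches are placeholders for illustration, not existing dlt APIs:

def load_relation(pipeline, relation, materialize=None, batch_write=False):
    """materialize is None (stay on backend), "arrow", or "iter_arrow"."""
    if same_dataset(relation, pipeline) and materialize is None:
        # relation.dataset == pipeline.dataset and no explicit read:
        # execute SQL on the destination, data never leaves the backend
        return write_on_backend(pipeline, relation)

    # otherwise the data comes into memory, as a full table or as batches
    if materialize == "iter_arrow":
        data = read_batches(relation)   # iterator of arrow record batches
    else:
        data = read_table(relation)     # single arrow table

    # independent of how the data was read, the configured batch write
    # decides whether it is written as one table or in batches
    if batch_write:
        return write_batches(pipeline, data)
    return write_table(pipeline, data)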

Scenario 1

Param:

  • Input relation
  • same pipeline
  • same destination
  • same dataset

Output:

  • execute SQL on destination
  • data stays on backend
  • writes to same dataset

pipeline = dlt.pipeline("ingest")
pipeline.run(rest_source())

dataset = pipeline.dataset()
relation = dataset.table("foo").select("a", "b", "c")

pipeline.run(relation, table_name="rel_table")

Scenario 2

Same as Scenario 1, but the user wants to force in-memory execution (there's no good reason to do so here). This must be done explicitly in code by calling relation.arrow().
Input:

  • arrow table
  • same pipeline
  • same destination
  • same dataset

Output:

  • execute SQL on destination
  • data loaded in memory
  • writes to same dataset

# ibidem
relation = dataset.table("foo").select("a", "b", "c")

pipeline.run(relation.arrow(), table_name="rel_table")

Scenario 3

Same as scenario 1, but we use a different write pipeline and dataset for namespacing. Same destination.
Param:

  • Input relation
  • same destination
  • different pipeline
  • different dataset

Output:

  • execute SQL on destination
  • data stays on backend
  • writes to different dataset

destination = dlt.destinations.duckdb("duck.duckdb")
pipeline = dlt.pipeline("ingest", destination=destination)
pipeline.run(rest_source())

dataset = pipeline.dataset()
relation = dataset.table("foo").select("a", "b", "c")

other_pipeline = dlt.pipeline("transform", destination=destination)
other_pipeline.run(relation, table_name="rel_table")

Scenario 4

Now, our second pipeline writes to a different destination and dataset. Data must be loaded in memory.
Param:

  • Input relation
  • different destination
  • different pipeline
  • different dataset

Output:

  • execute SQL on destination
  • data loaded in memory
  • writes to different destination and dataset

destination = dlt.destinations.duckdb("duck.duckdb")
pipeline = dlt.pipeline("ingest", destination=destination)
pipeline.run(rest_source())

dataset = pipeline.dataset()
relation = dataset.table("foo").select("a", "b", "c")

other_destination = dlt.destinations.bigquery()
other_pipeline = dlt.pipeline("transform", destination=other_destination)
other_pipeline.run(relation, table_name="rel_table")

Scenario 5

In scenarios where data is loaded in memory, batch write vs. full-table write is configured via the dlt chunk_size config. Nothing needs to be explicitly passed in Python. Reusing Scenario 4.

Input:

  • Input relation
  • different destination
  • different pipeline
  • different dataset

Output:

  • execute SQL on destination
  • data loaded in memory
  • writes to different destination and dataset
  • writes in batches

# ibidem
# because of config.toml, relation will read the full table and write in batch
other_pipeline.run(relation, table_name="rel_table")
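
For illustration only, such a configuration-driven batch write could be supplied through dlt's config mechanism rather than code; the key name below (chunk_size under a data-writer-style section) is hypothetical, taken from the wording above, and not an existing dlt option:

import os

# hypothetical key, mirroring dlt's SECTION__KEY environment variable
# convention; the real name of the chunked-write setting is not fixed here
os.environ["DATA_WRITER__CHUNK_SIZE"] = "50000"

# same call as above; batching is driven purely by configuration
other_pipeline.run(relation, table_name="rel_table")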

Scenario 6

In scenarios where data is loaded in memory, it can be desirable to read in batches instead of loading the full table. This needs to be done explicitly in Python: pipeline.run() should receive an iterator instead of a relation. It's important to make this explicit because downstream operations differ between a full table and chunks (e.g., computing aggregates). The batch writer is independent and will properly collect multiple read batches if required.

Reusing scenario 3.
Input:

  • arrow batch iterator
  • same pipeline
  • same destination
  • same dataset

Output:

  • execute SQL on destination
  • data loaded in memory as batches
  • writes to same dataset

# ibidem
relation = dataset.table("foo").select("a", "b", "c")

pipeline.run(relation.iter_arrow(), table_name="rel_table")
