
Conversation

@rudolfix
Collaborator

Description

Now a relation can be passed like any other data to a pipeline and to dlt.resource. Also:

  • I fixed an error in the dlt-ibis backend (limit could be a string)
  • I allowed any data to be passed to dlt.resource in typing (which has been the de facto behavior for a long time)
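
A minimal sketch of what this now allows, assuming a local duckdb destination; the table and resource names here are illustrative only, not taken from the PR:

import dlt

pipeline = dlt.pipeline("ingest", destination="duckdb")
pipeline.run([{"a": 1, "b": 2, "c": 3}], table_name="foo")

# a relation built from the pipeline's dataset...
relation = pipeline.dataset().table("foo").select("a", "b")

# ...can now be passed to run() and to dlt.resource like any other data
pipeline.run(relation, table_name="foo_copy")
pipeline.run(dlt.resource(relation, name="foo_copy_resource"))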

@rudolfix rudolfix requested a review from zilto October 31, 2025 11:16

@rudolfix rudolfix marked this pull request as draft October 31, 2025 13:05
@zilto
Collaborator

zilto commented Oct 31, 2025

For our follow-up discussion, I think this is roughly the algorithm.

This may seem like a lot, but the details will be particularly important if we want to execute the transformation DAG without round-trips between memory and the backend.

flowchart TD
    Start([Pipeline Execution Starts]) --> Execution

    Execution[Execute on-backend] --> CheckDataset{relation.dataset == pipeline.dataset}
    
    CheckDataset -->|No| ReadTable[Read table]
    CheckDataset -->|Yes| CheckExec{Relation method called}
    
    CheckExec -->|.arrow| ReadTable[Read table]
    CheckExec -->|.iter_arrow| ReadBatch[Read batch]
    CheckExec -->|None| Write
    
    ReadBatch --> WriteBatch{Configured batch write}
    ReadTable --> WriteBatch{Configured batch write}
    
    WriteBatch -->|No| WriteTable[Write table]
    WriteBatch -->|Yes| WriteBatches[Write batches]
    
    WriteTable --> Write[Write data to dataset]
    WriteBatches --> Write
    
    Write --> End([Load completed])
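In Python-flavored pseudocode, the same decision logic might look roughly like this; helper names such as same_dataset, read_table, and write_batches are placeholders for illustration, not existing dlt APIs:

def load_relation(pipeline, relation, materialize=None, batch_write=False):
    """materialize is None (stay on backend), "arrow", or "iter_arrow"."""
    if same_dataset(relation, pipeline) and materialize is None:
        # relation.dataset == pipeline.dataset and no explicit read:
        # execute SQL on the destination, data never leaves the backend
        return write_on_backend(pipeline, relation)

    # otherwise the data comes into memory, as a full table or as batches
    if materialize == "iter_arrow":
        data = read_batches(relation)   # iterator of arrow record batches
    else:
        data = read_table(relation)     # single arrow table

    # independent of how the data was read, the configured batch write
    # decides whether it is written as one table or in batches
    if batch_write:
        return write_batches(pipeline, data)
    return write_table(pipeline, data)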

Scenario 1

Param:

  • Input relation
  • same pipeline
  • same destination
  • same dataset

Output:

  • execute SQL on destination
  • data stays on backend
  • writes to same dataset

pipeline = dlt.pipeline("ingest")
pipeline.run(rest_source())

dataset = pipeline.dataset()
relation = dataset.table("foo").select("a", "b", "c")

pipeline.run(relation, table_name="rel_table")

Scenario 2

Same as Scenario 1, but the user wants to force in-memory execution (there's no good reason to do so here). This must be done explicitly in code by calling relation.arrow().
Input:

  • arrow table
  • same pipeline
  • same destination
  • same dataset

Output:

  • execute SQL on destination
  • data loaded in memory
  • writes to same dataset

# ibidem
relation = dataset.table("foo").select("a", "b", "c")

pipeline.run(relation.arrow(), table_name="rel_table")

Scenario 3

Same as scenario 1, but we use a different write pipeline and dataset for namespacing. Same destination.
Param:

  • Input relation
  • same destination
  • different pipeline
  • different dataset

Output:

  • execute SQL on destination
  • data stays on backend
  • writes to different dataset

destination = dlt.destinations.duckdb("duck.duckdb")
pipeline = dlt.pipeline("ingest", destination=destination)
pipeline.run(rest_source())

dataset = pipeline.dataset()
relation = dataset.table("foo").select("a", "b", "c")

other_pipeline = dlt.pipeline("transform", destination=destination)
other_pipeline.run(relation, table_name="rel_table")

Scenario 4

Now, our second pipeline writes to a different destination and dataset. Data must be loaded in memory.
Param:

  • Input relation
  • different destination
  • different pipeline
  • different dataset

Output:

  • execute SQL on destination
  • data loaded in memory
  • writes to different destination and dataset

destination = dlt.destinations.duckdb("duck.duckdb")
pipeline = dlt.pipeline("ingest", destination=destination)
pipeline.run(rest_source())

dataset = pipeline.dataset()
relation = dataset.table("foo").select("a", "b", "c")

other_destination = dlt.destinations.bigquery()
other_pipeline = dlt.pipeline("transform", destination=other_destination)
other_pipeline.run(relation, table_name="rel_table")

Scenario 5

In scenarios where data is loaded in memory, batch write vs. full-table write is configured via the dlt chunk_size config. Nothing needs to be explicitly passed in Python. Reusing Scenario 4.

Input:

  • Input relation
  • different destination
  • different pipeline
  • different dataset

Output:

  • execute SQL on destination
  • data loaded in memory
  • writes to different destination and dataset
  • writes in batches

# ibidem
# because of config.toml, relation will read the full table and write in batch
other_pipeline.run(relation, table_name="rel_table")
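
For illustration only, such a configuration-driven batch write could be supplied through dlt's config mechanism rather than code; the key name below (chunk_size under a data-writer-style section) is hypothetical, taken from the wording above, and not an existing dlt option:

import os

# hypothetical key, mirroring dlt's SECTION__KEY environment variable
# convention; the real name of the chunked-write setting is not fixed here
os.environ["DATA_WRITER__CHUNK_SIZE"] = "50000"

# same call as above; batching is driven purely by configuration
other_pipeline.run(relation, table_name="rel_table")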

Scenario 6

In scenarios where data is loaded in memory, it can be desirable to read in batches instead of loading the full table. This needs to be done explicitly in Python: pipeline.run() should receive an iterator instead of a relation. It's important to make this explicit because downstream operations differ between a full table and chunks (e.g., computing aggregates). The batch writer is independent and will properly collect multiple read batches if required.

Reusing scenario 3.
Input:

  • arrow batch iterator
  • same pipeline
  • same destination
  • same dataset

Output:

  • execute SQL on destination
  • data loaded in memory as batches
  • writes to same dataset

# ibidem
relation = dataset.table("foo").select("a", "b", "c")

pipeline.run(relation.iter_arrow(), table_name="rel_table")
