[CT-1704] [Bug] Incremental models updates failing due to nested data limitation in Redshift Spectrum #254

davide-vito · 2022-12-21T16:41:26Z

Is this a new bug in dbt-redshift?

I believe this is a new bug in dbt-redshift
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Hello, we are trying to use dbt cloud with Redshfit Spectrum.

When we first build our incremental models with dbt run everything works fine, however when we run it again without the --full-refresh flag, we get the error Predicates not yet supported in nested s3 scan. This same error happens if we run the compiled incremental query on redshift too, suggesting i's not supported by Redshift Spectrum (see Redshift docs - Nested-data-restrictions).

Is there another way to make this work?

Expected Behavior

We expected to update the table with the latest modifications by date as summarised in the dbt docs for incremental models

Steps To Reproduce

In dbt cloud, using as an example the following model:

{{ config(
        materialized='incremental'
         ) 
}}

with A as (
    SELECT
        u.entity.id as id,
        CAST(u.last_updated as timestamp) as last_updated,
        u.entity.colA,
        u.entity.colB,
        u.entity.colC,
    FROM {{ source('test', 'test_table') }} u
)

SELECT * FROM A

{% if is_incremental() %}

  where last_updated > (select max(last_updated) from {{ this }})
  
{% endif %}

Relevant log output

Predicates not yet supported in nested s3 scan.

Environment

dbt cloud

Additional Context

No response

The text was updated successfully, but these errors were encountered:

dbeatty10 · 2022-12-22T14:11:57Z

Thanks for reaching out @davide-vito !

I agree with your assessment that this error message is specific to Redshift Spectrum and not something that dbt is doing.

Some ideas, some or all of which may be impractical in your specific use case:

Collaborate with the producer of the object in S3 to add last_updated as a top-level attribute (rather than nested)
Add a transformation step that writes the entire object to a regular table with unnested attributes first, then do a second step that performs the incremental logic

I haven't used Redshift Spectrum personally, so there might be some other approaches that I can't think of!

I'd suggest asking your question in dbt Discourse -- it's best place to get help & advice from the community of dbt practitioners:

https://discourse.getdbt.com/c/help/19

If you go with idea 2), it sounds like you wouldn't be the first to resort to a similar strategy [1]:

Because of the query execution inefficiency for nested fields, ETL pipelines were set up at LinkedIn to copy the nested columns as a set of top-level columns corresponding to subfields

I'm going to close this because I don't think we can do anything about it within dbt-redshift. But if anyone discovers we can do something about it, we'll be glad to re-open it.

dossy · 2024-08-12T16:47:02Z

Sorry for necroposting, but since I had this exact same issue (incremental model on Redshift Spectrum), I wanted to share the work-around that I came up with, as I know I won't remember the next time I need it and hopefully will find my own comment here to remind me, and maybe it'll help others who wind up here as well, as it was the first result in Google for Predicates not yet supported in nested s3 scan.

My solution, tl;dr:

{% set last_timestamp_query %}

    select max(event_timestamp) from {{ this }}

{% endset %}

{% set last_timestamp = dbt_utils.get_single_value(last_timestamp_query, default=0) if is_incremental() else 0 %}

select *
from {{ source('some_schema', 'some_table') }}
where 1 = 1

{% if is_incremental() %}

and event_timestamp > {{ last_timestamp }}

{% endif %}

The implementation will need to be customized to your specific object names, but this is the "skeleton" of the approach that solves the problem for me.

Hope this helps!

davide-vito added bug Something isn't working triage labels Dec 21, 2022

github-actions bot changed the title ~~[Bug] Incremental models updates failing due to nested data limitation in Redshift Spectrum~~ [CT-1704] [Bug] Incremental models updates failing due to nested data limitation in Redshift Spectrum Dec 21, 2022

dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Dec 22, 2022

dbeatty10 added wontfix This will not be worked on and removed triage labels Dec 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-1704] [Bug] Incremental models updates failing due to nested data limitation in Redshift Spectrum #254

[CT-1704] [Bug] Incremental models updates failing due to nested data limitation in Redshift Spectrum #254

davide-vito commented Dec 21, 2022 •

edited

Loading

dbeatty10 commented Dec 22, 2022

dossy commented Aug 12, 2024

[CT-1704] [Bug] Incremental models updates failing due to nested data limitation in Redshift Spectrum #254

[CT-1704] [Bug] Incremental models updates failing due to nested data limitation in Redshift Spectrum #254

Comments

davide-vito commented Dec 21, 2022 • edited Loading

Is this a new bug in dbt-redshift?

Current Behavior

Expected Behavior

Steps To Reproduce

Relevant log output

Environment

Additional Context

dbeatty10 commented Dec 22, 2022

dossy commented Aug 12, 2024

davide-vito commented Dec 21, 2022 •

edited

Loading