Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug[duckdb-geospatial]: read_parquet defaults silently to pyarrow reading geometry as binary #9882

Open
ncclementi opened this issue Aug 20, 2024 · 4 comments
Labels
duckdb The DuckDB backend geospatial Geospatial related functionality io Issues related to input and/or output

Comments

@ncclementi
Copy link
Contributor

ncclementi commented Aug 20, 2024

EDIT: SEE NEW COMMENTS.

When we read a parquet file that was written with to_parquet (ibis) and contains a geometry column, it reads it back as binary. This is a bug on our end, as this doesn't happen if the file is written with plain duckdb.

Reproducer

import ibis
from ibis import _

con = ibis.get_backend()

url = "s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure"
t = con.read_parquet(url, table_name="infra")

expr = t.filter(_.bbox.xmin > -77.119795,
                _.bbox.xmax < -76.909366,
                _.bbox.ymin > 38.791631,
                _.bbox.ymax < 38.995968
                )
con.to_parquet(expr, "infra_ibis.parquet")

t2 = con.read_parquet("infra_ibis.parquet")
ibis.options.interactive = True

t2.head(3).select("geometry")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ geometry                                                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ binary                                                     │
├────────────────────────────────────────────────────────────┤
│ b'\x00\x00\x00\x00\x01\xc0SGY\xb5\xe9[y@Ceo\xef|$>'        │
│ b'\x00\x00\x00\x00\x01\xc0SGY\xb3\xd0|\x85@Cet$\x95\n\xbf' │
│ b'\x00\x00\x00\x00\x01\xc0SGX+:@n@Cev\x88\x1c\x99\x94'     │
└────────────────────────────────────────────────────────────┘
@ncclementi ncclementi added geospatial Geospatial related functionality duckdb The DuckDB backend io Issues related to input and/or output labels Aug 20, 2024
@ncclementi
Copy link
Contributor Author

ncclementi commented Aug 21, 2024

I was investigating this further and it looks like in our end read_parquet tries to use the duckdb read, and if it fails it defaults to pyarrow.
https://github.com/ibis-project/ibis/blob/0a6765b664636fa7deccaead38a4ee2bd8548232/ibis/backends/duckdb/__init__.py#L853C1-L859C82

I realized I had a typo on my Ibis url.

url = "s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure" 
t = con.read_parquet(url, table_name="infra")
t
DatabaseTable: infra
  id          string
  geometry    binary
  bbox        xmin: float32
  xmax: float32
  ymin: float32
  ymax: float32
  version     int32
  ...

If I do this, notice /* at the end of the url

url = "s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure/*"
t = con.read_parquet(url, table_name="infra")
t

I get the type I want.

DatabaseTable: infra
  id          string
  geometry    geospatial:geometry
  bbox        xmin: float32
  xmax: float32
  ymin: float32
  ymax: float32
  version     int32

This made me think, that the exception might be being triggered and defaulting to pyarrow, and somehow pyarrow can read the url without /*

I tried the url without /* on duckdb and I got

import duckdb
duckdb.sql("install spatial;")
duckdb.sql("load spatial;")

sql = """SELECT *
    FROM read_parquet('s3://overturemaps-us-west-2/release/2024-06-13-beta.1/theme=base/type=infrastructure')
    WHERE bbox.xmin > -84.36
    AND bbox.xmax < -82.42
    AND bbox.ymin > 41.71
    AND bbox.ymax < 43.33;
"""
tdb = duckdb.sql(sql)
---------------------------------------------------------------------------
HTTPException                             Traceback (most recent call last)
Cell In[5], line 8
      1 sql = """SELECT *
      2     FROM read_parquet('s3://overturemaps-us-west-2/release/2024-06-13-beta.1/theme=base/type=infrastructure')
      3     WHERE bbox.xmin > -84.36
   (...)
      6     AND bbox.ymax < 43.33;
      7 """
----> 8 tdb = duckdb.sql(sql)
      9 # duckdb.sql("SELECT * FROM tdb LIMIT 10;")

HTTPException: HTTP Error: Unable to connect to URL "https://overturemaps-us-west-2.s3.amazonaws.com/release/2024-06-13-beta.1/theme%3Dbase/type%3Dinfrastructure": 404 (Not Found)

and turns out that HTTPException inherits from IOException on duckdb see code here

so it's happening exactly what I thought.

I personally don't like that this happen silently.

Solutions:

  • Give a warning maybe
  • Stop defaulting
  • Does pyarrow parquet reader reads geoparquet? I can't find docs on it

@ncclementi ncclementi changed the title bug[duckdb-geospatial]: to_parquet roundtrip does not read geometry properly bug[duckdb-geospatial]: read_parquet defaults silently to pyarrow reading geometry as binary Aug 21, 2024
@cpcloud
Copy link
Member

cpcloud commented Aug 21, 2024

We really need to straighten out the pyarrow versus duckdb cloud read behavior.

My vote is to use DuckDB's readers and work with them to smooth out any rough edges.

@cpcloud
Copy link
Member

cpcloud commented Aug 21, 2024

Whatever we do, we can't continue with fall back behavior.

@gforsyth
Copy link
Member

Yeah, agreed the fallback behavior has to go. If we want to use the DuckDB readers, we'll probably also want to figure out how to expose the new credential manager (since the reason for the fallback in the first place was for better AWS credential support)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duckdb The DuckDB backend geospatial Geospatial related functionality io Issues related to input and/or output
Projects
Status: backlog
Development

No branches or pull requests

3 participants