bug[duckdb-geospatial]: read_parquet defaults silently to pyarrow reading geometry as binary #9882
Comments
I investigated this further. On our end, read_parquet first tries the DuckDB reader and, if that fails, falls back to pyarrow. I realized I had a typo in the URL I passed to Ibis:
url = "s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure"
t = con.read_parquet(url, table_name="infra")
t
If I do this instead (note the trailing /*):
url = "s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure/*"
t = con.read_parquet(url, table_name="infra")
t
I get the type I want.
This made me think that the exception was being triggered and read_parquet was falling back to pyarrow, which can somehow read the URL without the trailing /*. I tried the URL without it in DuckDB directly:
import duckdb
duckdb.sql("install spatial;")
duckdb.sql("load spatial;")
sql = """SELECT *
FROM read_parquet('s3://overturemaps-us-west-2/release/2024-06-13-beta.1/theme=base/type=infrastructure')
WHERE bbox.xmin > -84.36
AND bbox.xmax < -82.42
AND bbox.ymin > 41.71
AND bbox.ymax < 43.33;
"""
tdb = duckdb.sql(sql)
---------------------------------------------------------------------------
HTTPException Traceback (most recent call last)
Cell In[5], line 8
1 sql = """SELECT *
2 FROM read_parquet('s3://overturemaps-us-west-2/release/2024-06-13-beta.1/theme=base/type=infrastructure')
3 WHERE bbox.xmin > -84.36
(...)
6 AND bbox.ymax < 43.33;
7 """
----> 8 tdb = duckdb.sql(sql)
9 # duckdb.sql("SELECT * FROM tdb LIMIT 10;")
HTTPException: HTTP Error: Unable to connect to URL "https://overturemaps-us-west-2.s3.amazonaws.com/release/2024-06-13-beta.1/theme%3Dbase/type%3Dinfrastructure": 404 (Not Found)
It turns out that DuckDB's HTTPException inherits from IOException (see code here), so exactly what I suspected is happening: the HTTP error is caught and the read falls back to pyarrow. I personally don't like that this happens silently.
Solutions:
We really need to straighten out the pyarrow versus DuckDB cloud read behavior. My vote is to use DuckDB's readers and work with them to smooth out any rough edges.
Whatever we do, we can't continue with the fallback behavior.
Yeah, agreed the fallback behavior has to go. If we want to use the DuckDB readers, we'll probably also want to figure out how to expose the new credential manager (since the reason for the fallback in the first place was better AWS credential support).
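For anyone skimming the thread, here is a minimal sketch of the fallback pattern being discussed and of the stricter alternative. It is not the actual Ibis code. The exception classes and the duckdb_read / pyarrow_read functions are hypothetical stand-ins that only mirror what is described above: HTTPException subclasses IOException, the DuckDB read 404s on the URL without /*, and pyarrow returns the geometry column as binary.

```python
# Stand-in exception types (hypothetical). They only mirror the hierarchy
# described in this thread, where HTTPException inherits from IOException.
class IOException(Exception):
    pass


class HTTPException(IOException):
    pass


def duckdb_read(url):
    """Hypothetical stand-in for the DuckDB reader: fails without a trailing /*."""
    if not url.endswith("/*"):
        raise HTTPException(f"404 (Not Found): {url}")
    return {"reader": "duckdb", "geometry_type": "geometry"}


def pyarrow_read(url):
    """Hypothetical stand-in for the pyarrow reader: accepts the directory URL."""
    return {"reader": "pyarrow", "geometry_type": "binary"}


def read_with_silent_fallback(url):
    """The pattern discussed above: any IOException triggers the pyarrow path."""
    try:
        return duckdb_read(url)
    except IOException:
        # HTTPException is caught here too, because it subclasses IOException,
        # so the caller never sees the 404.
        return pyarrow_read(url)


def read_strict(url):
    """The alternative: no fallback, so the original error reaches the caller."""
    return duckdb_read(url)


base = "s3://overturemaps-us-west-2/release/2024-07-22.0/theme=base/type=infrastructure"
print(read_with_silent_fallback(base))         # silently served by the pyarrow stand-in
print(read_with_silent_fallback(base + "/*"))  # served by the DuckDB stand-in
```

With read_strict, the URL without /* raises HTTPException instead of quietly returning a table with a binary geometry column, which is the behavior people are asking for here.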
EDIT: SEE NEW COMMENTS.
When we read a parquet file that was written with to_parquet (ibis) and contains a geometry column, it reads it back as binary. This is a bug on our end, as this doesn't happen if the file is written with plain duckdb.
Reproducer