SNOW-273899: Pandas >= 1.1.0 timestamp support with ns resolution #616
Comments
Pandas 1.1.0+ writes Parquet files with microsecond resolution by default; you need to explicitly pass the `version="2.0"` flag to `DataFrame.to_parquet` to keep nanosecond resolution. Additionally, the COPY INTO statement needs to be updated to use `TO_TIMESTAMP($1:"microsecond_tst_col"::INT, 6)` or `TO_TIMESTAMP($1:"nanosecond_tst_col"::INT, 9)`, depending on whether the `version="2.0"` flag was used to write the temporary Parquet file.
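For illustration, a minimal sketch of the two write paths described above (the file names, the single `date_field` column, and the example values are made up for this sketch):

```python
import pandas as pd

df_us = pd.DataFrame({"date_field": [pd.Timestamp("2021-04-08 14:36:53.388976")]})     # microsecond precision
df_ns = pd.DataFrame({"date_field": [pd.Timestamp("2021-04-04 18:15:35.000345023")]})  # nanosecond precision

# Default Parquet format (1.x): pyarrow writes timestamp[us]; a value with a non-zero
# sub-microsecond part raises ArrowInvalid ("... would lose data").
df_us.to_parquet("stage_us.parquet", compression="snappy")

# Parquet format 2.0: timestamp[ns] is preserved in the file, so the COPY INTO cast
# would need scale 9 instead of 6.
df_ns.to_parquet("stage_ns.parquet", compression="snappy", engine="pyarrow", version="2.0")
```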
Sorry, I was wrong in the description. It actually uses parquet …
@daniel-sali Doing this automatically in the library seems error-prone and will likely lead to more issues in the future. Do you think that adding an option for the user to override the …
@sfc-gh-mkeller, we have actually implemented the COPY INTO statement generation logic in our connector library, which creates a connection to Snowflake acting as our data warehouse.
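For illustration only, a rough sketch of what a user-side override could look like in COPY INTO generation code of the kind described above; the function and parameter names here are hypothetical and are not part of the connector API:

```python
def build_copy_into(table, stage, columns, ts_scale=6):
    """Build a COPY INTO statement with explicit timestamp casts.

    `columns` maps column names to Snowflake types; columns marked "timestamp" are
    cast via TO_TIMESTAMP. `ts_scale` would be the user-overridable knob: 6 for
    Parquet v1.x files (microseconds) and 9 for version="2.0" files (nanoseconds).
    """
    selects = []
    for name, col_type in columns.items():
        if col_type == "timestamp":
            selects.append('TO_TIMESTAMP($1:"{}"::INT, {})'.format(name, ts_scale))
        else:
            selects.append('$1:"{}"::{}'.format(name, col_type))
    return (
        "COPY INTO {}({}) FROM (SELECT {} FROM @{}) "
        "FILE_FORMAT=(TYPE = PARQUET) FORCE=TRUE".format(
            table, ", ".join(columns), ", ".join(selects), stage
        )
    )

# e.g. for a Parquet 2.0 file holding nanosecond timestamps:
sql = build_copy_into("oh_my_df", "parquet_test_ffr36",
                      {"id": "STRING", "date_field": "timestamp"}, ts_scale=9)
```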
Actually, I think this is a Snowflake database bug, not a connector issue. I tried to test different ways of loading data into Snowflake with the latest available libraries in my Jupyter lab, and it looks like pandas can now handle these timestamps and write them into parquet files with the right datatype, but Snowflake reads them as an invalid date. So, in short, I think the issue is on the Snowflake side and I need to figure out how to report a bug. I will share all the details step by step in the next comment.
I created a new virtual environment and installed the latest packages from pip. Here you can see the pip list output:

The most important: snowflake-connector-python 2.4.2, pyarrow 3.0.0, pandas 1.1.5, Python 3.6.5.

First, I tried to check how the new pandas behaves with parquet files, so I created two data frames: one with datetime objects, which have microsecond resolution, and one with pandas timestamps, which have nanosecond resolution.

```python
import datetime
import pandas as pd

df = pd.DataFrame([('awdawd', datetime.datetime.now()),
                   ('awdawdawd', datetime.datetime.now() - datetime.timedelta(hours=345))],
                  columns=['id', 'date_field'])
# output:
# id - object, date_field - datetime64[ns]
# awdawd       2021-04-08 14:36:53.388976
# awdawdawd    2021-03-25 05:36:53.388981

# and with ns
df_ns = pd.DataFrame([('some_ns_dumm', pd.Timestamp(2021, 4, 4, 18, 15, 35, 345, 23)),
                      ('more_ns_dummy', pd.Timestamp(2021, 4, 4, 18, 15, 35, 345, 23))],
                     columns=['id', 'date_field'])
# output:
# id - object, date_field - datetime64[ns]
# some_ns_dumm    2021-04-04 18:15:35.000345023
# more_ns_dummy   2021-04-04 18:15:35.000345023
```

Then I decided to store them as parquet files and look into the data and metadata:

```python
import pyarrow.parquet as pq

df.to_parquet('my_awasome_parquet_us_by_fact.parquet', compression='snappy')

pq.read_metadata('my_awasome_parquet_us_by_fact.parquet')
# output:
# <pyarrow._parquet.FileMetaData object at 0x7ffe59282258>
#   created_by: parquet-cpp version 1.5.1-SNAPSHOT
#   num_columns: 2
#   num_rows: 2
#   num_row_groups: 1
#   format_version: 1.0
#   serialized_size: 1933

# -----
pq.read_schema('my_awasome_parquet_us_by_fact.parquet')
# output:
# id: string
#   -- field metadata --
#   PARQUET:field_id: '1'
# date_field: timestamp[us]
#   -- field metadata --
#   PARQUET:field_id: '2'
# -- schema metadata --
# pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 491

# ----
table = pq.read_table('my_awasome_parquet_us_by_fact.parquet')
table.columns
# output:
# [<pyarrow.lib.ChunkedArray object at 0x7ffe5afb0888>
# [
#   [
#     "awdawd",
#     "awdawdawd"
#   ]
# ],
#  <pyarrow.lib.ChunkedArray object at 0x7ffe5afb0ba0>
# [
#   [
#     2021-04-08 14:36:53.388976,
#     2021-03-25 05:36:53.388981
#   ]
# ]]
```
As we can see, the current pandas library manages to write the parquet file with microsecond resolution automatically. Next, the one with nanosecond values:

```python
df_ns.to_parquet('my_awasome_parquet_ns_by_fact.parquet', compression='snappy')  # this one has values with nanoseconds
# output:
# ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1617560135000345023
```

So we got an error that it can't handle nanoseconds; so far it looks okay-ish. Now I can upload the first, successful parquet file manually:

```python
# engine is a SQLAlchemy engine connected to Snowflake, created earlier in the session
with engine.begin() as conn:
    conn.execute('create temporary stage demo_db.public.parquet_test_ffr34')
    conn.execute('PUT file:///some long path/workspace/my_awasome_parquet_us_by_fact.parquet @parquet_test_ffr34')
    conn.execute("""COPY INTO oh_my_df
                    FROM @parquet_test_ffr34
                    FILE_FORMAT=(TYPE = PARQUET)
                    MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE
                    FORCE=TRUE;""")
    conn.execute('drop stage demo_db.public.parquet_test_ffr34')
```

And I can see only `invalid date` values. Then I tried to upload the data using snowflake-connector methods and the result was the same. I tried two different methods with the same DataFrame that only contains datetime objects with microsecond resolution:
```python
# connection is a snowflake.connector connection, engine the SQLAlchemy engine from above
write_pandas(connection, df, 'oh_my_df', 'demo_db', 'public')                      # result: invalid dates
df.to_sql('oh_my_df', engine, index=False, method=pd_writer, if_exists='append')   # result: invalid dates
```

As expected, when I tried to write the DataFrame with nanosecond values it failed with the familiar exception:

```python
write_pandas(connection, df_ns, 'oh_my_df', 'demo_db', 'public')
# output:
# ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1617560135000345023
```

However, I was surprised that another method went through (although the result was still an invalid date):

```python
df_ns.to_sql('oh_my_df', engine, index=False, method=pd_writer, if_exists='append')  # result: invalid date
```

As a result I got invalid dates again. My next move was to try parquet version 2.0:

```python
df.to_parquet('my_awasome_parquet_us_by_fact_2.parquet', compression='snappy', engine='pyarrow', version='2.0')
df_ns.to_parquet('my_awasome_parquet_ns_by_fact_2.parquet', compression='snappy', engine='pyarrow', version='2.0')
```

The parquet data:
I've tried to upload the v2.0 files, for example by doing this:

```python
with engine.begin() as conn:
    conn.execute('create temporary stage demo_db.public.parquet_test_ffr36')
    conn.execute('PUT file:///long long path/workspace/my_awasome_parquet_ns_by_fact_2.parquet @parquet_test_ffr36')
    conn.execute("""COPY INTO oh_my_df
                    FROM @parquet_test_ffr36
                    FILE_FORMAT=(TYPE = PARQUET)
                    MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE
                    FORCE=TRUE;""")
    conn.execute('drop stage demo_db.public.parquet_test_ffr36')
```

but that attempt was not successful either, so I tried an explicit cast instead:

```python
with engine.begin() as conn:
    conn.execute('create temporary stage demo_db.public.parquet_test_ffr36')
    conn.execute('PUT file://long long path/workspace/my_awasome_parquet_ns_by_fact_2.parquet @parquet_test_ffr36')
    conn.execute("""COPY INTO oh_my_df(id, date_field)
                    FROM (SELECT $1:id::string, TO_TIMESTAMP($1:date_field::int, 9) FROM @parquet_test_ffr36)
                    FILE_FORMAT=(TYPE = PARQUET)
                    FORCE=TRUE;""")
    conn.execute('drop stage demo_db.public.parquet_test_ffr36')
```

And this was a successful try: I got some dates in the database, but the values had no nanoseconds. The nanoseconds were truncated, even though the data type is supposed to support them!!! So I decided to play with all my attempts some more and cast all date values in Snowflake to strings:

```sql
SELECT date_field::string FROM "DEMO_DB"."PUBLIC"."OH_MY_DF";
```
So, I would like to use …
@sfc-gh-yuliu Is Parquet 2.0 on our radar yet?
No, we don't have any plan to update to Parquet 2.0 yet.
To clarify @plotneishestvo's test: the issue with the invalid date is not caused by the connector. Parquet files store the timestamp's ns value as an integer, of the form 1619025787640560.
We get:
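For reference, the scale argument is what tells Snowflake how to interpret such an integer; a quick check (a sketch reusing the SQLAlchemy `engine` from the earlier comments, and assuming the integer above came from a Parquet v1.x file, i.e. microseconds since the epoch):

```python
with engine.begin() as conn:
    # scale 6 = microseconds since the epoch -> 2021-04-21 17:23:07.640560 (UTC)
    result = conn.execute("SELECT TO_TIMESTAMP(1619025787640560, 6)").scalar()
    print(result)
```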
Can we close this issue?
To clean up and re-prioritize bugs and feature requests, we are closing all issues older than 6 months as of March 1, 2023. If there are any issues or feature requests that you would like us to address, please re-create them. For urgent issues, opening a support case with Snowflake Community is the fastest way to get a response.
Please answer these questions before submitting your issue. Thanks!
What version of Python are you using (`python --version`)?
Python 3.6.5

What operating system and processor architecture are you using (`python -c 'import platform; print(platform.platform())'`)?
Darwin-19.6.0-x86_64-i386-64bit

What are the component versions in the environment (`pip freeze`)?

I'm just writing a pandas DataFrame with a datetime field to Snowflake using `pd_writer`.
code:
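The code block itself wasn't captured; a minimal sketch of the described flow, assuming snowflake-sqlalchemy is installed and using placeholder connection details:

```python
import datetime
import pandas as pd
from snowflake.connector.pandas_tools import pd_writer
from sqlalchemy import create_engine

# placeholder account/credentials
engine = create_engine("snowflake://<user>:<password>@<account>/demo_db/public")

df = pd.DataFrame([('awdawd', datetime.datetime.now())], columns=['id', 'date_field'])
df.to_sql('oh_my_df', engine, index=False, method=pd_writer, if_exists='append')
```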
What did you expect to see?
I expect to see the same datetime values in the database as I see in the dataframe.
What did you see instead?
I see `invalid date` values instead.

This is a continuation of another issue: #600. I'm not sure whether it is possible to solve this at the snowflake-connector level, but it looks like pandas >= 1.1.0, using parquet v2.0 for storage, successfully stores timestamp values with ns resolution; however, when I tried to upload the parquet 2.0 files manually I got the same result. It would be nice to either find a way to work with parquet v2.0 somehow, or with nanosecond resolution.