
SNOW-259668: write_pandas fails for some datetime64 values using PyArrow backend #600

Closed
willsthompson opened this issue Jan 12, 2021 · 25 comments

Comments

@willsthompson

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using (python --version)?

Python 3.8.5

  2. What operating system and processor architecture are you using (python -c 'import platform; print(platform.platform())')?

macOS-10.16-x86_64-i386-64bit

  3. What are the component versions in the environment (pip freeze)?

asn1crypto==1.4.0
attrs==20.3.0
docopt==0.6.2
ecdsa==0.16.1
Flask==1.1.2
future==0.18.2
jmespath==0.10.0
jsonformatter==0.2.3
jsonschema==3.1.1
more-itertools==8.6.0
numpy==1.19.5
oscrypto==1.2.1
packaging==20.8
paho-mqtt==1.5.1
pandas==1.0.3
pandasql==0.7.3
pbr==5.5.1
pluggy==0.13.1
py==1.10.0
pyarrow==2.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycryptodome==3.9.9
pycryptodomex==3.9.9
PyJWT==1.7.1
pykwalify==1.7.0
pyOpenSSL==20.0.1
pytest==5.4.3
python-box==5.2.0
python-dateutil==2.8.1
python-dotenv==0.12.0
python-jose==3.1.0
pytz==2020.5
PyYAML==5.3.1
requests==2.23.0
rsa==4.7
six==1.15.0
snowflake-connector-python==2.3.7
stevedore==3.3.0
urllib3==1.25.11
wcwidth==0.2.5
Werkzeug==0.16.1

  4. What did you do?

I invoked snowflake.connector.pandas_tools.write_pandas on a DataFrame with a column of type datetime64[ns] (using PyArrow as the default backend for ParquetWriter)

  5. What did you expect to see?

I expected the datetime data written to the database verbatim with nanosecond precision.

  6. What did you see instead?

write_pandas fails when Arrow tries writing to Parquet using the default arguments: pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data


IIUC, this problem is related to the PyArrow defaults: for compatibility reasons PyArrow defaults to Parquet version='1.0' behavior, which only supports ms timestamps, and it also defaults to allow_truncated_timestamps=False, which raises an exception whenever any timestamp precision is lost during writing. The end result is that ns-precision timestamps are always truncated to ms precision, and therefore an exception is always thrown.

Since Snowflake supports ns-precision timestamps, I would expect defaults that allow them to be written from DataFrames without error. However, since I imagine it's messy supporting various Parquet backends, I think at a minimum write_pandas should accept kwargs to be passed to the Parquet writer so users can tailor the behavior they want (and work around backend-specific problems like this one).
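
For reference, a minimal sketch (not connector code, and assuming pyarrow is installed) of writing the same data with pyarrow directly; coerce_timestamps, allow_truncated_timestamps, and use_deprecated_int96_timestamps are standard write_table options, though defaults vary across pyarrow versions:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A datetime64[ns] column like the one that makes write_pandas fail
df = pd.DataFrame({"ts": pd.to_datetime(["2021-01-12 10:20:30.123456789"])})
table = pa.Table.from_pandas(df)

# With the old version='1.0' defaults, ns timestamps can be downcast and ArrowInvalid raised on precision loss:
# pq.write_table(table, "out.parquet")

# Option 1: explicitly accept the (lossy) truncation to microseconds
pq.write_table(table, "out_us.parquet", coerce_timestamps="us", allow_truncated_timestamps=True)

# Option 2: keep nanosecond precision via int96 timestamps, which Snowflake's COPY can load
pq.write_table(table, "out_int96.parquet", use_deprecated_int96_timestamps=True)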

@github-actions github-actions bot changed the title write_pandas fails for some datetime64 values using PyArrow backend SNOW-259668: write_pandas fails for some datetime64 values using PyArrow backend Jan 12, 2021
@plotneishestvo

plotneishestvo commented Jan 13, 2021

Just as a temporary workaround, I use the pandas ceil or round function on the timestamp column to fix that. For example, with ceil:

pdf['some_datetime_column'] = pdf['some_datetime_column'].apply(lambda datetime_val: datetime_val.ceil('ms'))

Probably doesn't work as expected, see comment below

@willsthompson
Author

@plotneishestvo Thanks, yes, we're doing something similar. This is probably a little quicker for big tables:

df['col'] = df['col'].dt.ceil(freq='ms')

@plotneishestvo

Actually, for me this ceil function doesn't work. It successfully writes the data to Parquet and then to Snowflake, but as a result I get something like 53007-12-08 06:27:57.000, so my rows were transformed in a weird way: 2021-01-14 05:20:04.077 -> 53007-12-08 06:27:57.000

@willsthompson
Author

Hm, what's the column type? I have a test that round-trips a DataFrame with datetime64 to/from Snowflake, asserting that everything returned matches what was inserted. datetime64 columns are processed using the vectorized ceil from above on the way out and stored as DATETIME columns in Snowflake. Everything looks fine in Python and in Snowflake.

@willsthompson
Author

Oh, we're also using pyarrow==2.0.0, instead of the version the connector wants. Maybe there was a bug fix in the newer version?

@plotneishestvo

plotneishestvo commented Jan 29, 2021

I've tried pyarrow==2.0.0 but I have the same issue. I use the pandas integration with SQLAlchemy and the pandas to_sql function:

from snowflake.connector.pandas_tools import pd_writer
df.to_sql('oh_my_table', engine, index=False, method=pd_writer, if_exists='append')

This automatically creates the table for the dataframe, with the timestamp column typed as TIMESTAMP_NTZ(9), which sounds okay. I've tried updating everything, and instead of future dates I now get invalid dates 😅

My solution at the moment is to cast the date fields to ISO strings and manually create the table before using the pandas integration; Snowflake then converts the strings to dates automatically, and it works (see the sketch below). Just writing this here in case somebody else is tackling the same thing, because it took me a long time to discover that one small table had future dates because of this issue :D
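
As a rough sketch of that string workaround (assuming the manually created table declares the column as a timestamp type, and using the date_field column from the example in the next comment):

df['date_field'] = df['date_field'].dt.strftime('%Y-%m-%d %H:%M:%S.%f')  # ISO-style string; Snowflake parses it on insert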

@plotneishestvo

This is my simple test setup (I start clean, without any table in Snowflake):

import datetime
import pandas as pd
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import pd_writer

engine = create_engine('<my awasome connection URL>')

df = pd.DataFrame([('awdawd', datetime.datetime.now()), ('awdawdawd', datetime.datetime.now() - datetime.timedelta(hours=345))], columns=['id', 'date_field'])
df['date_field'] = df['date_field'].dt.ceil(freq='ms')
df.to_sql('oh_my_df', engine, index=False, method=pd_writer, if_exists='append')

my pip list:

Package Version


appnope 0.1.0
asn1crypto 1.4.0
attrs 19.3.0
azure-common 1.1.25
azure-core 1.8.0
azure-storage-blob 12.4.0
backcall 0.1.0
bleach 3.1.4
boto3 1.14.42
botocore 1.17.42
certifi 2020.4.5.1
cffi 1.14.1
chardet 3.0.4
credstash 1.17.1
cryptography 2.9.2
cycler 0.10.0
decorator 4.4.2
defusedxml 0.6.0
docutils 0.15.2
entrypoints 0.3
idna 2.9
importlib-metadata 1.6.0
ipykernel 5.2.0
ipython 7.13.0
ipython-genutils 0.2.0
ipywidgets 7.5.1
isodate 0.6.0
jedi 0.16.0
Jinja2 2.11.1
jmespath 0.10.0
json5 0.9.4
jsonschema 3.2.0
jupyter-client 6.1.2
jupyter-core 4.6.3
jupyterlab 2.1.0
jupyterlab-server 1.1.0
kiwisolver 1.2.0
MarkupSafe 1.1.1
matplotlib 3.2.1
mistune 0.8.4
mock 4.0.2
msrest 0.6.18
nbconvert 5.6.1
nbformat 5.0.5
notebook 6.0.3
numpy 1.18.2
oauthlib 3.1.0
oscrypto 1.2.1
packaging 20.4
pandas 1.1.5
pandocfilters 1.4.2
parso 0.6.2
pexpect 4.8.0
pickleshare 0.7.5
pip 20.0.2
plotly 4.6.0
prometheus-client 0.7.1
prompt-toolkit 3.0.5
psycopg2-binary 2.8.6
ptyprocess 0.6.0
pyarrow 0.17.1
pycparser 2.20
pycryptodomex 3.9.8
Pygments 2.6.1
PyJWT 1.7.1
pyOpenSSL 19.1.0
pyparsing 2.4.7
pyrsistent 0.16.0
python-dateutil 2.8.1
pytz 2019.3
pyzmq 19.0.0
requests 2.23.0
requests-oauthlib 1.3.0
retrying 1.3.3
s3transfer 0.3.3
Send2Trash 1.5.0
setuptools 45.2.0
six 1.14.0
snowflake-connector-python 2.3.9
snowflake-sqlalchemy 1.2.4
SQLAlchemy 1.3.22
sqlalchemy-redshift 0.8.1
terminado 0.8.3
testpath 0.4.4
tornado 6.0.4
traitlets 4.3.3
urllib3 1.25.8
wcwidth 0.1.9
webencodings 0.5.1
wheel 0.34.2
widgetsnbextension 3.5.1
zipp 3.1.0

So now, after updating the SQLAlchemy, pandas, and Snowflake libraries, this is what I see in the Snowflake web interface:
[screenshot: query result in the Snowflake web interface]

@plotneishestvo

Okay, it looks like Snowflake also has trouble understanding Parquet v2 timestamp values. I tried uploading data using Parquet version '2.0' and got the same invalid date result.

First I passed a parameter to pandas to store the same dataframe as Parquet version 2.0:

df.to_parquet('my_awasome_parquet.parquet', compression='snappy', engine='pyarrow', version='2.0')  # version='2.0' is the key

and just to check

pq.read_metadata('my_awasome_parquet.parquet')
#output:
<pyarrow._parquet.FileMetaData object at 0x11ee05778>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 2.0
  serialized_size: 1940

After that I manually uploaded data to the table using this Parquet 2.0 file and got the same invalid date result.
Then I tried querying my staged Parquet v2.0 file directly, and it was fine:

SELECT $1:date_field FROM @"TEST_DB"."PUBLIC".oh_my_df ;

[screenshot: the raw $1:date_field value is returned correctly]

However, casting to timestamp resulted in an invalid date:

SELECT $1:date_field::timestamp_ntz FROM @"TEST_DB"."PUBLIC".oh_my_df ;

[screenshot: the cast value shows as an invalid date]

@plotneishestvo

Okay, starting with pandas 1.1.0, Parquet 2.0 is used automatically, and this is why I started receiving invalid dates. I will open a new issue about supporting pandas 1.1.0 and Parquet 2.0.

@willsthompson
Author

This is a little out of scope for this issue, but if at all possible it would be helpful to support newer versions of PyArrow (they just released v3 early this week), for compatibility and bugfixes, but also because the early versions of PyArrow required by the Snowflake connector are enormous payloads, over 200MB IIRC. Newer releases are in the 50MB range. This can have a significant impact on deployments.

@sfc-gh-mkeller
Collaborator

...(they just released v3 early this week)...

👀 #565

@daniel-sali

I drafted a possible solution to the issue here:
#616 (comment)

@sfc-gh-kzaveri
Contributor

Hi all, we continue to work on a long-term solution internally. We're sorry about any inconvenience this issue might be causing. As a temporary workaround, we suggest passing a timezone to any timestamp values when building your pandas dataframe. The following works for millisecond and microsecond timestamps. For example, instead of using:

ts = pd.Timestamp(1621291701002, unit="ms")
pd.DataFrame([('a', ts)], columns=['ID', 'DATE_FIELD'])

you can define the dataframe as:

ts = pd.Timestamp(1621291701002, unit="ms", tz="UTC")
pd.DataFrame([('a', ts)], columns=['ID', 'DATE_FIELD'])

When the dataframe is unloaded into a parquet file and COPY'ed into Snowflake, the value will be correctly parsed:
[screenshot: the value is parsed correctly in Snowflake]

As for ns timestamps, specifically the issue mentioned by @willsthompson, using df.to_sql will avoid the error you are receiving as that converts nanosecond timestamps to microsecond timestamps (and then using the timezone suggestion above will ensure you see valid dates in Snowflake). Once again, we apologize for any inconvenience as a result of this issue.
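
For illustration, a hedged sketch combining both suggestions above, tz-aware timestamps plus df.to_sql with pd_writer (the connection URL and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import pd_writer

engine = create_engine('<your snowflake-sqlalchemy URL>')

# ns-precision, tz-aware timestamp; per the comment above, to_sql converts ns to us before writing Parquet
ts = pd.Timestamp("2021-05-17 22:48:21.002123456", tz="UTC")
df = pd.DataFrame([("a", ts)], columns=["ID", "DATE_FIELD"])
df.to_sql("oh_my_table", engine, index=False, method=pd_writer, if_exists="append")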

@bcikili

bcikili commented Dec 14, 2021

I came across the same problem; I solved it by converting the datetime64 column to strings:
output['SUBMODEL_VALUE_DATE'] = output['SUBMODEL_VALUE_DATE'].dt.strftime("%Y-%m-%d")

@russellpierce

It seems like this issue still persists. Moreover, for [ns] timestamps it seems it can also affect the pandas to_sql method even with pd_writer. Is there a known combination of versions for which this is resolved?

@aduverger

aduverger commented May 12, 2022

I solved the problem by adding allow_truncated_timestamps=True to the chunk.to_parquet call.
According to the pyarrow docs:

Some Parquet readers may only support timestamps stored in millisecond ('ms') or microsecond ('us') resolution. Since pandas uses nanoseconds to represent timestamps, this can occasionally be a nuisance. By default (when writing version 1.0 Parquet files), the nanoseconds will be cast to microseconds (‘us’).
If a cast to a lower resolution value may result in a loss of data, by default an exception will be raised. This can be suppressed by passing allow_truncated_timestamps=True

In /snowflake/connector/pandas_tools.py line 159:

chunk.to_parquet(
    chunk_path,
    compression=compression,
    allow_truncated_timestamps=True,
)

or alternatively:

chunk.to_parquet(
    chunk_path,
    compression=compression,
    use_deprecated_int96_timestamps=True,
)

Does this solve your problem too?

@chamini2

chamini2 commented Jul 29, 2022

Hey, I am trying the alternative proposed by @sfc-gh-kzaveri and I am sending a dataframe with the tz defined:

df['my_datetime'][0]
Timestamp('2022-01-01 14:50:59+0000', tz='UTC')

The table is created as TIMESTAMP_NTZ

[screenshot: table definition showing a TIMESTAMP_NTZ column]

It is uploaded without time zone info

[screenshot: the stored value has no time zone information]

And when read back with df: pd.DataFrame = cur.fetch_pandas_all(), it arrives in the DataFrame without an associated time zone:

Timestamp('2022-01-01 14:50:59')

Any ideas what might make this behave like this? Any specific versions I should be using?

@ChuliangXiao

ChuliangXiao commented Nov 4, 2022

Having similar issues with Snowpark write_pandas and createDataFrame: a pandas timestamp column

  • with local TimeZone --> correct TIMESTAMP_NTZ in Snowflake
  • without TimeZone --> NUMBER in Snowflake

@fredrike

Having similar issues with Snowpark write_pandas and createDataFrame: a pandas timestamp column

  • with local TimeZone --> correct TIMESTAMP_NTZ in Snowflake
  • without TimeZone --> NUMBER in Snowflake

Thanks for pointing this out!

The following snippet helped me:

df['Date'] = df['Date'].dt.tz_localize("UTC+01:00").dt.ceil(freq='ms')

@DzimitryM

DzimitryM commented Mar 1, 2023

Adding the parameter use_deprecated_int96_timestamps=True to write_pandas() helped me. The previous snippet by @fredrike, rounding the timestamps, also worked, but this approach preserves the original timestamp values without truncating them:

write_pandas(con, df, "TABLE_NAME", use_deprecated_int96_timestamps=True)

@sfc-gh-aalam
Collaborator

We are working on a more permanent solution internally and will provide an update next quarter.

@willsthompson
Author

Hi @sfc-gh-aalam, do you have an update on this?

@sfc-gh-aalam
Collaborator

Take a look at #1687. Can you try using use_logical_type?
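
For anyone landing here later, a minimal sketch of that suggestion, assuming a connector version that includes #1687 (conn, df, and the table name are illustrative):

from snowflake.connector.pandas_tools import write_pandas

success, nchunks, nrows, _ = write_pandas(conn, df, "TABLE_NAME", use_logical_type=True)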

@ericpettengill

Nice! This works with SQLAlchemy as well. Previously I had to convert pd.Timestamp -> str when using method=pd_writer and dtype={}.

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
from snowflake.connector.pandas_tools import make_pd_writer
from snowflake.sqlalchemy import DATETIME

creds = {
    'user': '',
    'password': '',
    'account': '',
    'warehouse': '',
    'role': '',
    'database': ''
}
engine = create_engine(URL(**creds))


time = pd.date_range('2023-01-01', '2023-01-31', freq='1h')
df = pd.DataFrame({
    'timestamp': time,
    'test_col': ['XYZ123'] * len(time),
    'values': np.random.random(len(time))
})
# df['timestamp'] = df.timestamp.astype(str)  ### previous workaround for proper pd.Timestamp -> snowflake timestamp conversion

with engine.connect() as con:
    df.to_sql(
        name='test_table',
        schema='schema_name',
        index=False,
        con=con,
        if_exists='replace',
        # dtype={'timestamp': DATETIME},  ### previous workaround for proper pd.Timestamp -> snowflake timestamp conversion
        # method=pd_writer  ### previous workaround
        method=make_pd_writer(use_logical_type=True)
    )

@sfc-gh-aalam
Collaborator

Closing this issue as use_logical_type is working as expected. Please open a new issue if you run into further problems.
