SNOW-273899: Pandas >= 1.1.0 timestamp support with ns resolution #616

Closed
plotneishestvo opened this issue Jan 29, 2021 · 11 comments

plotneishestvo commented Jan 29, 2021

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using (python --version)?
    Python 3.6.5

  2. What operating system and processor architecture are you using (python -c 'import platform; print(platform.platform())')?
    Darwin-19.6.0-x86_64-i386-64bit

  3. What are the component versions in the environment (pip freeze)?

```
appnope==0.1.0 asn1crypto==1.4.0 attrs==19.3.0 azure-common==1.1.25 azure-core==1.8.0 azure-storage-blob==12.4.0 backcall==0.1.0 bleach==3.1.4 boto3==1.14.42 botocore==1.17.42
certifi==2020.4.5.1 cffi==1.14.1 chardet==3.0.4 credstash==1.17.1 cryptography==2.9.2 cycler==0.10.0 decorator==4.4.2 defusedxml==0.6.0 docutils==0.15.2 entrypoints==0.3
idna==2.9 importlib-metadata==1.6.0 ipykernel==5.2.0 ipython==7.13.0 ipython-genutils==0.2.0 ipywidgets==7.5.1 isodate==0.6.0 jedi==0.16.0 Jinja2==2.11.1 jmespath==0.10.0
json5==0.9.4 jsonschema==3.2.0 jupyter-client==6.1.2 jupyter-core==4.6.3 jupyterlab==2.1.0 jupyterlab-server==1.1.0 kiwisolver==1.2.0 MarkupSafe==1.1.1 matplotlib==3.2.1 mistune==0.8.4
mock==4.0.2 msrest==0.6.18 nbconvert==5.6.1 nbformat==5.0.5 notebook==6.0.3 numpy==1.18.2 oauthlib==3.1.0 oscrypto==1.2.1 packaging==20.4 pandas==1.1.5
pandocfilters==1.4.2 parso==0.6.2 pexpect==4.8.0 pickleshare==0.7.5 plotly==4.6.0 prometheus-client==0.7.1 prompt-toolkit==3.0.5 psycopg2-binary==2.8.6 ptyprocess==0.6.0 pyarrow==0.17.1
pycparser==2.20 pycryptodomex==3.9.8 Pygments==2.6.1 PyJWT==1.7.1 pyOpenSSL==19.1.0 pyparsing==2.4.7 pyrsistent==0.16.0 python-dateutil==2.8.1 pytz==2019.3 pyzmq==19.0.0
requests==2.23.0 requests-oauthlib==1.3.0 retrying==1.3.3 s3transfer==0.3.3 Send2Trash==1.5.0 six==1.14.0 snowflake-connector-python==2.3.9 snowflake-sqlalchemy==1.2.4 SQLAlchemy==1.3.22 sqlalchemy-redshift==0.8.1
terminado==0.8.3 testpath==0.4.4 tornado==6.0.4 traitlets==4.3.3 urllib3==1.25.8 wcwidth==0.1.9 webencodings==0.5.1 widgetsnbextension==3.5.1 zipp==3.1.0
```
  4. What did you do?
    I'm writing a pandas DataFrame with a datetime field to Snowflake using pd_writer.
    code:
import datetime
import pandas as pd
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import pd_writer

engine = create_engine('<my awasome connection URL>')

df = pd.DataFrame([('awdawd', datetime.datetime.now()), ('awdawdawd', datetime.datetime.now() - datetime.timedelta(hours=345))], columns=['id', 'date_field'])
df.to_sql('oh_my_df', engine, index=False, method=pd_writer, if_exists='append')
  5. What did you expect to see?
    I'm expecting to see in the database the same datetime values as I see in the dataframe.

  6. What did you see instead?
    I see invalid date values instead.

This is a continuation of another issue: #600. I'm not sure if it is possible to solve this at the snowflake-connector level, but it looks like pandas >= 1.1.0, using Parquet v2.0, successfully stores timestamp values with ns resolution; however, when I tried to upload Parquet v2.0 files manually I got the same result.

It would be nice to find a way to work either with Parquet v2.0 or with nanosecond resolution.

github-actions bot changed the title "Pandas >= 1.1.0 timestamp support with ns resolution" to "SNOW-273899: Pandas >= 1.1.0 timestamp support with ns resolution" on Jan 29, 2021
daniel-sali commented Mar 25, 2021

Pandas 1.1.0+ writes Parquet files with microsecond resolution by default; you need to explicitly pass version="2.0" to write timestamps with nanosecond resolution (this limits the writable min/max timestamps due to the INT64 representation, I assume).
Example:
df.to_parquet(chunk_path, compression="snappy", index=index, version="2.0")
Source: https://github.com/pandas-dev/pandas/blob/7d9dd1a68334cd7a8ed19629c63e84713abc77ef/pandas/tests/io/test_parquet.py#L882
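A quick way to check which resolution actually landed in the file (a small sketch; it assumes pyarrow is installed and that chunk_path points at the Parquet file written above, with the date_field column from this thread's example):

```
import pyarrow.parquet as pq

chunk_path = "chunk.parquet"  # assumed path of the file written above

# The format version and the timestamp unit show whether nanoseconds survived.
print(pq.read_metadata(chunk_path).format_version)          # e.g. '2.0'
print(pq.read_schema(chunk_path).field("date_field").type)  # e.g. timestamp[ns]
```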

The write_pandas and pd_writer snowflake-connector-python functions should receive an optional argument controlling whether to use nanoseconds when writing the temporary Parquet files, which would set the version="2.0" flag in

chunk.to_parquet(chunk_path, compression=compression)

Additionally, the COPY INTO statement generation in

parquet_columns = "$1:" + ",$1:".join(df.columns)

needs to be updated to use TO_TIMESTAMP($1:"microsecond_tst_col"::INT,6) or TO_TIMESTAMP($1:"nanosecond_tst_col"::INT,9), depending on whether the version="2.0" flag was used to write the temporary Parquet file.
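A rough sketch of what that projection change could look like (a hypothetical helper, not connector code; which columns need which scale is assumed to be known by the caller):

```
def build_parquet_columns(df, ns_columns=(), us_columns=()):
    """Build the COPY INTO column projection, casting raw INT64 timestamp
    columns back to TIMESTAMP at the appropriate scale."""
    parts = []
    for col in df.columns:
        if col in ns_columns:
            parts.append(f'TO_TIMESTAMP($1:"{col}"::INT, 9)')   # nanoseconds
        elif col in us_columns:
            parts.append(f'TO_TIMESTAMP($1:"{col}"::INT, 6)')   # microseconds
        else:
            parts.append(f'$1:"{col}"')
    return ",".join(parts)
```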

plotneishestvo (Author)

Sorry, I was wrong in the description: it actually uses parquet version=1.0 by default. However, newer pandas with pyarrow manages to work fine with nanosecond time resolution, but Snowflake can't automatically cast the date fields, and as a result I see invalid date... Even when copying manually I'm using the flag to match columns by name. Using TO_TIMESTAMP($1:"nanosecond_tst_col"::INT,9) can help if I parse manually, but it would be nice to have automatic casting work correctly, since the column type in the table already has nanosecond resolution by default.

sfc-gh-mkeller (Collaborator)

> to use TO_TIMESTAMP($1:"microsecond_tst_col"::INT,6) or TO_TIMESTAMP($1:"nanosecond_tst_col"::INT,9) depending on whether the version="2.0" flag was used to write the temporary Parquet file.

@daniel-sali doing this automatically in the library seems error-prone and will likely lead to more issues in the future. Do you think that adding an option for the user to override the COPY INTO statement generated by our pandas_tools would be a good solution for this?
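Purely for illustration, such an override could be as small as letting the caller supply the SELECT projection; the parameter below is hypothetical and does not exist in snowflake-connector-python:

```
# Hypothetical usage sketch -- 'copy_into_columns' is NOT a real write_pandas
# parameter; it only illustrates the kind of override being discussed.
custom_projection = '$1:"id", TO_TIMESTAMP($1:"date_field"::INT, 9)'
# write_pandas(connection, df_ns, 'oh_my_df', 'demo_db', 'public',
#              copy_into_columns=custom_projection)
```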

daniel-sali

@sfc-gh-mkeller, we have actually implemented the COPY INTO statement generation logic in our connector library, which creates a connection to Snowflake acting as our data warehouse.
See https://github.com/nuvolos-cloud/python-connector/blob/master/nuvolos/sql_utils.py , entry point is the to_sql function.
The code is based on the write_pandas function and pandas' io/sql.py for DataFrame column type -> Snowflake column type mapping.
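For context, the DataFrame-dtype-to-Snowflake-type part of such a mapping might look roughly like this (a simplified sketch under my own assumptions, not the actual nuvolos or pandas code):

```
import numpy as np

def snowflake_type_for(dtype) -> str:
    # Simplified sketch: datetime64[ns] columns map to a nanosecond-capable
    # TIMESTAMP_NTZ(9); anything unrecognised falls back to TEXT.
    if np.issubdtype(dtype, np.datetime64):
        return "TIMESTAMP_NTZ(9)"
    if np.issubdtype(dtype, np.integer):
        return "NUMBER"
    if np.issubdtype(dtype, np.floating):
        return "FLOAT"
    if dtype == np.bool_:
        return "BOOLEAN"
    return "TEXT"
```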

plotneishestvo (Author)

Actually, I think this is a Snowflake database bug rather than a connector issue. I tried different ways of loading data into Snowflake with the latest available libraries in my JupyterLab, and it looks like pandas can handle ms timestamps now and writes them with the right datatype into Parquet files, but Snowflake reads them as an invalid date. In short, I think the issue is on the Snowflake side and I need to figure out how to report the bug. I will share all the details step by step in the next comment.

plotneishestvo (Author)

I created a new virtual environment and installed the latest packages from pip. Here you can see pip list output:

Package                    Version
-------------------------- ---------
anyio                      2.2.0
appnope                    0.1.2
argon2-cffi                20.1.0
asn1crypto                 1.4.0
async-generator            1.10
attrs                      20.3.0
azure-common               1.1.27
azure-core                 1.13.0
azure-storage-blob         12.8.0
Babel                      2.9.0
backcall                   0.2.0
bleach                     3.3.0
boto3                      1.17.45
botocore                   1.20.45
certifi                    2020.12.5
cffi                       1.14.5
chardet                    3.0.4
contextvars                2.4
cryptography               3.4.7
dataclasses                0.8
decorator                  5.0.5
defusedxml                 0.7.1
entrypoints                0.3
greenlet                   1.0.0
idna                       2.10
immutables                 0.15
importlib-metadata         3.10.0
ipykernel                  5.5.3
ipython                    7.16.1
ipython-genutils           0.2.0
isodate                    0.6.0
jedi                       0.17.2
Jinja2                     2.11.3
jmespath                   0.10.0
json5                      0.9.5
jsonschema                 3.2.0
jupyter-client             6.1.12
jupyter-core               4.7.1
jupyter-lsp                1.1.4
jupyter-packaging          0.7.12
jupyter-server             1.5.1
jupyterlab                 3.0.12
jupyterlab-lsp             3.5.0
jupyterlab-pygments        0.1.2
jupyterlab-server          2.4.0
MarkupSafe                 1.1.1
mistune                    0.8.4
msrest                     0.6.21
nbclassic                  0.2.6
nbclient                   0.5.3
nbconvert                  6.0.7
nbformat                   5.1.3
nest-asyncio               1.5.1
notebook                   6.3.0
numpy                      1.19.5
oauthlib                   3.1.0
oscrypto                   1.2.1
packaging                  20.9
pandas                     1.1.5
pandocfilters              1.4.3
parso                      0.7.1
pexpect                    4.8.0
pickleshare                0.7.5
pip                        21.0.1
pluggy                     0.13.1
prometheus-client          0.10.0
prompt-toolkit             3.0.18
ptyprocess                 0.7.0
pyarrow                    3.0.0
pycparser                  2.20
pycryptodomex              3.10.1
Pygments                   2.8.1
PyJWT                      2.0.1
pyOpenSSL                  19.1.0
pyparsing                  2.4.7
pyrsistent                 0.17.3
python-dateutil            2.8.1
python-jsonrpc-server      0.4.0
python-language-server     0.36.2
pytz                       2021.1
pyzmq                      22.0.3
requests                   2.25.1
requests-oauthlib          1.3.0
s3transfer                 0.3.6
Send2Trash                 1.5.0
setuptools                 39.0.1
six                        1.15.0
sniffio                    1.2.0
snowflake-connector-python 2.4.2
snowflake-sqlalchemy       1.2.4
SQLAlchemy                 1.4.6
terminado                  0.9.4
testpath                   0.4.4
tornado                    6.1
traitlets                  4.3.3
typing-extensions          3.7.4.3
ujson                      4.0.2
urllib3                    1.26.4
wcwidth                    0.2.5
webencodings               0.5.1
zipp                       3.4.1

Most importantly: snowflake-connector-python 2.4.2, pyarrow 3.0.0, pandas 1.1.5, and Python 3.6.5.

First, I wanted to check how the new pandas behaves with Parquet files, so I created two data frames: one with datetime objects, which have microsecond resolution, and one with pandas Timestamps, which have nanosecond resolution.

df = pd.DataFrame([('awdawd', datetime.datetime.now()), ('awdawdawd', datetime.datetime.now() - datetime.timedelta(hours=345))], columns=['id', 'date_field'])
# output:
# id (object)      date_field (datetime64[ns])
# awdawd           2021-04-08 14:36:53.388976
# awdawdawd        2021-03-25 05:36:53.388981

# and with ns
df_ns = pd.DataFrame([('some_ns_dumm', pd.Timestamp(2021, 4, 4, 18, 15, 35, 345, 23)), ('more_ns_dummy', pd.Timestamp(2021, 4, 4, 18, 15, 35, 345, 23))], columns=['id', 'date_field'])

# output:
# id (object)      date_field (datetime64[ns])
# some_ns_dumm     2021-04-04 18:15:35.000345023
# more_ns_dummy    2021-04-04 18:15:35.000345023
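(A quick sanity check, assuming the same session, confirms the nanosecond component really is present in the frame itself:)

```
print(df_ns['date_field'].dtype)          # datetime64[ns]
print(df_ns['date_field'].dt.nanosecond)  # 23 for both rows
```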

Then I decided to store it as a Parquet file and look at the data and metadata:

import pyarrow.parquet as pq  # needed for the metadata/schema calls below

df.to_parquet('my_awasome_parquet_us_by_fact.parquet', compression='snappy')

pq.read_metadata('my_awasome_parquet_us_by_fact.parquet')

# output:
<pyarrow._parquet.FileMetaData object at 0x7ffe59282258>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 1933
# -----

pq.read_schema('my_awasome_parquet_us_by_fact.parquet')
# output:
id: string
  -- field metadata --
  PARQUET:field_id: '1'
date_field: timestamp[us]
  -- field metadata --
  PARQUET:field_id: '2'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 491
#----

table = pq.read_table('my_awasome_parquet_us_by_fact.parquet')
table.columns
# output:
[<pyarrow.lib.ChunkedArray object at 0x7ffe5afb0888>
 [
   [
     "awdawd",
     "awdawdawd"
   ]
 ],
 <pyarrow.lib.ChunkedArray object at 0x7ffe5afb0ba0>
 [
   [
     2021-04-08 14:36:53.388976,
     2021-03-25 05:36:53.388981
   ]
 ]]

As we can see, the current pandas library manages to write a Parquet file with microsecond resolution automatically: timestamp[us]. So it looks like the issue described in #600 is solved, but it is not; I will show that later. First, let's try to do the same with the other data frame:

df_ns.to_parquet('my_awasome_parquet_ns_by_fact.parquet', compression='snappy')  # this one with values that have nanoseconds

# output
ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1617560135000345023

So, we got an error saying it can't handle nanoseconds; so far it looks okay-ish. Now I can upload the first, successfully written Parquet file manually:

with engine.begin() as conn:
    conn.execute('create temporary stage demo_db.public.parquet_test_ffr34')
    conn.execute('PUT file:///some long path/workspace/my_awasome_parquet_us_by_fact.parquet @parquet_test_ffr34')
    conn.execute("""COPY INTO oh_my_df 
                FROM @parquet_test_ffr34
                FILE_FORMAT=(TYPE = PARQUET)
                MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE 
                FORCE=TRUE;""")
    conn.execute('drop stage demo_db.public.parquet_test_ffr34')

And I can see only invalid dates as values in Snowflake!
(Screenshot 2021-04-08 at 14 33 50)

Then I tried to upload data using the snowflake-connector methods, and the result was the same: invalid date.

I tried two different methods with the same DataFrame that contains only datetime objects with microsecond resolution:

write_pandas(connection, df, 'oh_my_df', 'demo_db', 'public')  # result invalid dates
df.to_sql('oh_my_df', engine, index=False, method=pd_writer, if_exists='append') # result invalid dates

As expected, when I tried to write the DataFrame with nanosecond values it failed with the familiar exception:

write_pandas(connection, df_ns, 'oh_my_df', 'demo_db', 'public')
# output
ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1617560135000345023

However, I was surprised that another method worked:

df_ns.to_sql('oh_my_df', engine, index=False, method=pd_writer, if_exists='append')  # result -  invalid date

As a result I got invalid date in the database, but the call itself worked; I have no idea how 🤷
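As an aside, a workaround sketch (assuming losing sub-microsecond precision is acceptable): truncating the column before writing avoids the ArrowInvalid error entirely:

```
# Drop everything below microseconds so pyarrow can write timestamp[us]
df_ns_us = df_ns.copy()
df_ns_us['date_field'] = df_ns_us['date_field'].dt.floor('us')
df_ns_us.to_parquet('my_awasome_parquet_ns_floored.parquet', compression='snappy')
```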

My next move was to try to use parquet version 2.0:

df.to_parquet('my_awasome_parquet_us_by_fact_2.parquet', compression='snappy', engine='pyarrow', version='2.0')
df_ns.to_parquet('my_awasome_parquet_ns_by_fact_2.parquet', compression='snappy', engine='pyarrow', version='2.0')

the parquet data:

file with microsecond values:
<pyarrow._parquet.FileMetaData object at 0x7ffe5b49de08>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 2.0
  serialized_size: 1931

columns:
id: string
  -- field metadata --
  PARQUET:field_id: '1'
date_field: timestamp[ns]
  -- field metadata --
  PARQUET:field_id: '2'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 491

data:
[<pyarrow.lib.ChunkedArray object at 0x7ffe5b49de60>
 [
   [
     "awdawd",
     "awdawdawd"
   ]
 ],
 <pyarrow.lib.ChunkedArray object at 0x7ffe5b49deb8>
 [
   [
     2021-04-08 14:36:53.388976000,
     2021-03-25 05:36:53.388981000
   ]
 ]]

file with nanoseconds values:
<pyarrow._parquet.FileMetaData object at 0x7ffe5b49dfc0>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 2.0
  serialized_size: 1941

schema:
id: string
  -- field metadata --
  PARQUET:field_id: '1'
date_field: timestamp[ns]
  -- field metadata --
  PARQUET:field_id: '2'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 491

data:
[<pyarrow.lib.ChunkedArray object at 0x7ffe59282468>
 [
   [
     "some_ns_dumm",
     "more_ns_dummy"
   ]
 ],
 <pyarrow.lib.ChunkedArray object at 0x7ffe5b4a9048>
 [
   [
     2021-04-04 18:15:35.000345023,
     2021-04-04 18:15:35.000345023
   ]
 ]]

I've tried to upload v2.0 files, for example by doing this:

with engine.begin() as conn:
    conn.execute('create temporary stage demo_db.public.parquet_test_ffr36')
    conn.execute('PUT file:///long long path/workspace/my_awasome_parquet_ns_by_fact_2.parquet @parquet_test_ffr36')
    conn.execute("""COPY INTO oh_my_df 
                FROM @parquet_test_ffr36
                FILE_FORMAT=(TYPE = PARQUET)
                MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE 
                FORCE=TRUE;""")
    conn.execute('drop stage demo_db.public.parquet_test_ffr36')

but got invalid date as usual; however, manual casting helps:

with engine.begin() as conn:
    conn.execute('create temporary stage demo_db.public.parquet_test_ffr36')
    conn.execute('PUT file://long long path/workspace/my_awasome_parquet_ns_by_fact_2.parquet @parquet_test_ffr36')
    conn.execute("""COPY INTO oh_my_df(id, date_field)
                FROM (SELECT $1:id::string, TO_TIMESTAMP($1:date_field::int, 9) FROM @parquet_test_ffr36)
                FILE_FORMAT=(TYPE = PARQUET)
                FORCE=TRUE;""")
    conn.execute('drop stage demo_db.public.parquet_test_ffr36')

And this was a successful try: I got actual dates in the database, but the values had no nanoseconds; the nanoseconds were truncated, even though the data type is supposed to support them!
(Screenshot 2021-04-08 at 17 55 00)

So I decided to play with all my attempts some more and cast all the date values in Snowflake to strings:

SELECT date_field::string FROM "DEMO_DB"."PUBLIC"."OH_MY_DF";

and this is what I got:
(Screenshot 2021-04-08 at 17 57 48)

So, I would like to be able to use Parquet 2.0 in snowflake-connector-python; however, in my opinion the bigger issue is on the Snowflake database side. Now I need to find a place where I can submit an issue 😅

sfc-gh-mkeller (Collaborator)

@sfc-gh-yuliu is parquet 2.0 on our radar yet?

sfc-gh-yuliu

No, we don't have any plan to update to Parquet 2.0 yet.

sfc-gh-spanaite commented Apr 23, 2021

To clarify @plotneishestvo's test:

The invalid date issue is not caused by the connector. Parquet files store the timestamp with ns values as an integer, in the form 1619025787640560.
When it comes into Snowflake, a datetime field can't match an integer unless we cast it or treat it as a string. This test shows it from the Snowflake UI:

SELECT 1619025787640560::datetime AS invalid_date, '1619025787640560'::datetime AS valid_date;

We get:

(Screenshot 2021-04-23 at 12 35 39)
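A related sketch building on the same point (using the SQLAlchemy engine from the earlier snippets): telling Snowflake the integer's scale explicitly, as the manual COPY INTO above does with TO_TIMESTAMP, converts the value correctly:

```
with engine.begin() as conn:
    # Scale 6 tells Snowflake the integer is in microseconds since the epoch,
    # so it converts to a proper timestamp instead of an invalid date.
    row = conn.execute("SELECT TO_TIMESTAMP(1619025787640560, 6) AS ts_us").fetchone()
    print(row)
```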

iamontheinet (Member)

Can we close this issue?

sfc-gh-aling (Collaborator)

To clean up and re-prioritize bugs and feature requests, we are closing all issues older than 6 months as of March 1, 2023. If there are any issues or feature requests that you would like us to address, please re-create them. For urgent issues, opening a support case via the Snowflake Community link is the fastest way to get a response.
