SNOW-273899: Pandas >= 1.1.0 timestamp support with ns resolution #616

Closed
plotneishestvo opened this issue Jan 29, 2021 · 11 comments

plotneishestvo commented Jan 29, 2021

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using (python --version)?
    Python 3.6.5

  2. What operating system and processor architecture are you using (python -c 'import platform; print(platform.platform())')?
    Darwin-19.6.0-x86_64-i386-64bit

  3. What are the component versions in the environment (pip freeze)?

```
appnope==0.1.0 asn1crypto==1.4.0 attrs==19.3.0 azure-common==1.1.25 azure-core==1.8.0 azure-storage-blob==12.4.0 backcall==0.1.0 bleach==3.1.4 boto3==1.14.42 botocore==1.17.42
certifi==2020.4.5.1 cffi==1.14.1 chardet==3.0.4 credstash==1.17.1 cryptography==2.9.2 cycler==0.10.0 decorator==4.4.2 defusedxml==0.6.0 docutils==0.15.2 entrypoints==0.3
idna==2.9 importlib-metadata==1.6.0 ipykernel==5.2.0 ipython==7.13.0 ipython-genutils==0.2.0 ipywidgets==7.5.1 isodate==0.6.0 jedi==0.16.0 Jinja2==2.11.1 jmespath==0.10.0
json5==0.9.4 jsonschema==3.2.0 jupyter-client==6.1.2 jupyter-core==4.6.3 jupyterlab==2.1.0 jupyterlab-server==1.1.0 kiwisolver==1.2.0 MarkupSafe==1.1.1 matplotlib==3.2.1 mistune==0.8.4
mock==4.0.2 msrest==0.6.18 nbconvert==5.6.1 nbformat==5.0.5 notebook==6.0.3 numpy==1.18.2 oauthlib==3.1.0 oscrypto==1.2.1 packaging==20.4 pandas==1.1.5
pandocfilters==1.4.2 parso==0.6.2 pexpect==4.8.0 pickleshare==0.7.5 plotly==4.6.0 prometheus-client==0.7.1 prompt-toolkit==3.0.5 psycopg2-binary==2.8.6 ptyprocess==0.6.0 pyarrow==0.17.1
pycparser==2.20 pycryptodomex==3.9.8 Pygments==2.6.1 PyJWT==1.7.1 pyOpenSSL==19.1.0 pyparsing==2.4.7 pyrsistent==0.16.0 python-dateutil==2.8.1 pytz==2019.3 pyzmq==19.0.0
requests==2.23.0 requests-oauthlib==1.3.0 retrying==1.3.3 s3transfer==0.3.3 Send2Trash==1.5.0 six==1.14.0 snowflake-connector-python==2.3.9 snowflake-sqlalchemy==1.2.4 SQLAlchemy==1.3.22 sqlalchemy-redshift==0.8.1
terminado==0.8.3 testpath==0.4.4 tornado==6.0.4 traitlets==4.3.3 urllib3==1.25.8 wcwidth==0.1.9 webencodings==0.5.1 widgetsnbextension==3.5.1 zipp==3.1.0
```
  4. What did you do?
    I'm writing a pandas DataFrame with a datetime field to Snowflake using pd_writer.
    code:
import datetime
import pandas as pd
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import pd_writer

engine = create_engine('<my awasome connection URL>')

df = pd.DataFrame([('awdawd', datetime.datetime.now()), ('awdawdawd', datetime.datetime.now() - datetime.timedelta(hours=345))], columns=['id', 'date_field'])
df.to_sql('oh_my_df', engine, index=False, method=pd_writer, if_exists='append')
  5. What did you expect to see?
    I'm expecting to see in the database the same datetime values as I see in the dataframe.

  6. What did you see instead?
    I see invalid date values instead.

This is a continuation of another issue: #600. I'm not sure if it is possible to solve this at the snowflake-connector level, but it looks like pandas >= 1.1.0, using Parquet v2.0, successfully stores timestamp values with ns resolution; however, when I tried to upload Parquet v2.0 files manually I got the same result.

It would be nice to find a way to work either with Parquet v2.0 or with nanosecond resolution.

github-actions bot changed the title "Pandas >= 1.1.0 timestamp support with ns resolution" to "SNOW-273899: Pandas >= 1.1.0 timestamp support with ns resolution" on Jan 29, 2021
daniel-sali commented Mar 25, 2021

Pandas 1.1.0+ writes Parquet files with microsecond resolution by default; you need to explicitly pass version="2.0" to write timestamps with nanosecond resolution (this limits the writable min/max timestamps due to the INT64 representation, I assume).
Example:
df.to_parquet(chunk_path, compression="snappy", index=index, version="2.0")
Source: https://github.com/pandas-dev/pandas/blob/7d9dd1a68334cd7a8ed19629c63e84713abc77ef/pandas/tests/io/test_parquet.py#L882
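A quick way to check which resolution actually landed in the file (a small sketch; it assumes pyarrow is installed and that chunk_path points at the Parquet file written above, with the date_field column from this thread's example):

```
import pyarrow.parquet as pq

chunk_path = "chunk.parquet"  # assumed path of the file written above

# The format version and the timestamp unit show whether nanoseconds survived.
print(pq.read_metadata(chunk_path).format_version)          # e.g. '2.0'
print(pq.read_schema(chunk_path).field("date_field").type)  # e.g. timestamp[ns]
```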

The write_pandas and pd_writer snowflake-connector-python functions should receive an optional argument controlling whether to use nanoseconds when writing the temporary Parquet files, which would set the version="2.0" flag in

chunk.to_parquet(chunk_path, compression=compression)

Additionally, the COPY INTO statement generation in

parquet_columns = "$1:" + ",$1:".join(df.columns)

needs to be updated to use TO_TIMESTAMP($1:"microsecond_tst_col"::INT,6) or TO_TIMESTAMP($1:"nanosecond_tst_col"::INT,9), depending on whether the version="2.0" flag was used to write the temporary Parquet file.
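A rough sketch of what that projection change could look like (a hypothetical helper, not connector code; which columns need which scale is assumed to be known by the caller):

```
def build_parquet_columns(df, ns_columns=(), us_columns=()):
    """Build the COPY INTO column projection, casting raw INT64 timestamp
    columns back to TIMESTAMP at the appropriate scale."""
    parts = []
    for col in df.columns:
        if col in ns_columns:
            parts.append(f'TO_TIMESTAMP($1:"{col}"::INT, 9)')   # nanoseconds
        elif col in us_columns:
            parts.append(f'TO_TIMESTAMP($1:"{col}"::INT, 6)')   # microseconds
        else:
            parts.append(f'$1:"{col}"')
    return ",".join(parts)
```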

plotneishestvo (Author)

Sorry, I was wrong in the description: it actually uses parquet version=1.0 by default. However, newer pandas with pyarrow manages to work fine with nanosecond time resolution, but Snowflake can't automatically cast the date fields, and as a result I see invalid date... Even when copying manually I'm using the flag to match columns by name. Using TO_TIMESTAMP($1:"nanosecond_tst_col"::INT,9) can help if I parse manually, but it would be nice to have automatic casting work correctly, since the column type in the table already has nanosecond resolution by default.

sfc-gh-mkeller (Collaborator)

> to use TO_TIMESTAMP($1:"microsecond_tst_col"::INT,6) or TO_TIMESTAMP($1:"nanosecond_tst_col"::INT,9) depending on whether the version="2.0" flag was used to write the temporary Parquet file.

@daniel-sali doing this automatically in the library seems error-prone and will likely lead to more issues in the future. Do you think that adding an option for the user to override the COPY INTO statement generated by our pandas_tools would be a good solution for this?
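Purely for illustration, such an override could be as small as letting the caller supply the SELECT projection; the parameter below is hypothetical and does not exist in snowflake-connector-python:

```
# Hypothetical usage sketch -- 'copy_into_columns' is NOT a real write_pandas
# parameter; it only illustrates the kind of override being discussed.
custom_projection = '$1:"id", TO_TIMESTAMP($1:"date_field"::INT, 9)'
# write_pandas(connection, df_ns, 'oh_my_df', 'demo_db', 'public',
#              copy_into_columns=custom_projection)
```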

daniel-sali

@sfc-gh-mkeller, we have actually implemented the COPY INTO statement generation logic in our connector library, which creates a connection to Snowflake acting as our data warehouse.
See https://github.com/nuvolos-cloud/python-connector/blob/master/nuvolos/sql_utils.py , entry point is the to_sql function.
The code is based on the write_pandas function and pandas' io/sql.py for DataFrame column type -> Snowflake column type mapping.
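For context, the DataFrame-dtype-to-Snowflake-type part of such a mapping might look roughly like this (a simplified sketch under my own assumptions, not the actual nuvolos or pandas code):

```
import numpy as np

def snowflake_type_for(dtype) -> str:
    # Simplified sketch: datetime64[ns] columns map to a nanosecond-capable
    # TIMESTAMP_NTZ(9); anything unrecognised falls back to TEXT.
    if np.issubdtype(dtype, np.datetime64):
        return "TIMESTAMP_NTZ(9)"
    if np.issubdtype(dtype, np.integer):
        return "NUMBER"
    if np.issubdtype(dtype, np.floating):
        return "FLOAT"
    if dtype == np.bool_:
        return "BOOLEAN"
    return "TEXT"
```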

plotneishestvo (Author)

Actually, I think this is a Snowflake database bug rather than a connector issue. I tried different ways of loading data into Snowflake with the latest available libraries in my JupyterLab, and it looks like pandas can handle ms timestamps now and writes them with the right datatype into Parquet files, but Snowflake reads them as an invalid date. In short, I think the issue is on the Snowflake side and I need to figure out how to report the bug. I will share all the details step by step in the next comment.

plotneishestvo (Author)

I created a new virtual environment and installed the latest packages from pip. Here you can see pip list output:

Package                    Version
-------------------------- ---------
anyio                      2.2.0
appnope                    0.1.2
argon2-cffi                20.1.0
asn1crypto                 1.4.0
async-generator            1.10
attrs                      20.3.0
azure-common               1.1.27
azure-core                 1.13.0
azure-storage-blob         12.8.0
Babel                      2.9.0
backcall                   0.2.0
bleach                     3.3.0
boto3                      1.17.45
botocore                   1.20.45
certifi                    2020.12.5
cffi                       1.14.5
chardet                    3.0.4
contextvars                2.4
cryptography               3.4.7
dataclasses                0.8
decorator                  5.0.5
defusedxml                 0.7.1
entrypoints                0.3
greenlet                   1.0.0
idna                       2.10
immutables                 0.15
importlib-metadata         3.10.0
ipykernel                  5.5.3
ipython                    7.16.1
ipython-genutils           0.2.0
isodate                    0.6.0
jedi                       0.17.2
Jinja2                     2.11.3
jmespath                   0.10.0
json5                      0.9.5
jsonschema                 3.2.0
jupyter-client             6.1.12
jupyter-core               4.7.1
jupyter-lsp                1.1.4
jupyter-packaging          0.7.12
jupyter-server             1.5.1
jupyterlab                 3.0.12
jupyterlab-lsp             3.5.0
jupyterlab-pygments        0.1.2
jupyterlab-server          2.4.0
MarkupSafe                 1.1.1
mistune                    0.8.4
msrest                     0.6.21
nbclassic                  0.2.6
nbclient                   0.5.3
nbconvert                  6.0.7
nbformat                   5.1.3
nest-asyncio               1.5.1
notebook                   6.3.0
numpy                      1.19.5
oauthlib                   3.1.0
oscrypto                   1.2.1
packaging                  20.9
pandas                     1.1.5
pandocfilters              1.4.3
parso                      0.7.1
pexpect                    4.8.0
pickleshare                0.7.5
pip                        21.0.1
pluggy                     0.13.1
prometheus-client          0.10.0
prompt-toolkit             3.0.18
ptyprocess                 0.7.0
pyarrow                    3.0.0
pycparser                  2.20
pycryptodomex              3.10.1
Pygments                   2.8.1
PyJWT                      2.0.1
pyOpenSSL                  19.1.0
pyparsing                  2.4.7
pyrsistent                 0.17.3
python-dateutil            2.8.1
python-jsonrpc-server      0.4.0
python-language-server     0.36.2
pytz                       2021.1
pyzmq                      22.0.3
requests                   2.25.1
requests-oauthlib          1.3.0
s3transfer                 0.3.6
Send2Trash                 1.5.0
setuptools                 39.0.1
six                        1.15.0
sniffio                    1.2.0
snowflake-connector-python 2.4.2
snowflake-sqlalchemy       1.2.4
SQLAlchemy                 1.4.6
terminado                  0.9.4
testpath                   0.4.4
tornado                    6.1
traitlets                  4.3.3
typing-extensions          3.7.4.3
ujson                      4.0.2
urllib3                    1.26.4
wcwidth                    0.2.5
webencodings               0.5.1
zipp                       3.4.1

Most importantly: snowflake-connector-python 2.4.2, pyarrow 3.0.0, pandas 1.1.5, and Python 3.6.5.

First, I wanted to check how the new pandas behaves with Parquet files, so I created two data frames: one with datetime objects, which have microsecond resolution, and one with pandas Timestamps, which have nanosecond resolution.

df = pd.DataFrame([('awdawd', datetime.datetime.now()), ('awdawdawd', datetime.datetime.now() - datetime.timedelta(hours=345))], columns=['id', 'date_field'])
# output:
# id (object)      date_field (datetime64[ns])
# awdawd           2021-04-08 14:36:53.388976
# awdawdawd        2021-03-25 05:36:53.388981

# and with ns
df_ns = pd.DataFrame([('some_ns_dumm', pd.Timestamp(2021, 4, 4, 18, 15, 35, 345, 23)), ('more_ns_dummy', pd.Timestamp(2021, 4, 4, 18, 15, 35, 345, 23))], columns=['id', 'date_field'])

# output:
# id (object)      date_field (datetime64[ns])
# some_ns_dumm     2021-04-04 18:15:35.000345023
# more_ns_dummy    2021-04-04 18:15:35.000345023
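(A quick sanity check, assuming the same session, confirms the nanosecond component really is present in the frame itself:)

```
print(df_ns['date_field'].dtype)          # datetime64[ns]
print(df_ns['date_field'].dt.nanosecond)  # 23 for both rows
```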

Then I decided to store it as a Parquet file and look at the data and metadata:

import pyarrow.parquet as pq  # needed for the metadata/schema calls below

df.to_parquet('my_awasome_parquet_us_by_fact.parquet', compression='snappy')

pq.read_metadata('my_awasome_parquet_us_by_fact.parquet')

# output:
<pyarrow._parquet.FileMetaData object at 0x7ffe59282258>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 1933
# -----

pq.read_schema('my_awasome_parquet_us_by_fact.parquet')
# output:
id: string
  -- field metadata --
  PARQUET:field_id: '1'
date_field: timestamp[us]
  -- field metadata --
  PARQUET:field_id: '2'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 491
#----

table = pq.read_table('my_awasome_parquet_us_by_fact.parquet')
table.columns
# output:
[<pyarrow.lib.ChunkedArray object at 0x7ffe5afb0888>
 [
   [
     "awdawd",
     "awdawdawd"
   ]
 ],
 <pyarrow.lib.ChunkedArray object at 0x7ffe5afb0ba0>
 [
   [
     2021-04-08 14:36:53.388976,
     2021-03-25 05:36:53.388981
   ]
 ]]

As we can see, the current pandas library manages to write a Parquet file with microsecond resolution automatically: timestamp[us]. So it looks like the issue described in #600 is solved, but it is not; I will show that later. First, let's try to do the same with the other data frame:

df_ns.to_parquet('my_awasome_parquet_ns_by_fact.parquet', compression='snappy')  # this one with values that have nanoseconds

# output
ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1617560135000345023

So, we got an error saying it can't handle nanoseconds; so far it looks okay-ish. Now I can upload the first, successfully written Parquet file manually:

with engine.begin() as conn:
    conn.execute('create temporary stage demo_db.public.parquet_test_ffr34')
    conn.execute('PUT file:///some long path/workspace/my_awasome_parquet_us_by_fact.parquet @parquet_test_ffr34')
    conn.execute("""COPY INTO oh_my_df 
                FROM @parquet_test_ffr34
                FILE_FORMAT=(TYPE = PARQUET)
                MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE 
                FORCE=TRUE;""")
    conn.execute('drop stage demo_db.public.parquet_test_ffr34')

And I can see only invalid dates as values in Snowflake!
(Screenshot 2021-04-08 at 14 33 50)

Then I tried to upload data using the snowflake-connector methods, and the result was the same: invalid date.

I tried two different methods with the same DataFrame that contains only datetime objects with microsecond resolution:

write_pandas(connection, df, 'oh_my_df', 'demo_db', 'public')  # result invalid dates
df.to_sql('oh_my_df', engine, index=False, method=pd_writer, if_exists='append') # result invalid dates

As expected, when I tried to write the DataFrame with nanosecond values it failed with the familiar exception:

write_pandas(connection, df_ns, 'oh_my_df', 'demo_db', 'public')
# output
ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1617560135000345023

However, I was surprised that another method worked:

df_ns.to_sql('oh_my_df', engine, index=False, method=pd_writer, if_exists='append')  # result -  invalid date

As a result I got invalid date in the database, but the call itself worked; I have no idea how 🤷
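As an aside, a workaround sketch (assuming losing sub-microsecond precision is acceptable): truncating the column before writing avoids the ArrowInvalid error entirely:

```
# Drop everything below microseconds so pyarrow can write timestamp[us]
df_ns_us = df_ns.copy()
df_ns_us['date_field'] = df_ns_us['date_field'].dt.floor('us')
df_ns_us.to_parquet('my_awasome_parquet_ns_floored.parquet', compression='snappy')
```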

My next move was to try to use parquet version 2.0:

df.to_parquet('my_awasome_parquet_us_by_fact_2.parquet', compression='snappy', engine='pyarrow', version='2.0')
df_ns.to_parquet('my_awasome_parquet_ns_by_fact_2.parquet', compression='snappy', engine='pyarrow', version='2.0')

the parquet data:

file with microsecond values:
<pyarrow._parquet.FileMetaData object at 0x7ffe5b49de08>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 2.0
  serialized_size: 1931

columns:
id: string
  -- field metadata --
  PARQUET:field_id: '1'
date_field: timestamp[ns]
  -- field metadata --
  PARQUET:field_id: '2'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 491

data:
[<pyarrow.lib.ChunkedArray object at 0x7ffe5b49de60>
 [
   [
     "awdawd",
     "awdawdawd"
   ]
 ],
 <pyarrow.lib.ChunkedArray object at 0x7ffe5b49deb8>
 [
   [
     2021-04-08 14:36:53.388976000,
     2021-03-25 05:36:53.388981000
   ]
 ]]

file with nanoseconds values:
<pyarrow._parquet.FileMetaData object at 0x7ffe5b49dfc0>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 2.0
  serialized_size: 1941

schema:
id: string
  -- field metadata --
  PARQUET:field_id: '1'
date_field: timestamp[ns]
  -- field metadata --
  PARQUET:field_id: '2'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 491

data:
[<pyarrow.lib.ChunkedArray object at 0x7ffe59282468>
 [
   [
     "some_ns_dumm",
     "more_ns_dummy"
   ]
 ],
 <pyarrow.lib.ChunkedArray object at 0x7ffe5b4a9048>
 [
   [
     2021-04-04 18:15:35.000345023,
     2021-04-04 18:15:35.000345023
   ]
 ]]

I've tried to upload v2.0 files, for example by doing this:

with engine.begin() as conn:
    conn.execute('create temporary stage demo_db.public.parquet_test_ffr36')
    conn.execute('PUT file:///long long path/workspace/my_awasome_parquet_ns_by_fact_2.parquet @parquet_test_ffr36')
    conn.execute("""COPY INTO oh_my_df 
                FROM @parquet_test_ffr36
                FILE_FORMAT=(TYPE = PARQUET)
                MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE 
                FORCE=TRUE;""")
    conn.execute('drop stage demo_db.public.parquet_test_ffr36')

but got invalid date as usual; however, manual casting helps:

with engine.begin() as conn:
    conn.execute('create temporary stage demo_db.public.parquet_test_ffr36')
    conn.execute('PUT file://long long path/workspace/my_awasome_parquet_ns_by_fact_2.parquet @parquet_test_ffr36')
    conn.execute("""COPY INTO oh_my_df(id, date_field)
                FROM (SELECT $1:id::string, TO_TIMESTAMP($1:date_field::int, 9) FROM @parquet_test_ffr36)
                FILE_FORMAT=(TYPE = PARQUET)
                FORCE=TRUE;""")
    conn.execute('drop stage demo_db.public.parquet_test_ffr36')

And this was a successful try: I got actual dates in the database, but the values had no nanoseconds; the nanoseconds were truncated, even though the data type is supposed to support them!
(Screenshot 2021-04-08 at 17 55 00)

So I decided to play with all my attempts some more and cast all the date values in Snowflake to strings:

SELECT date_field::string FROM "DEMO_DB"."PUBLIC"."OH_MY_DF";

and this is what I got:
(Screenshot 2021-04-08 at 17 57 48)

So, I would like to be able to use Parquet 2.0 in snowflake-connector-python; however, in my opinion the bigger issue is on the Snowflake database side. Now I need to find a place where I can submit an issue 😅

sfc-gh-mkeller (Collaborator)

@sfc-gh-yuliu is parquet 2.0 on our radar yet?

sfc-gh-yuliu

No, we don't have any plan to update to Parquet 2.0 yet.

sfc-gh-spanaite commented Apr 23, 2021

To clarify @plotneishestvo's test:

The invalid date issue is not caused by the connector. Parquet files store the timestamp with ns values as an integer, in the form 1619025787640560.
When it comes into Snowflake, a datetime field can't match an integer unless we cast it or treat it as a string. This test shows it from the Snowflake UI:

SELECT 1619025787640560::datetime AS invalid_date, '1619025787640560'::datetime AS valid_date;

We get:

(Screenshot 2021-04-23 at 12 35 39)
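A related sketch building on the same point (using the SQLAlchemy engine from the earlier snippets): telling Snowflake the integer's scale explicitly, as the manual COPY INTO above does with TO_TIMESTAMP, converts the value correctly:

```
with engine.begin() as conn:
    # Scale 6 tells Snowflake the integer is in microseconds since the epoch,
    # so it converts to a proper timestamp instead of an invalid date.
    row = conn.execute("SELECT TO_TIMESTAMP(1619025787640560, 6) AS ts_us").fetchone()
    print(row)
```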

iamontheinet (Member)

Can we close this issue?

sfc-gh-aling (Collaborator)

To clean up and re-prioritize bugs and feature requests, we are closing all issues older than 6 months as of March 1, 2023. If there are any issues or feature requests that you would like us to address, please re-create them. For urgent issues, opening a support case via the Snowflake Community link is the fastest way to get a response.
