Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-889573: write_pandas incorrectly writes timestamp_tz #1687

Closed
dvorst opened this issue Aug 9, 2023 · 6 comments
Closed

SNOW-889573: write_pandas incorrectly writes timestamp_tz #1687

dvorst opened this issue Aug 9, 2023 · 6 comments

Comments

@dvorst
Copy link
Contributor

dvorst commented Aug 9, 2023

Python version

3.11.3

Operating system and processor architecture

Macbook Pro M2

Installed packages

pandas
snowflake
pytz

What did you do?

Created a pandas dataframe that contains 1 timestamp, written it to snowflake, fetched it, and compared it to the initial value. The fetched value is incorrect, it's as if the time is corrected to UTC0 with the timezone left unaltered.

from snowflake.connector.pandas_tools import write_pandas
from snowflake import connector
import sys
import os
import pandas as pd
from datetime import datetime
import pytz

# snowflake connect configurations
snowflake_config = {
    "account": os.getenv("SF_ACCOUNT"),
    "user": os.getenv("SF_USER"),
    "role": os.getenv("SF_ROLE"),
    "password": os.getenv("SF_PASSWORD"),
    "warehouse": os.getenv("SF_WAREHOUSE"),
    "database": os.getenv("SF_DATABASE"),
    "schema": schema,
}

# create pandas dataframe with current timestamp
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})


# open connection
with connector.connect(**snowflake_config) as con:
    # create temporary table
    r = con.cursor().execute("create or replace table tmp1 (dt timestamp_tz)").fetchall()

    # write dataframe to it and immediately fetch it
    r = write_pandas(
        conn=con,
        df=df,
        table_name="tmp1".upper(),
        auto_create_table=False,
        create_temp_table=False,
    )
    r1 = con.cursor().execute("select * from tmp1").fetchall()

    # fetch current timestamp on snowflake server
    r2 = con.cursor().execute("select current_timestamp()").fetchall()

# display results
print(f"{str(df.DT[0])}  -> Datetime generated in python")
print(f"{r1[0][0]}  -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]}  -> Current Timestamp on Snowflake")
print(f"Snowflake version: {connector.VERSION}")
print("Python version:", sys.version)
print("pandas version:", pd.__version__)


### What did you expect to see?

Output is shown below, the second line should display the time 15:05, yet it displays the UTC0 time 13:05.

2023-08-09 15:05:40.206711+02:00 -> Datetime generated in python
2023-08-09 13:05:40.206711+02:00 -> Incorrect timestamp after writing/fetching to temporary table
2023-08-09 15:05:42.625000+02:00 -> Current Timestamp on Snowflake
Snowflake version: (3, 0, 3, None)
Python version: 3.11.3 (main, Apr 8 2023, 02:16:51) [Clang 14.0.0 (clang-1400.0.29.202)]
pandas version: 1.5.2


### Can you set logging to DEBUG and collect the logs?

```bash
import logging
import os

for logger_name in ('snowflake.connector',):
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)
@github-actions github-actions bot changed the title write_pandas incorrectly writes timestamp_tz SNOW-889573: write_pandas incorrectly writes timestamp_tz Aug 9, 2023
@sfc-gh-sfan
Copy link
Contributor

I cannot seem to repro this using the same script. This is what I got:

2023-08-09 18:34:47.424036+02:00  -> Datetime generated in python
2023-08-09 16:34:47.424036+00:00  -> Incorrect timestamp after writing/fetching to temporary table
2023-08-09 16:34:50.371000+00:00  -> Current Timestamp on Snowflake
Snowflake version: (3, 1, 0, None)
Python version: 3.8.16 (default, Dec  7 2022, 01:36:11) 
[Clang 14.0.0 (clang-1400.0.29.202)]
pandas version: 1.5.3

@sfc-gh-sfan
Copy link
Contributor

What is also curious is that my return result is UTC. So I wonder if there are some different setting regarding timezone?

@dvorst
Copy link
Contributor Author

dvorst commented Aug 9, 2023

Pretty sure I found the issue.

A parquet file converts datetimes to UTC and stores it as TIMESTAMP_NTZ (though I don't know if this is always the case), it does however, save the timezone in the schema. When copying data from a stage into a table, Snowflake uses only the TIMESTAMP_NTZ from the parquet file and does not use the timezone indicated in the schema. Snowflake then just casts the TIMESTAMP_NTZ to TIMESTAMP_TZ using the sessions timezone.

Coming back to my initial post, the datetime I had was "2023-08-09 15:05:40.206711+02:00", here "2023-08-09 15:05:40.206711" is the local time and "+02:00" is the timezone. A parquet file converts this to UTC and stores it as TIMESTAMP_NTZ: "2023-08-09 13:05:40.206711". Snowflake uses this value and just casts it to the session timezone: "+2.00" (as indicated by current_timestamp().

In your case @sfc-gh-sfan, the local time is "2023-08-09 18:34:47.424036", the parquet file converts this to UTC "2023-08-09 16:34:47.424036". Since the session timezone is UTC0, you're not running into issues.

Unfortunately, Snowflake does not provide the tools to read the schema of a staged parquet file. So as a intermediate fix I will henceforth change the session timezone to UTC0.

The following part of the code is concerned with this issue, I was thinking of submitting a hotfix by a pull request, but changed my mind since the problem lies with how Snowflake deals with Parquet files in general. Which is outside the scope of this git repo if I'm not mistaken.

parquet_columns = "$1:" + ",$1:".join(
f"{quote}{snowflake_col}{quote}::{column_type_mapping[col]}"
for snowflake_col, col in zip(snowflake_column_names, df.columns)
)
else:
target_table_location = build_location_helper(
database=database,
schema=schema,
name=table_name,
quote_identifiers=quote_identifiers,
)
parquet_columns = "$1:" + ",$1:".join(
f"{quote}{snowflake_col}{quote}" for snowflake_col in snowflake_column_names
)
try:
copy_into_sql = (
f"COPY INTO {target_table_location} /* Python:snowflake.connector.pandas_tools.write_pandas() */ "
f"({columns}) "
f"FROM (SELECT {parquet_columns} FROM @{stage_location}) "
f"FILE_FORMAT=(TYPE=PARQUET COMPRESSION={compression_map[compression]}{' BINARY_AS_TEXT=FALSE' if auto_create_table or overwrite else ''}) "
f"PURGE=TRUE ON_ERROR={on_error}"
)

Code that shows Parquet files store data as UTC without timezone

Running the code below, which creates a pandas dataframe with timezone, saves it as a parquet, and reads/prints the parquet schema and data:

import pyarrow.parquet as pq

pth = 'tmp.parquet'
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})
df.to_parquet(pth)
f = pq.ParquetFile(pth)

print(df.DT[0])
print('-'*150)
print(f.schema)
print('-'*150)
print(pq.ParquetFile(pth).read_row_group(0))

outputs the following, note the "isAdjustedToUTC=true":

2023-08-09 21:20:46.225089+02:00
------------------------------------------------------------------------------------------------------------------------------------------------------
<pyarrow._parquet.ParquetSchema object at 0x1620044c0>
required group field_id=-1 schema {
  optional int64 field_id=-1 DT (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}

------------------------------------------------------------------------------------------------------------------------------------------------------
pyarrow.Table
DT: timestamp[us, tz=Europe/Amsterdam]
----
DT: [[2023-08-09 19:20:46.225089]]

Code that shows Snowflake uses the datetime without timezone and just casts it to the sessions' timezone:

When running the code below, which creates a pandas dataframe, stores it as parquet, then creates a temporary table and stage, puts the parquet file in the stage, copies it in the table, and prints the result:

from pathlib import Path
import pandas as pd

# create pandas dataframe, save it as parquet and print its value
pth = str(Path(f"tmp.parquet").absolute())
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})
df.to_parquet(pth)
print(df.DT[0], " original datetime")

with connector.connect(**snowflake_config) as con:
    # create table
    r = (
        con.cursor()
        .execute("create or replace temporary table tmp1 (dt timestamp_tz)")
        .fetchall()
    )
    # create temp stage
    r = (
        con.cursor()
        .execute("CREATE or replace STAGE tmp_stage file_format=(type=PARQUET)")
        .fetchall()
    )
    # put parquet file into stage
    r = con.cursor().execute(f"PUT 'file://{pth}' @tmp_stage").fetchall()
    # read the staged file
    r = con.cursor().execute(f"""select * from @tmp_stage""").fetchall()
    print(
        "The following is read from the staged file, note how it does not contain a timezone:\n",
        r[0][0],
    )
    # change the timezone to Asia
    r = (
        con.cursor()
        .execute("""ALTER SESSION SET TIMEZONE = 'Asia/Shanghai'""")
        .fetchall()
    )
    r = con.cursor().execute("""select CURRENT_TIMESTAMP()""").fetchall()
    print(r[0][0], " session timezone is changed to Changhai to show that Snowflake just casts the variable to the session timezone")

    # copy into table
    r = (
        con.cursor()
        .execute(
            """copy into tmp1 (dt) from (select $1:"DT" from @tmp_stage) file_format=(type=PARQUET)"""
        )
        .fetchall()
    )
    # read table
    r = con.cursor().execute(f"""select DT from tmp1""").fetchall()
    print(r[0][0], " datetime read from table, it contains the TIMESTAMP_NTZ value + session timezone")

Outputs the following:

2023-08-09 21:19:53.734963+02:00  original datetime
The following is read from the staged file, note how it does not contain a timezone:
 {
  "DT": "2023-08-09 19:19:53.734"
}
2023-08-10 03:19:55.879000+08:00  session timezone is changed to Changhai to show that Snowflake just casts the variable to the session timezone
2023-08-09 19:19:53.734963+08:00  datetime read from table, it contains the TIMESTAMP_NTZ value + session timezone

@sfc-gh-sfan
Copy link
Contributor

Good to know that you have a workaround. I hope that after we support use_logical_type this can also be resolved.

dvorst added a commit to dvorst/snowflake-connector-python that referenced this issue Aug 22, 2023
use_logical_type is a new file format option of Snowflake.
It is a Boolean that specifies whether Snowflake interprets Parquet logical types during data loading.
The default behavior of write_pandas is unchanged.
When users write a dataframe that contains datetimes with timezones and do not pass use_logical_type = True as an argument, a warning is raised (see snowflakedb#1687).
Providing this option also fixes issue snowflakedb#1687
dvorst added a commit to dvorst/snowflake-connector-python that referenced this issue Aug 22, 2023
use_logical_type is a new file format option of Snowflake.
It is a Boolean that specifies whether Snowflake interprets Parquet logical types during data loading.
The default behavior of write_pandas is unchanged.
When users write a dataframe that contains datetimes with timezones and do not pass use_logical_type = True as an argument, a warning is raised (see snowflakedb#1687).
Providing this option also fixes issue snowflakedb#1687
@dvorst
Copy link
Contributor Author

dvorst commented Aug 23, 2023

@sfc-gh-achandrasekaran, could you consider reopening the issue? Because the fix on Snowflakes' server side has not resolved this issue with write_pandas yet.

sfc-gh-aalam added a commit that referenced this issue Oct 19, 2023
* Add support for use_logical_type in write_pandas.

use_logical_type is a new file format option of Snowflake.
It is a Boolean that specifies whether Snowflake interprets Parquet logical types during data loading.
The default behavior of write_pandas is unchanged.
When users write a dataframe that contains datetimes with timezones and do not pass use_logical_type = True as an argument, a warning is raised (see #1687).
Providing this option also fixes issue #1687

* FIX: removed pandas import and used descriptive
naming over concise naming for is_datetime64tz_dtype.

STYLE: if statement to idiomatic form.

STYLE: broke copy_into_sql command into multiple lines,
with each file_format argument on a separate line.

* STYLE rearranged imports test_pandas_tools.py

* REFAC: Utilized 'equal sign specifier' in f-string for
improved use_logical_type warning

* changelog updates

---------

Co-authored-by: Dennis Van de Vorst <87502756+dvorst@users.noreply.github.com>
@sfc-gh-mkeller
Copy link
Collaborator

Merged through #1720

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants