SNOW-889573: write_pandas incorrectly writes timestamp_tz #1687

dvorst · 2023-08-09T13:10:07Z

Python version

3.11.3

Operating system and processor architecture

Macbook Pro M2

Installed packages

pandas
snowflake
pytz

What did you do?

Created a pandas dataframe that contains 1 timestamp, written it to snowflake, fetched it, and compared it to the initial value. The fetched value is incorrect, it's as if the time is corrected to UTC0 with the timezone left unaltered.

from snowflake.connector.pandas_tools import write_pandas
from snowflake import connector
import sys
import os
import pandas as pd
from datetime import datetime
import pytz

# snowflake connect configurations
snowflake_config = {
    "account": os.getenv("SF_ACCOUNT"),
    "user": os.getenv("SF_USER"),
    "role": os.getenv("SF_ROLE"),
    "password": os.getenv("SF_PASSWORD"),
    "warehouse": os.getenv("SF_WAREHOUSE"),
    "database": os.getenv("SF_DATABASE"),
    "schema": schema,
}

# create pandas dataframe with current timestamp
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})


# open connection
with connector.connect(**snowflake_config) as con:
    # create temporary table
    r = con.cursor().execute("create or replace table tmp1 (dt timestamp_tz)").fetchall()

    # write dataframe to it and immediately fetch it
    r = write_pandas(
        conn=con,
        df=df,
        table_name="tmp1".upper(),
        auto_create_table=False,
        create_temp_table=False,
    )
    r1 = con.cursor().execute("select * from tmp1").fetchall()

    # fetch current timestamp on snowflake server
    r2 = con.cursor().execute("select current_timestamp()").fetchall()

# display results
print(f"{str(df.DT[0])}  -> Datetime generated in python")
print(f"{r1[0][0]}  -> Incorrect timestamp after writing/fetching to temporary table")
print(f"{r2[0][0]}  -> Current Timestamp on Snowflake")
print(f"Snowflake version: {connector.VERSION}")
print("Python version:", sys.version)
print("pandas version:", pd.__version__)



### What did you expect to see?

Output is shown below, the second line should display the time 15:05, yet it displays the UTC0 time 13:05.

2023-08-09 15:05:40.206711+02:00 -> Datetime generated in python
2023-08-09 13:05:40.206711+02:00 -> Incorrect timestamp after writing/fetching to temporary table
2023-08-09 15:05:42.625000+02:00 -> Current Timestamp on Snowflake
Snowflake version: (3, 0, 3, None)
Python version: 3.11.3 (main, Apr 8 2023, 02:16:51) [Clang 14.0.0 (clang-1400.0.29.202)]
pandas version: 1.5.2


### Can you set logging to DEBUG and collect the logs?

```bash
import logging
import os

for logger_name in ('snowflake.connector',):
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)

The text was updated successfully, but these errors were encountered:

sfc-gh-sfan · 2023-08-09T16:36:15Z

I cannot seem to repro this using the same script. This is what I got:

2023-08-09 18:34:47.424036+02:00  -> Datetime generated in python
2023-08-09 16:34:47.424036+00:00  -> Incorrect timestamp after writing/fetching to temporary table
2023-08-09 16:34:50.371000+00:00  -> Current Timestamp on Snowflake
Snowflake version: (3, 1, 0, None)
Python version: 3.8.16 (default, Dec  7 2022, 01:36:11) 
[Clang 14.0.0 (clang-1400.0.29.202)]
pandas version: 1.5.3

sfc-gh-sfan · 2023-08-09T16:37:32Z

What is also curious is that my return result is UTC. So I wonder if there are some different setting regarding timezone?

dvorst · 2023-08-09T19:33:49Z

Pretty sure I found the issue.

A parquet file converts datetimes to UTC and stores it as TIMESTAMP_NTZ (though I don't know if this is always the case), it does however, save the timezone in the schema. When copying data from a stage into a table, Snowflake uses only the TIMESTAMP_NTZ from the parquet file and does not use the timezone indicated in the schema. Snowflake then just casts the TIMESTAMP_NTZ to TIMESTAMP_TZ using the sessions timezone.

Coming back to my initial post, the datetime I had was "2023-08-09 15:05:40.206711+02:00", here "2023-08-09 15:05:40.206711" is the local time and "+02:00" is the timezone. A parquet file converts this to UTC and stores it as TIMESTAMP_NTZ: "2023-08-09 13:05:40.206711". Snowflake uses this value and just casts it to the session timezone: "+2.00" (as indicated by current_timestamp().

In your case @sfc-gh-sfan, the local time is "2023-08-09 18:34:47.424036", the parquet file converts this to UTC "2023-08-09 16:34:47.424036". Since the session timezone is UTC0, you're not running into issues.

Unfortunately, Snowflake does not provide the tools to read the schema of a staged parquet file. So as a intermediate fix I will henceforth change the session timezone to UTC0.

The following part of the code is concerned with this issue, I was thinking of submitting a hotfix by a pull request, but changed my mind since the problem lies with how Snowflake deals with Parquet files in general. Which is outside the scope of this git repo if I'm not mistaken.

snowflake-connector-python/src/snowflake/connector/pandas_tools.py

Lines 364 to 386 in 9b586d4

    
               parquet_columns = "$1:" + ",$1:".join( 
        
                   f"{quote}{snowflake_col}{quote}::{column_type_mapping[col]}" 
        
                   for snowflake_col, col in zip(snowflake_column_names, df.columns) 
        
               ) 
        
           else: 
        
               target_table_location = build_location_helper( 
        
                   database=database, 
        
                   schema=schema, 
        
                   name=table_name, 
        
                   quote_identifiers=quote_identifiers, 
        
               ) 
        
               parquet_columns = "$1:" + ",$1:".join( 
        
                   f"{quote}{snowflake_col}{quote}" for snowflake_col in snowflake_column_names 
        
               ) 
        
           try: 
        
               copy_into_sql = ( 
        
                   f"COPY INTO {target_table_location} /* Python:snowflake.connector.pandas_tools.write_pandas() */ " 
        
                   f"({columns}) " 
        
                   f"FROM (SELECT {parquet_columns} FROM @{stage_location}) " 
        
                   f"FILE_FORMAT=(TYPE=PARQUET COMPRESSION={compression_map[compression]}{' BINARY_AS_TEXT=FALSE' if auto_create_table or overwrite else ''}) " 
        
                   f"PURGE=TRUE ON_ERROR={on_error}" 
        
               )

Code that shows Parquet files store data as UTC without timezone

Running the code below, which creates a pandas dataframe with timezone, saves it as a parquet, and reads/prints the parquet schema and data:

import pyarrow.parquet as pq

pth = 'tmp.parquet'
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})
df.to_parquet(pth)
f = pq.ParquetFile(pth)

print(df.DT[0])
print('-'*150)
print(f.schema)
print('-'*150)
print(pq.ParquetFile(pth).read_row_group(0))

outputs the following, note the "isAdjustedToUTC=true":

2023-08-09 21:20:46.225089+02:00
------------------------------------------------------------------------------------------------------------------------------------------------------
<pyarrow._parquet.ParquetSchema object at 0x1620044c0>
required group field_id=-1 schema {
  optional int64 field_id=-1 DT (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}

------------------------------------------------------------------------------------------------------------------------------------------------------
pyarrow.Table
DT: timestamp[us, tz=Europe/Amsterdam]
----
DT: [[2023-08-09 19:20:46.225089]]

Code that shows Snowflake uses the datetime without timezone and just casts it to the sessions' timezone:

When running the code below, which creates a pandas dataframe, stores it as parquet, then creates a temporary table and stage, puts the parquet file in the stage, copies it in the table, and prints the result:

from pathlib import Path
import pandas as pd

# create pandas dataframe, save it as parquet and print its value
pth = str(Path(f"tmp.parquet").absolute())
df = pd.DataFrame({"DT": [datetime.now(tz=pytz.timezone("Europe/Amsterdam"))]})
df.to_parquet(pth)
print(df.DT[0], " original datetime")

with connector.connect(**snowflake_config) as con:
    # create table
    r = (
        con.cursor()
        .execute("create or replace temporary table tmp1 (dt timestamp_tz)")
        .fetchall()
    )
    # create temp stage
    r = (
        con.cursor()
        .execute("CREATE or replace STAGE tmp_stage file_format=(type=PARQUET)")
        .fetchall()
    )
    # put parquet file into stage
    r = con.cursor().execute(f"PUT 'file://{pth}' @tmp_stage").fetchall()
    # read the staged file
    r = con.cursor().execute(f"""select * from @tmp_stage""").fetchall()
    print(
        "The following is read from the staged file, note how it does not contain a timezone:\n",
        r[0][0],
    )
    # change the timezone to Asia
    r = (
        con.cursor()
        .execute("""ALTER SESSION SET TIMEZONE = 'Asia/Shanghai'""")
        .fetchall()
    )
    r = con.cursor().execute("""select CURRENT_TIMESTAMP()""").fetchall()
    print(r[0][0], " session timezone is changed to Changhai to show that Snowflake just casts the variable to the session timezone")

    # copy into table
    r = (
        con.cursor()
        .execute(
            """copy into tmp1 (dt) from (select $1:"DT" from @tmp_stage) file_format=(type=PARQUET)"""
        )
        .fetchall()
    )
    # read table
    r = con.cursor().execute(f"""select DT from tmp1""").fetchall()
    print(r[0][0], " datetime read from table, it contains the TIMESTAMP_NTZ value + session timezone")

Outputs the following:

2023-08-09 21:19:53.734963+02:00  original datetime
The following is read from the staged file, note how it does not contain a timezone:
 {
  "DT": "2023-08-09 19:19:53.734"
}
2023-08-10 03:19:55.879000+08:00  session timezone is changed to Changhai to show that Snowflake just casts the variable to the session timezone
2023-08-09 19:19:53.734963+08:00  datetime read from table, it contains the TIMESTAMP_NTZ value + session timezone

sfc-gh-sfan · 2023-08-10T18:52:29Z

Good to know that you have a workaround. I hope that after we support use_logical_type this can also be resolved.

use_logical_type is a new file format option of Snowflake. It is a Boolean that specifies whether Snowflake interprets Parquet logical types during data loading. The default behavior of write_pandas is unchanged. When users write a dataframe that contains datetimes with timezones and do not pass use_logical_type = True as an argument, a warning is raised (see snowflakedb#1687). Providing this option also fixes issue snowflakedb#1687

dvorst · 2023-08-23T18:51:09Z

@sfc-gh-achandrasekaran, could you consider reopening the issue? Because the fix on Snowflakes' server side has not resolved this issue with write_pandas yet.

* Add support for use_logical_type in write_pandas. use_logical_type is a new file format option of Snowflake. It is a Boolean that specifies whether Snowflake interprets Parquet logical types during data loading. The default behavior of write_pandas is unchanged. When users write a dataframe that contains datetimes with timezones and do not pass use_logical_type = True as an argument, a warning is raised (see #1687). Providing this option also fixes issue #1687 * FIX: removed pandas import and used descriptive naming over concise naming for is_datetime64tz_dtype. STYLE: if statement to idiomatic form. STYLE: broke copy_into_sql command into multiple lines, with each file_format argument on a separate line. * STYLE rearranged imports test_pandas_tools.py * REFAC: Utilized 'equal sign specifier' in f-string for improved use_logical_type warning * changelog updates --------- Co-authored-by: Dennis Van de Vorst <87502756+dvorst@users.noreply.github.com>

sfc-gh-mkeller · 2023-10-31T23:50:48Z

Merged through #1720

dvorst added bug needs triage labels Aug 9, 2023

github-actions bot changed the title ~~write_pandas incorrectly writes timestamp_tz~~ SNOW-889573: write_pandas incorrectly writes timestamp_tz Aug 9, 2023

dvorst mentioned this issue Aug 9, 2023

SNOW-889806: Timezone is discarded when reading from Parquet files #1688

Closed

sfc-gh-achandrasekaran closed this as completed Aug 21, 2023

dvorst mentioned this issue Aug 22, 2023

Add support for use_logical_type in write_pandas. #1706

Closed

7 tasks

dvorst mentioned this issue Aug 22, 2023

Add support for use_logical_type in write_pandas. #1707

Closed

7 tasks

sfc-gh-achandrasekaran reopened this Aug 23, 2023

sfc-gh-mkeller closed this as completed Oct 31, 2023

sfc-gh-aalam mentioned this issue Nov 2, 2023

SNOW-259668: write_pandas fails for some datetime64 values using PyArrow backend #600

Closed

jamiedemaria mentioned this issue Jan 31, 2024

[dagster-snowflake] update to use write_pandas so that we can remove snowflake-python-connector pin dagster-io/dagster#18410

Merged

frederiksteiner mentioned this issue May 23, 2024

SNOW-1444940: Write to pandas with datetime with timezone has no timezone in the resulting dataframe #1952

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SNOW-889573: write_pandas incorrectly writes timestamp_tz #1687

SNOW-889573: write_pandas incorrectly writes timestamp_tz #1687

dvorst commented Aug 9, 2023 •

edited

Loading

sfc-gh-sfan commented Aug 9, 2023

sfc-gh-sfan commented Aug 9, 2023

dvorst commented Aug 9, 2023

sfc-gh-sfan commented Aug 10, 2023

dvorst commented Aug 23, 2023

sfc-gh-mkeller commented Oct 31, 2023

SNOW-889573: write_pandas incorrectly writes timestamp_tz #1687

SNOW-889573: write_pandas incorrectly writes timestamp_tz #1687

Comments

dvorst commented Aug 9, 2023 • edited Loading

Python version

Operating system and processor architecture

Installed packages

What did you do?

sfc-gh-sfan commented Aug 9, 2023

sfc-gh-sfan commented Aug 9, 2023

dvorst commented Aug 9, 2023

Code that shows Parquet files store data as UTC without timezone

Code that shows Snowflake uses the datetime without timezone and just casts it to the sessions' timezone:

sfc-gh-sfan commented Aug 10, 2023

dvorst commented Aug 23, 2023

sfc-gh-mkeller commented Oct 31, 2023

dvorst commented Aug 9, 2023 •

edited

Loading