SNOW-889573: write_pandas incorrectly writes timestamp_tz #1687
Comments
I cannot seem to reproduce this using the same script. This is what I got:
What is also curious is that my returned result is in UTC, so I wonder whether there is a different setting regarding the timezone?
Pretty sure I found the issue. A Parquet file converts datetimes to UTC and stores them as TIMESTAMP_NTZ (though I don't know whether this is always the case); it does, however, save the timezone in the schema. When copying data from a stage into a table, Snowflake uses only the TIMESTAMP_NTZ from the Parquet file and does not use the timezone indicated in the schema. Snowflake then just casts the TIMESTAMP_NTZ to TIMESTAMP_TZ using the session's timezone.

Coming back to my initial post: the datetime I had was "2023-08-09 15:05:40.206711+02:00", where "2023-08-09 15:05:40.206711" is the local time and "+02:00" is the timezone offset. The Parquet file converts this to UTC and stores it as TIMESTAMP_NTZ: "2023-08-09 13:05:40.206711". Snowflake takes this value and simply casts it to the session timezone "+02:00" (as indicated by current_timestamp()). In your case @sfc-gh-sfan, the local time is "2023-08-09 18:34:47.424036", which the Parquet file converts to UTC: "2023-08-09 16:34:47.424036". Since your session timezone is UTC+0, you're not running into issues.

Unfortunately, Snowflake does not provide the tools to read the schema of a staged Parquet file, so as an intermediate fix I will henceforth set the session timezone to UTC+0. The following part of the code is concerned with this issue. I was thinking of submitting a hotfix via a pull request, but changed my mind since the problem lies with how Snowflake deals with Parquet files in general, which is outside the scope of this repo if I'm not mistaken.

snowflake-connector-python/src/snowflake/connector/pandas_tools.py Lines 364 to 386 in 9b586d4
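The offset arithmetic described above can be reproduced locally with only the standard library; a minimal sketch using the timestamps from this thread:

```python
from datetime import datetime, timezone, timedelta

tz = timezone(timedelta(hours=2))  # the session timezone, "+02:00"

# Original timezone-aware value written by the user.
original = datetime(2023, 8, 9, 15, 5, 40, 206711, tzinfo=tz)

# Parquet stores the wall-clock time converted to UTC, without the offset
# (TIMESTAMP_NTZ): this is the naive value Snowflake reads from the file.
stored_ntz = original.astimezone(timezone.utc).replace(tzinfo=None)
print(stored_ntz)  # 2023-08-09 13:05:40.206711

# Snowflake then casts the naive value to the session timezone, which
# re-attaches "+02:00" to the UTC wall-clock and shifts the instant.
fetched = stored_ntz.replace(tzinfo=tz)
print(fetched)              # 2023-08-09 13:05:40.206711+02:00
print(fetched == original)  # False: the fetched instant is two hours early
```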
Code that shows Parquet files store data as UTC without a timezone. Running the code below, which creates a pandas dataframe with a timezone-aware datetime, saves it as Parquet, and reads/prints the Parquet schema and data, outputs the following (note the "isAdjustedToUTC=true"):
Code that shows Snowflake uses the datetime without the timezone and just casts it to the session's timezone. Running the code below, which creates a pandas dataframe, stores it as Parquet, creates a temporary table and stage, puts the Parquet file in the stage, copies it into the table, and prints the result, outputs the following:
Good to know that you have a workaround. I hope this will be resolved after we support `use_logical_type`.
`use_logical_type` is a new file format option in Snowflake: a Boolean that specifies whether Snowflake interprets Parquet logical types during data loading. The default behavior of `write_pandas` is unchanged. When users write a dataframe that contains timezone-aware datetimes and do not pass `use_logical_type=True` as an argument, a warning is raised (see snowflakedb#1687). Providing this option also fixes issue snowflakedb#1687.
@sfc-gh-achandrasekaran, could you consider reopening the issue? The fix on Snowflake's server side has not yet resolved this issue with `write_pandas`.
* Add support for `use_logical_type` in `write_pandas`. `use_logical_type` is a new file format option in Snowflake: a Boolean that specifies whether Snowflake interprets Parquet logical types during data loading. The default behavior of `write_pandas` is unchanged. When users write a dataframe that contains timezone-aware datetimes and do not pass `use_logical_type=True` as an argument, a warning is raised (see #1687). Providing this option also fixes issue #1687.
* FIX: removed pandas import and used descriptive naming over concise naming for `is_datetime64tz_dtype`. STYLE: if statement to idiomatic form. STYLE: broke the `copy_into_sql` command into multiple lines, with each file format argument on a separate line.
* STYLE: rearranged imports in test_pandas_tools.py
* REFAC: utilized the 'equal sign specifier' in an f-string for an improved `use_logical_type` warning
* Changelog updates

Co-authored-by: Dennis Van de Vorst <87502756+dvorst@users.noreply.github.com>
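A sketch of what the option changes in the generated COPY INTO statement. The helper below is hypothetical (it only mirrors the idea of threading a file format option into the SQL, and is not the connector's verbatim output); the table and stage names are illustrative:

```python
# Hypothetical helper sketching how a file format option such as
# USE_LOGICAL_TYPE can be threaded into a COPY INTO statement.
def build_copy_sql(table_name: str, stage_name: str, use_logical_type: bool) -> str:
    logical = "TRUE" if use_logical_type else "FALSE"
    return (
        f"COPY INTO {table_name} "
        f"FROM @{stage_name} "
        f"FILE_FORMAT=(TYPE=PARQUET USE_LOGICAL_TYPE={logical})"
    )

sql = build_copy_sql("MY_TABLE", "MY_STAGE", use_logical_type=True)
print(sql)
# COPY INTO MY_TABLE FROM @MY_STAGE FILE_FORMAT=(TYPE=PARQUET USE_LOGICAL_TYPE=TRUE)
```

With the merged change, callers would opt in at the `write_pandas` call site, e.g. `write_pandas(conn, df, "MY_TABLE", use_logical_type=True)`, so that Snowflake honors the Parquet timestamp logical type instead of treating the stored UTC value as TIMESTAMP_NTZ.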
Merged through #1720
Python version
3.11.3
Operating system and processor architecture
MacBook Pro (Apple M2)
Installed packages
What did you do?
Created a pandas dataframe that contains one timestamp, wrote it to Snowflake, fetched it, and compared it to the initial value. The fetched value is incorrect: it is as if the time were converted to UTC+0 with the timezone offset left unaltered.
2023-08-09 15:05:40.206711+02:00 -> Datetime generated in python
2023-08-09 13:05:40.206711+02:00 -> Incorrect timestamp after writing to / fetching from a temporary table
2023-08-09 15:05:42.625000+02:00 -> Current Timestamp on Snowflake
Snowflake version: (3, 0, 3, None)
Python version: 3.11.3 (main, Apr 8 2023, 02:16:51) [Clang 14.0.0 (clang-1400.0.29.202)]
pandas version: 1.5.2