From e363a4264e8936d98c0dcf4c0bb53082f909e503 Mon Sep 17 00:00:00 2001
From: Jefffrey
Date: Sun, 31 Mar 2024 11:41:11 +1100
Subject: [PATCH] ORC-1671: update timestamp doc in specification

---
 site/_docs/types.md         | 12 +++++++-----
 site/specification/ORCv1.md | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/site/_docs/types.md b/site/_docs/types.md
index 41f7744672..ee34811c2d 100644
--- a/site/_docs/types.md
+++ b/site/_docs/types.md
@@ -69,7 +69,9 @@ create table Foobar (
 
 ORC includes two different forms of timestamps from the SQL world:
 
-* **Timestamp** is a date and time without a time zone, which does not change based on the time zone of the reader.
+* **Timestamp** is a date and time without a time zone, where the timestamp value is stored in the writer timezone
+encoded at the stripe level, if present. ORC readers will read this value back into the reader's timezone. Usually
+both writer and reader timezones default to UTC, however older ORC files may contain non-UTC writer timezones.
 * **Timestamp with local time zone** is a fixed instant in time, which does change based on the time zone of the reader.
 
 Unless your application uses UTC consistently, **timestamp with
@@ -78,7 +80,7 @@ use cases. When users say an event is at 10:00, it is always in
 reference to a certain timezone and means a point in time, rather than
 10:00 in an arbitrary time zone.
 
-| Type | Value in America/Los_Angeles | Value in America/New_York |
-| ----------- | ---------------------------- | ------------------------- |
-| **timestamp** | 2014-12-12 6:00:00 | 2014-12-12 6:00:00 |
-| **timestamp with local time zone** | 2014-12-12 9:00:00 | 2014-12-12 6:00:00 |
\ No newline at end of file
+| Type                               | Value in America/Los_Angeles | Value in America/New_York |
+| ---------------------------------- | ---------------------------- | ------------------------- |
+| **timestamp**                      | 2014-12-12 6:00:00           | 2014-12-12 6:00:00        |
+| **timestamp with local time zone** | 2014-12-12 9:00:00           | 2014-12-12 6:00:00        |
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index c9c9311aab..5a3d2767af 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -1155,6 +1155,9 @@ records non-null values, a DATA stream that records the number of
 seconds after 1 January 2015, and a SECONDARY stream that records the
 number of nanoseconds.
 
+* Note that if writer timezone is set, 1 January 2015 is according to
+this timezone and not according to UTC.
+
 Because the number of nanoseconds often has a large number of trailing
 zeros, the number has trailing decimal zero digits removed and the
 last three bits are used to record how many zeros were removed. if the
@@ -1170,6 +1173,35 @@ DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
               | DATA            | No       | Signed Integer RLE v2
               | SECONDARY       | No       | Unsigned Integer RLE v2
 
+Due to ORC-763, values before the UNIX epoch which have nanoseconds greater
+than 999,999 are adjusted to have 1 second less.
+
+For example, given a stripe with a TIMESTAMP column with a writer timezone
+of US/Pacific, and a reader timezone of UTC, we have the decoded integer values
+of -1,440,851,103 from the DATA stream and 199,900,000 from the SECONDARY stream.
+
+First we must adjust the DATA value to be relative to the UNIX epoch. The ORC
+epoch is 1 January 2015 00:00:00 US/Pacific, since we must take into account the writer
+timezone.
This translates to 1 January 2015 08:00:00 UTC, as US/Pacific is equivalent
+to a -08:00 offset from UTC at that date (no daylight savings). The number of seconds
+from 1 January 1970 00:00:00 UTC to 1 January 2015 08:00:00 UTC is 1,420,099,200. This is
+added to the DATA value to produce a value of -20,751,903. As this is before the
+UNIX epoch (since it is negative), and the SECONDARY value, 199,900,000, is
+greater than 999,999, then this DATA value is adjusted to become -20,751,904
+(1 second subtracted).
+
+This value by itself represents 5 May 1969 19:34:56.1999, which now needs to be adjusted
+from US/Pacific (the writer's timezone) to UTC (the reader's timezone). As the value is
+within daylight savings for US/Pacific, 7 hours are subtracted to give the final value
+of 5 May 1969 12:34:56.1999.
+
+For a TIMESTAMP_INSTANT column, this process is much simpler. Given the same values
+for DATA and SECONDARY stream, and given the offset from 1 January 1970 00:00:00 UTC
+to 1 January 2015 00:00:00 UTC is 1,420,070,400 seconds, we first add this to
+the DATA value -1,440,851,103 to produce -20,780,703 which is then adjusted 1 second
+back to -20,780,704. Paired with the SECONDARY value of 199,900,000 nanoseconds, this
+represents 5 May 1969 11:34:56.1999 UTC.
+
 ## Struct Columns
 
 Structs have no data themselves and delegate everything to their child