ORC-1671: update timestamp doc in specification #1867

Draft · wants to merge 1 commit into base: main
12 changes: 7 additions & 5 deletions site/_docs/types.md
@@ -69,7 +69,9 @@ create table Foobar (

ORC includes two different forms of timestamps from the SQL world:

* **Timestamp** is a date and time without a time zone, which does not change based on the time zone of the reader.
* **Timestamp** is a date and time without a time zone, where the timestamp value is stored in the writer timezone
encoded at the stripe level, if present. ORC readers will read this value back into the reader's timezone. Usually
both writer and reader timezones default to UTC, however older ORC files may contain non-UTC writer timezones
Member:

> Usually both writer and reader timezones default to UTC, however older ORC files may contain non-UTC writer timezones

The main purpose of this timestamp type is to restore the wall clock time regardless of the reader time zone. IIRC, the Java implementation uses the writer local time zone instead of UTC. Due to insufficient C++ time zone utility support in earlier days, the C++ code by default uses UTC as the writer time zone to avoid complexity.

Contributor Author:
I see. I was confused by the discussion/comments here: apache/arrow#34591, which suggested that the reader's timezone was important when deserializing the values.

Member:
IMHO, we'd better not expose the implementation detail here. The original summary and the table below explain well what users should expect.

Contributor Author:
I'm not sure I agree with that, as this timestamp type has been a great source of confusion for me when attempting to implement an ORC to Arrow reader in Rust.

For example, in the test file TestOrcFile.testDate2038.orc, according to PyArrow (which uses the ORC C++ reader underneath), the first value is:

>>> from pyarrow import orc
>>> orc.read_table("/home/jeffrey/Code/orc/examples/TestOrcFile.testDate2038.orc")[0][0]
<pyarrow.TimestampScalar: '2038-05-05T12:34:56.100000000'>

The underlying integer value for the DATA stream is 736_601_696 seconds (let's ignore nanoseconds). When adding this to 1 Jan 2015 00:00:00, regardless of timezone, we get 5 May 2038 11:34:56. Which seems... surprising.

This isn't at all what I would expect, hence my confusion (perhaps I just don't properly understand how DST factors into this). But diving into the internals and seeing reader & writer timezone offsets being considered when decoding the timestamp value just adds to the confusion, especially as that contradicts the specification.

I was hoping to open up some discussion on clarifying the exact definition for this timestamp type as part of this PR.
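As a quick sanity check of the arithmetic above (a sketch only; the DATA value is the one quoted in this comment, and the ORC epoch is taken as a plain 1 January 2015 00:00:00 with no time zone applied):

```python
from datetime import datetime, timedelta

data_seconds = 736_601_696        # DATA stream value quoted above
orc_epoch = datetime(2015, 1, 1)  # ORC epoch, no time zone applied

print(orc_epoch + timedelta(seconds=data_seconds))
# 2038-05-05 11:34:56 -- one hour earlier than the 12:34:56.1 that PyArrow
# reports, which appears to be the DST-related adjustment discussed below.
```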

Member:
Sorry for the late reply. If you find any inconsistent behavior between the C++ and Java sides, please treat the Java implementation as the reference. We need to fix any issue on the C++ side.

Contributor Author:
I'll try to look into this more to confirm my understanding 👍

Moving to draft in the meantime

* **Timestamp with local time zone** is a fixed instant in time, which does change based on the time zone of the reader.

Unless your application uses UTC consistently, **timestamp with
@@ -78,7 +80,7 @@
use cases. When users say an event is at 10:00, it is always in
reference to a certain timezone and means a point in time, rather than
10:00 in an arbitrary time zone.

| Type | Value in America/Los_Angeles | Value in America/New_York |
| ----------- | ---------------------------- | ------------------------- |
| **timestamp** | 2014-12-12 6:00:00 | 2014-12-12 6:00:00 |
| **timestamp with local time zone** | 2014-12-12 6:00:00 | 2014-12-12 9:00:00 |
| Type | Value in America/Los_Angeles | Value in America/New_York |
| ---------------------------------- | ---------------------------- | ------------------------- |
| **timestamp** | 2014-12-12 6:00:00 | 2014-12-12 6:00:00 |
| **timestamp with local time zone** | 2014-12-12 6:00:00           | 2014-12-12 9:00:00        |
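To make the distinction concrete, here is a small illustration using Python's zoneinfo (a sketch of the semantics described above, not of reading an actual ORC file):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+, requires tz data to be installed

# "timestamp": a wall-clock value with no zone attached; every reader sees
# the same wall-clock time regardless of their own time zone.
wall_clock = datetime(2014, 12, 12, 6, 0, 0)
print(wall_clock)  # 2014-12-12 06:00:00 for readers in any zone

# "timestamp with local time zone": a fixed instant; here the instant is
# 6:00 in America/Los_Angeles, so a reader in America/New_York sees 9:00.
instant = datetime(2014, 12, 12, 6, 0, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
print(instant.astimezone(ZoneInfo("America/Los_Angeles")))  # 2014-12-12 06:00:00-08:00
print(instant.astimezone(ZoneInfo("America/New_York")))     # 2014-12-12 09:00:00-05:00
```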
32 changes: 32 additions & 0 deletions site/specification/ORCv1.md
@@ -1155,6 +1155,9 @@
records non-null values, a DATA stream that records the number of
seconds after 1 January 2015, and a SECONDARY stream that records the
number of nanoseconds.

* Note that if the writer timezone is set, 1 January 2015 is according to
that timezone and not according to UTC
Comment on lines +1158 to +1159
Contributor Author:
According to the implementation:

orc/c++/src/Timezone.hh

Lines 66 to 72 in 3b5b2a6

/**
 * Get the number of seconds between the ORC epoch in this timezone
 * and Unix epoch.
 * ORC epoch is 1 Jan 2015 00:00:00 local.
 * Unix epoch is 1 Jan 1970 00:00:00 UTC.
 */
virtual int64_t getEpoch() const = 0;

int64_t writerTime = secsBuffer[i] + epochOffset;

Also, I'm not certain whether the writer timezone is mandatory in the stripe when a TIMESTAMP column is present in the file. Might need some clarification on this?

Member:
IIUC, the writer timezone field is mandatory if there is any timestamp type:

if (writeTimeZone) {
  if (useUTCTimeZone) {
    builder.setWriterTimezone("UTC");
  } else {
    builder.setWriterTimezone(TimeZone.getDefault().getID());
  }
}

Contributor Author:
Thanks, I think I'll add this to the spec as well (maybe proto too, in the other repo?)


Because the number of nanoseconds often has a large number of trailing
zeros, the number has trailing decimal zero digits removed and the
last three bits are used to record how many zeros were removed. if the
@@ -1170,6 +1173,35 @@
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2
| SECONDARY | No | Unsigned Integer RLE v2

Due to ORC-763, values before the UNIX epoch which have nanoseconds greater
than 999,999 are adjusted to have 1 second less.
Comment on lines +1176 to +1177
Contributor Author:
According to

orc/c++/src/ColumnReader.cc

Lines 350 to 352 in 9b79de9

if (secsBuffer[i] < 0 && nanoBuffer[i] > 999999) {
  secsBuffer[i] -= 1;
}


For example, given a stripe with a TIMESTAMP column with a writer timezone
of US/Pacific, and a reader timezone of UTC, we have the decoded integer values
of -1,440,851,103 from the DATA stream and 199,900,000 from the SECONDARY stream.

First we must adjust the DATA value to be relative to the UNIX epoch. The ORC
epoch is 1 January 2015 00:00:00 US/Pacific, since we must take into account the writer
timezone. This translates to 1 January 2015 08:00:00 UTC, as US/Pacific is equivalent
to a -08:00 offset from UTC at that date (no daylight savings). The number of seconds
from 1 January 1970 00:00:00 UTC to 1 January 2015 08:00:00 UTC is 1,420,099,200. This is
added to the DATA value to produce a value of -20,751,903. As this is before the
UNIX epoch (since it is negative), and the SECONDARY value, 199,900,000, is
greater than 999,999, then this DATA value is adjusted to become -20,751,904
(1 second subtracted).

This value by itself represents 5 May 1969 19:34:56.1999, which now needs to be adjusted
from US/Pacific (the writer's timezone) to UTC (the reader's timezone). As the value is
within daylight savings for US/Pacific, 7 hours are subtracted to give the final value
of 5 May 1969 12:34:56.1999.
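The arithmetic above can be reproduced with a short Python sketch that mirrors these decoding steps (it assumes the system tz database carries the 1969 US/Pacific DST rules; variable names are illustrative, not from the ORC code):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

pacific = ZoneInfo("US/Pacific")  # the writer timezone in the example

# Decoded stream values from the example.
data_seconds = -1_440_851_103  # DATA: seconds relative to the ORC epoch
secondary_nanos = 199_900_000  # SECONDARY: nanoseconds, already decoded

# The ORC epoch is 1 Jan 2015 00:00:00 in the writer timezone (US/Pacific),
# i.e. 1 Jan 2015 08:00:00 UTC, 1,420,099,200 seconds after the Unix epoch.
epoch_offset = int(datetime(2015, 1, 1, tzinfo=pacific).timestamp())
assert epoch_offset == 1_420_099_200

seconds = data_seconds + epoch_offset          # -20,751,903
if seconds < 0 and secondary_nanos > 999_999:  # ORC-763 adjustment
    seconds -= 1                               # -20,751,904

# Combined value, still expressed relative to the writer's clock.
writer_time = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
    seconds=seconds, microseconds=secondary_nanos // 1000)
print(writer_time)  # 1969-05-05 19:34:56.199900+00:00

# Wall-clock adjustment from the writer zone (PDT, UTC-7 in May 1969) to the
# reader zone (UTC): add the writer offset, i.e. subtract 7 hours. The C++
# reader re-checks the offset when this shift crosses a DST boundary, which
# does not happen in this example.
reader_time = writer_time + writer_time.astimezone(pacific).utcoffset()
print(reader_time)  # 1969-05-05 12:34:56.199900+00:00
```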
Comment on lines +1183 to +1196
Contributor Author:
I'm kinda unclear on how exactly this works; I'm just going off the C++ implementation here:

orc/c++/src/ColumnReader.cc

Lines 335 to 348 in 9b79de9

int64_t writerTime = secsBuffer[i] + epochOffset;
if (!sameTimezone) {
  // adjust timestamp value to same wall clock time if writer and reader
  // time zones have different rules, which is required for Apache Orc.
  const auto& wv = writerTimezone->getVariant(writerTime);
  const auto& rv = readerTimezone->getVariant(writerTime);
  if (!wv.hasSameTzRule(rv)) {
    // If the timezone adjustment moves the millis across a DST boundary,
    // we need to reevaluate the offsets.
    int64_t adjustedTime = writerTime + wv.gmtOffset - rv.gmtOffset;
    const auto& adjustedReader = readerTimezone->getVariant(adjustedTime);
    writerTime = writerTime + wv.gmtOffset - adjustedReader.gmtOffset;
  }
}

Happy to be corrected if my understanding or wording is inaccurate anywhere 👍

Member:
Similarly, I'm not inclined to add these details to the specs if we are not 100% sure about them.


For a TIMESTAMP_INSTANT column, this process is much simpler. Given the same values
for DATA and SECONDARY stream, and given the offset from 1 January 1970 00:00:00 UTC
to 1 January 2015 00:00:00 UTC is 1,420,070,400 seconds, we first add this to
the DATA value -1,440,851,103 to produce -20,780,703 which is then adjusted 1 second
back to -20,780,704. Paired with the SECONDARY value of 199,900,000 nanoseconds, this
represents 5 May 1969 11:34:56.1999 UTC.
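A similar sketch checks the TIMESTAMP_INSTANT numbers (here the epoch is plain UTC and no writer/reader adjustment is applied; names are illustrative):

```python
from datetime import datetime, timedelta, timezone

data_seconds = -1_440_851_103
secondary_nanos = 199_900_000

# For TIMESTAMP_INSTANT the ORC epoch is 1 Jan 2015 00:00:00 UTC,
# 1,420,070,400 seconds after the Unix epoch.
epoch_offset = int(datetime(2015, 1, 1, tzinfo=timezone.utc).timestamp())
assert epoch_offset == 1_420_070_400

seconds = data_seconds + epoch_offset          # -20,780,703
if seconds < 0 and secondary_nanos > 999_999:  # ORC-763 adjustment
    seconds -= 1                               # -20,780,704

instant = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(
    seconds=seconds, microseconds=secondary_nanos // 1000)
print(instant)  # 1969-05-05 11:34:56.199900+00:00
```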

## Struct Columns

Structs have no data themselves and delegate everything to their child