-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(deltalake): don't overwrite newer rows if we can avoid it #251
Conversation
833e7f8
to
822eea2
Compare
global _first_header | ||
if not _first_header: | ||
print("###############################################################") | ||
_first_header = False | ||
rich.get_console().rule() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first header stuff felt less important now that the horizontal rule is prettier. Now I kind of like the separator between your command and the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooh yeah i like it
@@ -97,6 +98,63 @@ def test_added_field(self): | |||
self.store(self.df(b={"one": 1, "two": 2})) | |||
self.assert_lake_equal(self.df(a={"one": 1}, b={"one": 1, "two": 2})) | |||
|
|||
def test_last_updated_support(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Help me brainstorm any situations I might have forgotten, plz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need do check any date only/partial dates here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A) probably not, since this field is an instant
field per the spec and that means it must provide at least second accuracy (though can optionally provide subsecond accuracy)
B) but why not, I threw a line in this test for it - it seems to be handled OK
) -> loaders.Directory: | ||
) -> common.Directory: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This little refactor (moving the Directory
and RealDirectory
prototypes out to common
) is largely unrelated, but was done to reduce inter-module dependencies and avoid a circular dependency (caused by cli_utils
being used by other siblings to loaders
).
# The cast-as-timestamp does not seem to noticeably slow us down. | ||
# If it becomes an issue, we could always actually convert this string column to a real date/time column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the performance of adding this new update-condition on BCH's 2nd-biggest table (medicationrequest
). Here are some average runs of 100k batches:
Original Code: 374s avg (419+362+354+362)
New code, no dates: 377s avg (425+365+358+360)
New code, old dates: 386s avg (418+366+386+372)
New code, new dates: 381s avg (422+367+368+367)
New code, same dates: 381s avg (434+368+362+360)
So even in the worst test (386s), it's only 3% slower. And I suspect that it's just normal variance.
Notably, the "same date" case (which saw no table modifications at all) was just as slow. Seems like the time cost is mostly overhead / id-matching.
global _first_header | ||
if not _first_header: | ||
print("###############################################################") | ||
_first_header = False | ||
rich.get_console().rule() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooh yeah i like it
docs/vendors/epic.md
Outdated
One reason Epic presumably doesn't offer `_since` support is because they also don't provide | ||
metadata about when each record was last updated (`meta.lastUpdated` in FHIR). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my lawyer alert is triggering on this - we should maybe trim the assumption and just state the fact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea - done.
@@ -97,6 +98,63 @@ def test_added_field(self): | |||
self.store(self.df(b={"one": 1, "two": 2})) | |||
self.assert_lake_equal(self.df(a={"one": 1}, b={"one": 1, "two": 2})) | |||
|
|||
def test_last_updated_support(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need do check any date only/partial dates here?
If both the lake table and the update rows have meta.lastUpdated, only push the update row if it is actually a newer version. This should let folks with meta.lastUpdated support more safely mix and match batches of ndjson without having to stress about when each batch was exported from the EHR. Unfortunately Epic doesn't provide this field, so this benefit is limited to other EHRs.
822eea2
to
1b024e7
Compare
If both the lake table and the update rows have
meta.lastUpdated
, only push the update row if it is actually a newer version.This should let folks with
meta.lastUpdated
support more safely mix and match batches of ndjson without having to stress about when each batch was exported from the EHR.Unfortunately Epic doesn't provide this field, so this benefit is limited to other EHRs.
Checklist
docs/
) needs to be updated