Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(deltalake): don't overwrite newer rows if we can avoid it #251

Merged
merged 1 commit into from
Jul 25, 2023

Conversation

mikix
Copy link
Contributor

@mikix mikix commented Jul 21, 2023

If both the lake table and the update rows have meta.lastUpdated, only push the update row if it is actually a newer version.

This should let folks with meta.lastUpdated support more safely mix and match batches of ndjson without having to stress about when each batch was exported from the EHR.

Unfortunately Epic doesn't provide this field, so this benefit is limited to other EHRs.

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

@mikix mikix force-pushed the mikix/respect-last-updated branch 2 times, most recently from 833e7f8 to 822eea2 Compare July 24, 2023 19:50
@mikix mikix changed the title WIP: feat(deltalake): don't overwrite newer rows if we can avoid it feat(deltalake): don't overwrite newer rows if we can avoid it Jul 24, 2023
@mikix mikix marked this pull request as ready for review July 24, 2023 20:07
Comment on lines -255 to +272
global _first_header
if not _first_header:
print("###############################################################")
_first_header = False
rich.get_console().rule()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first header stuff felt less important now that the horizontal rule is prettier. Now I kind of like the separator between your command and the output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example of new divider line:
image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooh yeah i like it

@@ -97,6 +98,63 @@ def test_added_field(self):
self.store(self.df(b={"one": 1, "two": 2}))
self.assert_lake_equal(self.df(a={"one": 1}, b={"one": 1, "two": 2}))

def test_last_updated_support(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Help me brainstorm any situations I might have forgotten, plz

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need do check any date only/partial dates here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A) probably not, since this field is an instant field per the spec and that means it must provide at least second accuracy (though can optionally provide subsecond accuracy)

B) but why not, I threw a line in this test for it - it seems to be handled OK

) -> loaders.Directory:
) -> common.Directory:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This little refactor (moving the Directory and RealDirectory prototypes out to common) is largely unrelated, but was done to reduce inter-module dependencies and avoid a circular dependency (caused by cli_utils being used by other siblings to loaders).

Comment on lines +179 to +180
# The cast-as-timestamp does not seem to noticeably slow us down.
# If it becomes an issue, we could always actually convert this string column to a real date/time column.
Copy link
Contributor Author

@mikix mikix Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the performance of adding this new update-condition on BCH's 2nd-biggest table (medicationrequest). Here are some average runs of 100k batches:

Original Code:        374s avg (419+362+354+362)
New code, no dates:   377s avg (425+365+358+360)
New code, old dates:  386s avg (418+366+386+372)
New code, new dates:  381s avg (422+367+368+367)
New code, same dates: 381s avg (434+368+362+360)

So even in the worst test (386s), it's only 3% slower. And I suspect that it's just normal variance.

Notably, the "same date" case (which saw no table modifications at all) was just as slow. Seems like the time cost is mostly overhead / id-matching.

Comment on lines -255 to +272
global _first_header
if not _first_header:
print("###############################################################")
_first_header = False
rich.get_console().rule()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooh yeah i like it

Comment on lines 49 to 50
One reason Epic presumably doesn't offer `_since` support is because they also don't provide
metadata about when each record was last updated (`meta.lastUpdated` in FHIR).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my lawyer alert is triggering on this - we should maybe trim the assumption and just state the fact?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea - done.

@@ -97,6 +98,63 @@ def test_added_field(self):
self.store(self.df(b={"one": 1, "two": 2}))
self.assert_lake_equal(self.df(a={"one": 1}, b={"one": 1, "two": 2}))

def test_last_updated_support(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need do check any date only/partial dates here?

If both the lake table and the update rows have meta.lastUpdated,
only push the update row if it is actually a newer version.

This should let folks with meta.lastUpdated support more safely mix
and match batches of ndjson without having to stress about when
each batch was exported from the EHR.

Unfortunately Epic doesn't provide this field, so this benefit is
limited to other EHRs.
@mikix mikix force-pushed the mikix/respect-last-updated branch from 822eea2 to 1b024e7 Compare July 25, 2023 17:25
@mikix mikix merged commit 8c11f10 into main Jul 25, 2023
2 checks passed
@mikix mikix deleted the mikix/respect-last-updated branch July 25, 2023 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants