feat(deltalake): don't overwrite newer rows if we can avoid it #251

mikix · 2023-07-21T17:42:15Z

If both the lake table and the update rows have meta.lastUpdated, only push the update row if it is actually a newer version.

This should let folks with meta.lastUpdated support more safely mix and match batches of ndjson without having to stress about when each batch was exported from the EHR.

Unfortunately Epic doesn't provide this field, so this benefit is limited to other EHRs.
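
For reviewers who want the rule spelled out, here is a minimal pure-Python sketch of the decision described above. It is an illustration only, not the Delta Lake merge code in this PR. The names `parse_last_updated` and `should_overwrite` are hypothetical, and treating a timestamp with no timezone as unusable is an assumption of the sketch.

```python
from datetime import datetime
from typing import Optional


def parse_last_updated(row: dict) -> Optional[datetime]:
    """Return the row's meta.lastUpdated as an aware datetime, or None if unusable."""
    value = (row.get("meta") or {}).get("lastUpdated")
    if not isinstance(value, str) or not value:
        return None
    try:
        # Python before 3.11 rejects a trailing "Z", so spell out the UTC offset.
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    # Assumption of this sketch: a value without a timezone is treated as unusable.
    return parsed if parsed.tzinfo is not None else None


def should_overwrite(lake_row: dict, update_row: dict) -> bool:
    """Decide whether an update row should replace the matching lake row."""
    old = parse_last_updated(lake_row)
    new = parse_last_updated(update_row)
    if old is None or new is None:
        # Without both timestamps we cannot compare, so overwrite as before.
        return True
    # Only push the update if it is strictly newer than what the lake holds.
    return new > old
```

With this rule, a batch exported earlier can be loaded after a later one without rolling rows back, and a row with an identical timestamp is left untouched.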

Checklist

Consider if documentation (like in docs/) needs to be updated
Consider if tests should be added

mikix · 2023-07-21T17:42:57Z

cumulus_etl/common.py

-    global _first_header
-    if not _first_header:
-        print("###############################################################")
-    _first_header = False
+    rich.get_console().rule()


The first header stuff felt less important now that the horizontal rule is prettier. Now I kind of like the separator between your command and the output.

Example of new divider line:

ooh yeah i like it

mikix · 2023-07-21T17:43:34Z

tests/test_deltalake.py

@@ -97,6 +98,63 @@ def test_added_field(self):
        self.store(self.df(b={"one": 1, "two": 2}))
        self.assert_lake_equal(self.df(a={"one": 1}, b={"one": 1, "two": 2}))

+    def test_last_updated_support(self):


Help me brainstorm any situations I might have forgotten, plz

do we need to check any date only/partial dates here?

A) probably not, since this field is an instant field per the spec and that means it must provide at least second accuracy (though can optionally provide subsecond accuracy)

B) but why not, I threw a line in this test for it - it seems to be handled OK
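
To make point A concrete, here is a rough sketch of what "at least second accuracy" means for an instant value. The pattern is a simplification written for this comment, not the full regex from the FHIR spec, and `looks_like_instant` is a hypothetical helper that is not part of this PR.

```python
import re

# Simplified shape of a FHIR instant: full date, time down to the second,
# optional fractional seconds, and a mandatory timezone.
INSTANT_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})"
)


def looks_like_instant(value: str) -> bool:
    """Return True if the string has the general shape of a FHIR instant."""
    return INSTANT_RE.fullmatch(value) is not None
```

A date-only or month-only value fails this check, which is why partial dates should not normally show up in meta.lastUpdated.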

mikix · 2023-07-24T19:55:30Z

cumulus_etl/chart_review/cli.py

-) -> loaders.Directory:
+) -> common.Directory:


This little refactor (moving the Directory and RealDirectory prototypes out to common) is largely unrelated, but was done to reduce inter-module dependencies and avoid a circular dependency (caused by cli_utils being used by other siblings to loaders).

mikix · 2023-07-24T20:06:45Z

cumulus_etl/formats/deltalake.py

+        # The cast-as-timestamp does not seem to noticeably slow us down.
+        # If it becomes an issue, we could always actually convert this string column to a real date/time column.


I tested the performance of adding this new update-condition on BCH's 2nd-biggest table (medicationrequest). Here are some average runs of 100k batches:

Original code: 374s avg (419+362+354+362)
New code, no dates: 377s avg (425+365+358+360)
New code, old dates: 386s avg (418+366+386+372)
New code, new dates: 381s avg (422+367+368+367)
New code, same dates: 381s avg (434+368+362+360)

So even in the worst test (386s), it's only 3% slower. And I suspect that it's just normal variance.

Notably, the "same date" case (which saw no table modifications at all) was just as slow. Seems like the time cost is mostly overhead / id-matching.
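
For anyone who wants to double-check the arithmetic, this small sketch recomputes the averages and the slowdown relative to the original code from the run times quoted above. The variable names are only for illustration.

```python
# Run times in seconds, copied from the numbers quoted in this comment.
runs = {
    "Original code": [419, 362, 354, 362],
    "New code, no dates": [425, 365, 358, 360],
    "New code, old dates": [418, 366, 386, 372],
    "New code, new dates": [422, 367, 368, 367],
    "New code, same dates": [434, 368, 362, 360],
}

# Mean run time per scenario.
averages = {name: sum(times) / len(times) for name, times in runs.items()}
baseline = averages["Original code"]

# Percent slowdown of each scenario relative to the original code.
slowdowns = {name: (avg / baseline - 1) * 100 for name, avg in averages.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}s avg ({slowdowns[name]:+.1f}% vs original)")
```

The spread between individual runs of the same scenario (for example 354s to 419s for the original code) is much larger than the gap between scenario averages, which fits the guess that the difference is mostly variance.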

dogversioning · 2023-07-25T15:08:02Z

cumulus_etl/common.py

-    global _first_header
-    if not _first_header:
-        print("###############################################################")
-    _first_header = False
+    rich.get_console().rule()


ooh yeah i like it

dogversioning · 2023-07-25T15:49:06Z

docs/vendors/epic.md

+One reason Epic presumably doesn't offer `_since` support is because they also don't provide
+metadata about when each record was last updated (`meta.lastUpdated` in FHIR).


my lawyer alert is triggering on this - we should maybe trim the assumption and just state the fact?

Good idea - done.

dogversioning · 2023-07-25T15:57:54Z

tests/test_deltalake.py

@@ -97,6 +98,63 @@ def test_added_field(self):
        self.store(self.df(b={"one": 1, "two": 2}))
        self.assert_lake_equal(self.df(a={"one": 1}, b={"one": 1, "two": 2}))

+    def test_last_updated_support(self):


do we need to check any date only/partial dates here?


mikix force-pushed the mikix/respect-last-updated branch 2 times, most recently from 833e7f8 to 822eea2 Compare July 24, 2023 19:50

mikix changed the title ~~WIP: feat(deltalake): don't overwrite newer rows if we can avoid it~~ feat(deltalake): don't overwrite newer rows if we can avoid it Jul 24, 2023

mikix marked this pull request as ready for review July 24, 2023 20:07

mikix commented Jul 24, 2023


dogversioning approved these changes Jul 25, 2023


mikix force-pushed the mikix/respect-last-updated branch from 822eea2 to 1b024e7 Compare July 25, 2023 17:25

mikix merged commit 8c11f10 into main Jul 25, 2023
2 checks passed

mikix deleted the mikix/respect-last-updated branch July 25, 2023 17:51
