DailyMed NDC to Label Image Mart #326
Merged
Created an XPath for Media that looks for ObservationMedia and grabs the image file name (if it exists; we probably need a test for existence to reduce the noise) along with the entire text of the component. The next step is to build a dbt staging model that uses RegEx to pull the NDC out of the <Text/> element, since XPath doesn't natively support that.
Changed template to look specifically at the package label display panel section(s) in the SPL for images. Also updated the staging table to have nested XMLTABLE commands (thanks ChatGPT).
jrlegrand changed the title from "DailyMed NDC to Image Mart" to "DailyMed NDC to Label Image Mart" on Oct 24, 2024
lprzychodzien approved these changes on Oct 30, 2024
LGTM
Resolves #309
Resolves #322
Explanation
Took the approach of using FTP to download all desired DailyMed SPL zip files. You can specify in the DAG whether you want all human rx files, just 1 of the 5 human rx files, OTC, etc. By default it pulls all human rx.
Extract:
The DAG unzips all outer zip files into one folder, leaving a folder (e.g. data/dailymed/prescription) full of thousands of inner zip files.
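As a rough illustration of the extract step, here is a minimal sketch using only the standard library. The function name `extract_outer_zips` and the choice to flatten any folder prefix stored inside the outer archive are assumptions made for this example, not necessarily how the DAG does it.

```python
import shutil
import zipfile
from pathlib import Path


def extract_outer_zips(download_dir: Path, target_dir: Path) -> int:
    """Unzip every outer archive in download_dir, writing the inner zips into target_dir.

    Returns the number of inner zip files written. (Hypothetical helper.)
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for outer_path in sorted(download_dir.glob("*.zip")):
        with zipfile.ZipFile(outer_path) as outer:
            for member in outer.infolist():
                # Keep only the per-SPL inner zip files; skip directories and other files.
                if member.is_dir() or not member.filename.lower().endswith(".zip"):
                    continue
                # Assumption: flatten any folder prefix stored in the outer archive.
                destination = target_dir / Path(member.filename).name
                with outer.open(member) as src, open(destination, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                count += 1
    return count
```

Calling `extract_outer_zips(Path("downloads"), Path("data/dailymed/prescription"))` would leave the target folder holding only inner zip files, which is the state the load step expects. The `downloads` path here is a placeholder.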
Load:
The DAG peeks inside each zip file and unzips just the XML document to the folder. It then parses the XML document using a custom XSLT template (dags/dailymed/template.xsl), deletes that XML file, and moves on to the next zip file. The resulting XML created from the template is stored in a list of dicts; when the loop finishes, the list is converted to a Pandas dataframe and loaded into Postgres. Any changes to the initial XML parsing need to be made in the template.xsl file for now. A future optimization is to modularize this a bit so different parts can live in different XSL files.
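The load loop can be sketched roughly as follows. This is not the DAG's code: it reads the XML member straight into memory instead of extracting it to disk and deleting it afterwards, and `load_spl_zips`, the `transform` callable, and the dict keys `zip_file` and `xml_content` are all hypothetical names. In the real pipeline, `transform` would stand in for applying the XSLT template (for instance with lxml's `etree.XSLT`).

```python
import zipfile
from pathlib import Path
from typing import Callable


def load_spl_zips(zip_dir: Path, transform: Callable[[bytes], str]) -> list[dict]:
    """Apply transform to the XML document inside each SPL zip and collect the results."""
    rows = []
    for zip_path in sorted(zip_dir.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as spl_zip:
            xml_names = [n for n in spl_zip.namelist() if n.lower().endswith(".xml")]
            if not xml_names:
                # No XML document in this archive; nothing to parse.
                continue
            # Read only the XML member; images in the zip are never decompressed.
            raw_xml = spl_zip.read(xml_names[0])
        rows.append({"zip_file": zip_path.name, "xml_content": transform(raw_xml)})
    return rows
```

The returned list of dicts has the shape that `pandas.DataFrame(rows)` accepts, so the dataframe conversion and the Postgres load can follow directly.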
NOTE: see example XML template output at the bottom of this PR.
Transform:
This is probably the most unusual part, but the part I like the best. Instead of using Python or other methods to transform the resulting smaller XML document, I use dbt data models (using PostgreSQL XML functions) to transform the XML in the data lake in a stepwise manner using staging and intermediate tables that can be checked along the way for troubleshooting purposes.
Rough transform workflow:
Rationale
DailyMed SPLs have label images for many drug products, but they are not at the NDC level - they are at the SPL level. To get to NDC-level images, you need to do something along the lines of what we've done here.
NDC-level images are useful for drug purchasing or basic drug information about what an NDC looks like.
Tests
Ran DAG to completion and built marts using dbt run --select ndcs_to_label_images
This produced around 57k NDC -> label image matches at the time of writing this PR
I had run this DAG several times before and each time, I compared outputs manually to try to validate that I wasn't breaking anything that worked before and was actually adding new matches. I feel like I'm at a stable point where enough is working well that this should finally be merged to main.
Future Enhancements
NOTE: every time this DAG is run, we currently need to manually DROP/CASCADE from the sagerx_lake.dailymed table to avoid duplication. This needs to be addressed.
ALSO NOTE: I think if we expand from just all human rx to all human rx and OTC, something weird happens with the folders during extract or load. If we expand to human and OTC this needs to be fixed.
General optimizations:
example XML template output
NOTE: the important parts for this work are everything inside <PackageLabels />.
- <MediaList /> contains a list of all images found directly within or referenced from the package label section. We try to associate this with any NDCs parsed out of the text of the section and also try to parse NDCs directly from the image name (e.g. sometimes images are named "12345-456-2.jpg").
- <Text /> is the raw text of the section that we parse for NDCs using RegEx in a dbt data model.
- <ID /> is the ID of the section.
There can be multiple package label sections. In this example, there is only one.
<NDCList /> is also relevant to this work. It contains the list of NDCs represented by the SPL overall. It is used to validate any NDCs we parse out of the text of the package label section.
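To make the matching idea concrete, here is a small Python sketch of it. In the PR, the parsing itself happens in dbt models with PostgreSQL functions, so this is only an illustration. The regular expression, the function names, and the rule of preferring an NDC found in the image name over NDCs found in the section text are all assumptions for this example.

```python
import re

# Assumption: hyphenated NDC-like strings with 4-5, 3-4, and 1-2 digit segments.
NDC_PATTERN = re.compile(r"\b\d{4,5}-\d{3,4}-\d{1,2}\b")


def find_ndcs(text: str) -> list[str]:
    """Return NDC-like strings in order of first appearance, without duplicates."""
    return list(dict.fromkeys(NDC_PATTERN.findall(text)))


def match_images_to_ndcs(
    section_text: str, image_names: list[str], spl_ndcs: list[str]
) -> list[tuple[str, str]]:
    """Pair each image with NDCs, keeping only NDCs that appear in the SPL's NDC list."""
    valid = set(spl_ndcs)
    text_ndcs = [ndc for ndc in find_ndcs(section_text) if ndc in valid]
    pairs = []
    for image in image_names:
        name_ndcs = [ndc for ndc in find_ndcs(image) if ndc in valid]
        # Hypothetical rule: an NDC in the file name wins; otherwise fall back
        # to every validated NDC found in the section text.
        for ndc in name_ndcs or text_ndcs:
            pairs.append((ndc, image))
    return pairs
```

Validating against the SPL-level NDC list filters out strings that merely look like NDCs, such as lot numbers or NDCs of other products mentioned in the label text.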