DailyMed XML processing for NDC -> image #309

Closed
jrlegrand opened this issue Jul 16, 2024 · 0 comments · Fixed by #326
jrlegrand commented Jul 16, 2024

Problem Statement

We need to extract a mapping from NDC to image file name from the DailyMed XML.

Criteria for Success

Data mart for NDC -> image

Additional Information

I looked through DailyMed's SPL stylesheet

  • I think there are some neat XML tricks we can learn from it. My main takeaway is that if we can really understand how DailyMed crafts the XML template for their website, that's the closest thing we have to a source of truth.
  • Specifically for the ObservationMedia stuff, I think we are mostly doing what DailyMed is doing, though they do some specialized handling that may or may not be important.
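For reference, here is a minimal sketch of the ObservationMedia lookup as I understand it: each `renderMultiMedia` element points at an `observationMedia` by ID, and the `observationMedia` holds the image file name. The sample XML is a simplified, hypothetical fragment, and the function names are made up for illustration; this is not a copy of what the stylesheet does.

```python
import xml.etree.ElementTree as ET

# SPL documents use the HL7 v3 namespace.
NS = {"v3": "urn:hl7-org:v3"}

# Simplified, hypothetical SPL fragment; real labels carry far more structure.
SAMPLE_SPL = """\
<document xmlns="urn:hl7-org:v3">
  <component><structuredBody><component>
    <section>
      <code code="51945-4" codeSystem="2.16.840.1.113883.6.1"
            displayName="PACKAGE LABEL.PRINCIPAL DISPLAY PANEL"/>
      <text><paragraph><renderMultiMedia referencedObject="MM1"/></paragraph></text>
      <component>
        <observationMedia ID="MM1">
          <value mediaType="image/jpeg"><reference value="carton-label.jpg"/></value>
        </observationMedia>
      </component>
    </section>
  </component></structuredBody></component>
</document>
"""


def media_files(root):
    """Map each observationMedia ID to the image file name it references."""
    files = {}
    for media in root.iterfind(".//v3:observationMedia", NS):
        ref = media.find("v3:value/v3:reference", NS)
        if ref is not None:
            files[media.get("ID")] = ref.get("value")
    return files


def rendered_images(root):
    """Resolve every renderMultiMedia pointer to a file name, in document order."""
    files = media_files(root)
    return [
        files[r.get("referencedObject")]
        for r in root.iterfind(".//v3:renderMultiMedia", NS)
        if r.get("referencedObject") in files
    ]


root = ET.fromstring(SAMPLE_SPL)
print(rendered_images(root))
```

Any specialized handling the stylesheet does on top of this lookup is not covered here.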

Probably the bigger question is how we tackle the final piece of consuming the focused XML sections (gleaned / transformed / compiled using XSLT templates) from each pathway.

If we need to OCR images, does that mean we need to unzip all the zip files to get the images out? I'm not sure how much storage space that would take up, but I assume it would be pretty large. Would it make more sense to OCR a hosted image instead of a local one? We could get the DailyMed image URL from the XML and point the OCR tool at that URL instead of a local file. There are also a lot of images that have nothing to do with labels (e.g. chemical structures or administration instruction diagrams) that we don't need to bother unzipping and/or OCR-ing.
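One middle ground would be to read only the label images straight out of each zip in memory, so nothing gets unzipped to disk. A rough sketch, assuming each SPL zip holds the XML plus its image files and that the set of wanted file names has already been pulled from the XML. `read_wanted_images` and the stand-in archive are hypothetical and only there so the example runs.

```python
import io
import zipfile

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")


def read_wanted_images(zip_bytes, wanted):
    """Return {file name: bytes} for the wanted images only, without writing to disk."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name in wanted and name.lower().endswith(IMAGE_SUFFIXES):
                found[name] = archive.read(name)
    return found


# Build a tiny stand-in archive so the sketch runs without real DailyMed data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("label.xml", "<document/>")
    archive.writestr("carton-label.jpg", b"fake label bytes")
    archive.writestr("structure.jpg", b"fake chemical structure bytes")

images = read_wanted_images(buffer.getvalue(), wanted={"carton-label.jpg"})
print(sorted(images))
```

The bytes for each image could then be handed to whatever OCR tool we pick, and the non-label images (like `structure.jpg` here) are never read.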

If we leave everything zipped (as we do currently), we could spit out a smaller, more focused XML document that Python/pandas can pick up and parse through pretty easily with XPath to create the columns in a dataframe. I am doing the equivalent of this currently in my branch (https://github.com/coderxio/sagerx/tree/jrlegrand/dailymed), but using SQL. Meaning - the smaller XML document is stored in an xml column in Postgres, and then dbt models use SQL to do essentially what pandas would do to convert the smaller XML document to columns in one or more tables.

  • Using pandas would mean these tables are materialized.
  • Using dbt means we can decide whether we want them to be materialized in the sagerx_lake schema (this might be a weird use of dbt - maybe they would end up as materialized staging tables in sagerx_dev), or whether we want them to be normal staging views in sagerx_dev.
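To make the Python route concrete, here is a rough sketch of flattening the smaller XML document into rows. The `<spl>`/`<ndc>`/`<image>` layout is a hypothetical shape for the focused document, not what my branch actually emits, and pairing every NDC with every image is a placeholder assumption since the real linkage rule is still an open question.

```python
import xml.etree.ElementTree as ET

# Hypothetical shape for the smaller, focused document; the real one may differ.
FOCUSED_XML = """\
<spl setid="00000000-0000-0000-0000-000000000000">
  <ndc>0000-0000-01</ndc>
  <ndc>0000-0000-02</ndc>
  <image>carton-label.jpg</image>
</spl>
"""


def ndc_image_rows(xml_text):
    """Flatten one focused document into one row per (NDC, image) pair."""
    root = ET.fromstring(xml_text)
    setid = root.get("setid")
    ndcs = [element.text for element in root.findall("ndc")]
    images = [element.text for element in root.findall("image")]
    # Placeholder assumption: pair every NDC with every image in the document.
    return [
        {"setid": setid, "ndc": ndc, "image": image}
        for ndc in ndcs
        for image in images
    ]


rows = ndc_image_rows(FOCUSED_XML)
for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame(rows)` would give the dataframe version, with one column per key.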

I don't know what the performance or memory usage limitations would be for either of these options, but I assume it might be better to go the pandas route for memory reasons... not sure. I did run into an error (#238) when originally trying to load ALL SPLs, but things have changed since then, which may make that error moot.
