Making PUDL into a Data Warehouse #1838
-
Hey Noon @silky, @7imon7ays, and Brent aka @turbo3136. Rehashing #1401 a bit here: we have several types of tasks / data assets that should probably get slotted into the ETL / data warehousing process somewhere, but I'm not really sure where, what the best way to arrange the dependencies between them will be, or what best practices around organizing the process are. Some of the big ones:

Raw-ish extracted data

Structural data repair or integration processes
We have a couple of ML-lite processes that we use to knit different datasets together (see the sketch after this list). Right now this work happens in a bunch of different places, which is kind of a mess.

Analysis
The "analysis" tables are typically the results of more complex calculations that I think we'll probably want to keep doing in Python. They depend on lots of other data in the database, and they end up creating a small number of new columns that should usually have the same primary keys as existing entities (e.g. monthly per-generator records). This includes stuff like:
Denormalized output tables / database views
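A minimal sketch of the kind of "ML-lite" knitting step mentioned above: fuzzy-matching plant names between two datasets with the standard library's difflib to build a crosswalk table. The DataFrames, column names, and similarity threshold are hypothetical illustrations, not PUDL's actual matching code.

```python
import difflib
import pandas as pd

# Two hypothetical datasets that describe the same plants with slightly
# different names, which we want to knit together into one crosswalk.
ferc_plants = pd.DataFrame({"ferc_plant_name": ["Barry Steam", "Big Bend Sta."]})
eia_plants = pd.DataFrame({"eia_plant_name": ["Barry", "Big Bend Station"]})

def best_match(name, candidates):
    """Return the closest candidate name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None

crosswalk = ferc_plants.assign(
    eia_plant_name=ferc_plants["ferc_plant_name"].apply(
        lambda n: best_match(n, eia_plants["eia_plant_name"].tolist())
    )
)
print(crosswalk)
```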
-
Another thing I'm really unclear on with column-oriented data warehouses is how people deal with relationships between tables. It seems like data normalization and foreign key constraints kind of get tossed out the window, but these seem really important to me for knowing / requiring that the data really has the structure we think it has, for ensuring there's only one source of truth, and for letting us find errors / outliers to correct. How do folks typically represent these constraints? Do you typically have a nice normalized original database, which is then used to derive all the other denormalized tables downstream?
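A minimal sketch, using the standard library's sqlite3, of the "normalized source of truth that derives denormalized tables downstream" pattern the question describes: enforced foreign keys on the normalized tables, with a denormalized view built on top. Table and column names are hypothetical, not PUDL's real schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

con.executescript(
    """
    CREATE TABLE utilities (
        utility_id INTEGER PRIMARY KEY,
        utility_name TEXT NOT NULL
    );
    CREATE TABLE plants (
        plant_id INTEGER PRIMARY KEY,
        plant_name TEXT NOT NULL,
        utility_id INTEGER NOT NULL REFERENCES utilities (utility_id)
    );
    -- Denormalized output derived from the normalized source of truth:
    CREATE VIEW plants_denorm AS
    SELECT p.plant_id, p.plant_name, u.utility_name
    FROM plants AS p
    JOIN utilities AS u USING (utility_id);
    """
)

con.execute("INSERT INTO utilities VALUES (1, 'Example Electric Co.')")
con.execute("INSERT INTO plants VALUES (10, 'Example Plant', 1)")
# This would raise sqlite3.IntegrityError because utility_id 99 doesn't exist:
# con.execute("INSERT INTO plants VALUES (11, 'Orphan Plant', 99)")
print(con.execute("SELECT * FROM plants_denorm").fetchall())
```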
-
I like the promise of Dagster to solve all of the "visibility into intermediate outputs" issues via caching. But it also forces people to access PUDL via the development Python package and pandas, which I speculate is a high enough bar that it would drastically cut down the number of users. Based on that intuition, I'd lean towards a more limited, but easier to distribute and use, database-based caching method (AKA a data warehouse / data marts). It'll take some careful consideration to pick the right points in the DAG, but it seems tractable.
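A minimal sketch, assuming pandas and SQLAlchemy, of the database-based caching idea: materialize a chosen intermediate output of the DAG into a shared SQLite file so people can query it without the pudl Python package. The file path, function, and table name are hypothetical.

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl_mart.sqlite")

def cache_intermediate(df: pd.DataFrame, table_name: str) -> None:
    """Write one intermediate output of the DAG to the shared database."""
    df.to_sql(table_name, engine, if_exists="replace", index=False)

# e.g. after some transform step produces a DataFrame:
cache_intermediate(
    pd.DataFrame({"plant_id": [1], "capacity_mw": [100.0]}),
    "plants_intermediate",
)
```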
-
PUDL has two main parts: the ETL and what we call the output/analysis layer. The ETL extracts poorly structured data from Excel, DBF, and XBRL files and transforms it into well-structured relational tables. The output tables are a collection of tabular outputs that contain the most useful core information from the PUDL data packages, including additional keys and human-readable names for the objects. You can read more about the output and analysis layers in the docs. Currently, the data these layers produce is only accessible via the pudl Python package, but we are hoping to add many of the tables to a SQLite database we can share more easily. See issue #1178.
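A minimal sketch of how someone could browse that hoped-for shared SQLite database with just pandas and the standard library, no pudl package install. The database path and table name are hypothetical.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("pudl.sqlite")
# List the available output tables, then pull a few rows from one of them.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", con)
print(tables)
sample = pd.read_sql("SELECT * FROM some_output_table LIMIT 5", con)
```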
We are applying Dagster abstractions to our existing code to leverage Dagster CD, parallel processing of tasks, and documentation of our ETL and data assets. Issue #1487 contains our reasoning for using Dagster. While we are adding Dagster to our codebase, we also want to be thinking about how to improve our overall design. Issue #1401 contains helpful background on our plethora of design pain points.
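A minimal sketch, assuming Dagster's software-defined asset API, of what wrapping existing extract/transform functions as assets could look like: Dagster infers the dependency between the two assets from the argument name, can run independent assets in parallel, and surfaces the descriptions in its UI. The asset names and contents are hypothetical, not PUDL's real code.

```python
import pandas as pd
from dagster import asset, materialize

@asset(description="Raw-ish extracted generator records.")
def raw_generators() -> pd.DataFrame:
    return pd.DataFrame({"plant_id": [1, 2], "capacity_mw": [100.0, 50.0]})

@asset(description="Cleaned, well-structured generator table.")
def clean_generators(raw_generators: pd.DataFrame) -> pd.DataFrame:
    # Downstream asset: depends on raw_generators via the parameter name.
    return raw_generators.dropna()

if __name__ == "__main__":
    # Materialize both assets in dependency order.
    materialize([raw_generators, clean_generators])
```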
Main issues: