Making PUDL into a Data Warehouse #1838
-
Hey Noon @silky, @7imon7ays, and Brent aka @turbo3136. Rehashing #1401 a bit here: we have several types of tasks / data assets that should probably get slotted into the ETL / data warehousing process somewhere, but I'm not really sure where, what the best way to arrange the dependencies between them will be, or what best practices around organizing the process are. Some of the big ones:

Raw-ish extracted data

Structural data repair or integration processes
We have a couple of ML-lite processes that we use to knit different datasets together (see the sketch after this list). Right now this work happens in a bunch of different places, which is kind of a mess.

Analysis
The "analysis" tables are typically the results of more complex calculations that I think we'll probably want to keep doing in Python. They depend on lots of other data in the database, and they end up creating a small number of new columns that should usually have the same primary keys as existing entities (e.g. monthly per-generator records). This includes stuff like:
Denormalized output tables / database views
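A minimal sketch of the kind of "ML-lite" knitting step mentioned above: fuzzy-matching plant names between two datasets with the standard library's difflib to build a crosswalk table. The DataFrames, column names, and similarity threshold are hypothetical illustrations, not PUDL's actual matching code.

```python
import difflib
import pandas as pd

# Two hypothetical datasets that describe the same plants with slightly
# different names, which we want to knit together into one crosswalk.
ferc_plants = pd.DataFrame({"ferc_plant_name": ["Barry Steam", "Big Bend Sta."]})
eia_plants = pd.DataFrame({"eia_plant_name": ["Barry", "Big Bend Station"]})

def best_match(name, candidates):
    """Return the closest candidate name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None

crosswalk = ferc_plants.assign(
    eia_plant_name=ferc_plants["ferc_plant_name"].apply(
        lambda n: best_match(n, eia_plants["eia_plant_name"].tolist())
    )
)
print(crosswalk)
```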
-
Another thing I'm really unclear on with column-oriented data warehouses is how people deal with relationships between tables. It seems like data normalization and foreign key constraints kind of get tossed out the window, but these seem really important to me for knowing / requiring that the data really has the structure we think it has, for ensuring there's only one source of truth, and for letting us find errors / outliers to correct. How do folks typically represent these constraints? Do you typically have a nice normalized original database, which is then used to derive all the other denormalized tables downstream?
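A minimal sketch, using the standard library's sqlite3, of the "normalized source of truth that derives denormalized tables downstream" pattern the question describes: enforced foreign keys on the normalized tables, with a denormalized view built on top. Table and column names are hypothetical, not PUDL's real schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

con.executescript(
    """
    CREATE TABLE utilities (
        utility_id INTEGER PRIMARY KEY,
        utility_name TEXT NOT NULL
    );
    CREATE TABLE plants (
        plant_id INTEGER PRIMARY KEY,
        plant_name TEXT NOT NULL,
        utility_id INTEGER NOT NULL REFERENCES utilities (utility_id)
    );
    -- Denormalized output derived from the normalized source of truth:
    CREATE VIEW plants_denorm AS
    SELECT p.plant_id, p.plant_name, u.utility_name
    FROM plants AS p
    JOIN utilities AS u USING (utility_id);
    """
)

con.execute("INSERT INTO utilities VALUES (1, 'Example Electric Co.')")
con.execute("INSERT INTO plants VALUES (10, 'Example Plant', 1)")
# This would raise sqlite3.IntegrityError because utility_id 99 doesn't exist:
# con.execute("INSERT INTO plants VALUES (11, 'Orphan Plant', 99)")
print(con.execute("SELECT * FROM plants_denorm").fetchall())
```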
-
I like the promise of Dagster to solve all of the "visibility into intermediate outputs" issues via caching. But it also forces people to access PUDL via the development Python package and pandas, which I speculate is a high enough bar that it would drastically cut down the number of users. Based on that intuition, I'd lean towards a more limited, but easier to distribute and use, database-based caching method (AKA a data warehouse / data marts). It'll take some careful consideration to pick the right points in the DAG, but it seems tractable.
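A minimal sketch, assuming pandas and SQLAlchemy, of the database-based caching idea: materialize a chosen intermediate output of the DAG into a shared SQLite file so people can query it without the pudl Python package. The file path, function, and table name are hypothetical.

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///pudl_mart.sqlite")

def cache_intermediate(df: pd.DataFrame, table_name: str) -> None:
    """Write one intermediate output of the DAG to the shared database."""
    df.to_sql(table_name, engine, if_exists="replace", index=False)

# e.g. after some transform step produces a DataFrame:
cache_intermediate(
    pd.DataFrame({"plant_id": [1], "capacity_mw": [100.0]}),
    "plants_intermediate",
)
```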
-
PUDL has two main parts: the ETL and what we call the output/analysis layer. The ETL extracts poorly structured data from Excel, DBF, and XBRL files and transforms it into well-structured relational tables. The output tables are a collection of tabular outputs that contain the most useful core information from the PUDL data packages, including additional keys and human-readable names for the objects. You can read more about the output and analysis layers in the docs. Currently, the data these layers produce is only accessible via the pudl Python package, but we are hoping to add many of the tables to a SQLite database we can share more easily. See issue #1178.
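A minimal sketch of how someone could browse that hoped-for shared SQLite database with just pandas and the standard library, no pudl package install. The database path and table name are hypothetical.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("pudl.sqlite")
# List the available output tables, then pull a few rows from one of them.
tables = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", con)
print(tables)
sample = pd.read_sql("SELECT * FROM some_output_table LIMIT 5", con)
```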
We are applying Dagster abstractions to our existing code to leverage Dagster CD, parallel processing of tasks, and documentation of our ETL and data assets. Issue #1487 contains our reasoning for using Dagster. While we are adding Dagster to our codebase, we also want to be thinking about how to improve our overall design. Issue #1401 contains helpful background on our plethora of design pain points.
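A minimal sketch, assuming Dagster's software-defined asset API, of what wrapping existing extract/transform functions as assets could look like: Dagster infers the dependency between the two assets from the argument name, can run independent assets in parallel, and surfaces the descriptions in its UI. The asset names and contents are hypothetical, not PUDL's real code.

```python
import pandas as pd
from dagster import asset, materialize

@asset(description="Raw-ish extracted generator records.")
def raw_generators() -> pd.DataFrame:
    return pd.DataFrame({"plant_id": [1, 2], "capacity_mw": [100.0, 50.0]})

@asset(description="Cleaned, well-structured generator table.")
def clean_generators(raw_generators: pd.DataFrame) -> pd.DataFrame:
    # Downstream asset: depends on raw_generators via the parameter name.
    return raw_generators.dropna()

if __name__ == "__main__":
    # Materialize both assets in dependency order.
    materialize([raw_generators, clean_generators])
```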
Main issues: