Refactor `ratio_stats` job to use Pyspark #436

dfsnow · 2024-05-09T15:01:50Z

The recently merged #422 added a Python dbt model (ratio_stats.py) that runs on Athena's Spark backend. The model relies almost exclusively on pandas for data munging and processing. This works well and is simple, but it misses out on one of the main benefits of using Spark: parallelization. We should try a quick refactor of the ratio_stats model using PySpark to see whether we can capture that benefit. Mainly, the current pandas job takes 1 hour to finish, while a Spark version is likely to be much faster.

We can also make a few other enhancements here at the same time. Namely:

- Change the data types of the ratio_stats table to be slightly more sensible
- Possibly factor out the ratio_stats_input table entirely

These will need input from @ccao-jardine and @wrridgeway.
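For context on why this job looks like a good fit for Spark: each output row depends only on the sales within one group, so groups can be processed independently. Below is a minimal plain-Python sketch of that per-group shape, using the standard definitions of COD and PRD. The function name, the input tuple layout, and the sample numbers are hypothetical and not taken from ratio_stats.py.

```python
from collections import defaultdict
from statistics import mean, median


def ratio_stats_by_group(sales):
    """Compute per-group ratio statistics.

    `sales` is an iterable of (group_key, assessed_value, sale_price)
    tuples. This layout is an assumption made for illustration only.
    """
    groups = defaultdict(list)
    for key, assessed, price in sales:
        groups[key].append((assessed, price))

    out = {}
    for key, rows in groups.items():
        ratios = [assessed / price for assessed, price in rows]
        med = median(ratios)
        # COD: mean absolute deviation from the median ratio,
        # expressed as a percentage of the median
        cod = 100 * mean([abs(r - med) for r in ratios]) / med
        # PRD: mean ratio divided by the sale-weighted mean ratio
        weighted_mean = sum(a for a, _ in rows) / sum(p for _, p in rows)
        prd = mean(ratios) / weighted_mean
        out[key] = {
            "sale_n": len(rows),
            "med_ratio": med,
            "cod": cod,
            "prd": prd,
        }
    return out


# Hypothetical sample data: three sales in one group, one in another
sample = [
    ("town_a", 90, 100),
    ("town_a", 100, 100),
    ("town_a", 110, 100),
    ("town_b", 50, 100),
]
stats = ratio_stats_by_group(sample)
```

Because no group needs data from any other group, a PySpark version could plausibly express the inner loop as a grouped aggregation, or hand a per-group function to `groupBy(...).applyInPandas(...)`. Whether that fits the actual model is something the refactor would need to confirm.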

dfsnow · 2024-06-07T18:43:21Z

@wagnerlmichael This one is yours now. Let's use it to pilot the use of Spark models within dbt, since we may want to convert sales val, source-of-truth, etc. to Spark. Let's also take this opportunity to clean up the ratio_stats table a little bit (get the dtypes corrected, drop extraneous columns, etc.).

ccao-jardine · 2024-06-12T15:40:40Z

ratio_stats is used in production for our public-facing ratio study dashboards, which are published at the mailed stage for each reassessed township when it mails.

This is one dashboard serving all townships, with an extract of the ratio_stats table that is refreshed with each 2024 township mailing. Because of that, I'd very strongly prefer not to make any changes to the production table until after we have mailed the last tri town this year.

If changes must be made now because it's blocking other work, please sequence with me on schedule so that changes aren't pushed close to a town mail date.

If it helps, the current structure of the reporting depends on no changes (data type, etc.) to the following columns in the production table:

- geography_id
- property_group
- assessment_stage
- sale_year
- sale_n
- detect_chasing
- med_ratio, cod, prb, prd
- ratio_met, cod_met, prb_met, prd_met

This table is filtered to geography_type = "Town", so if other types are added, it should be robust to those changes.

Which extraneous columns are you thinking of getting rid of?

dfsnow · 2024-06-12T16:44:26Z

Got it. @wagnerlmichael don't mess with any of the column dtypes. We'll move the cleanup stuff to a separate issue.

dfsnow added the dbt label (Related to dbt: tests, docs, schema, etc.) May 9, 2024

dfsnow self-assigned this May 9, 2024

dfsnow assigned wagnerlmichael and unassigned dfsnow Jun 7, 2024

wagnerlmichael linked a pull request Jun 25, 2024 that will close this issue

Refactor ratio stats for build speed increase #521

Draft
