Bus cost refactor #1161

csuyat-dot · 2024-06-28T21:11:14Z

Revisited the Bus procurement cost analysis to refactor/clean up all the scripts and final notebook.

Overall steps taken this round of refactor

created bus_cost_utils module to move all the common functions and variables.
adjusted cleaner scripts to reference the module.
gave cleaned datasets consistent naming convention (raw, cleaned, bus only) and identical column names to merge on.
final, merged dataset contains columns for z-score and an outlier flag. 1 set was saved with outliers, another saved without outliers.
used the merged dataset without outliers for the final analysis notebook to create all pivot tables, charts and variables.
streamlined final notebook to show mainly ZEB related metrics, used more charts and tables to display information and not be so narrative heavy.
deleted old, initial exploratory notebooks.
reorganized GCS folder by moving old initial exports to an /old folder.
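
The z-score and outlier-flag step above can be sketched roughly as follows. This is a minimal illustration, not the code in bus_cost_utils: the column names, the sample data, and the use of the sample standard deviation are all assumptions. The cutoff of 3 is taken from the review discussion on this PR.

```python
import pandas as pd

# Hypothetical merged dataset: 19 typical projects and 1 extreme one.
# Column names are placeholders, not the project's real schema.
merged = pd.DataFrame(
    {
        "project": [f"project_{i}" for i in range(20)],
        "cost_per_bus": [1_000_000] * 19 + [10_000_000],
    }
)

# z-score: how many standard deviations each value sits from the mean.
# pandas' .std() defaults to the sample standard deviation (ddof=1).
cpb = merged["cost_per_bus"]
merged["zscore_cost_per_bus"] = (cpb - cpb.mean()) / cpb.std()

# Flag rows whose absolute z-score exceeds 3.
merged["is_cpb_outlier"] = merged["zscore_cost_per_bus"].abs() > 3

# Keep one copy with outliers and one without, mirroring the two saved files.
with_outliers = merged
no_outliers = merged[~merged["is_cpb_outlier"]].reset_index(drop=True)
```

Note that with very few rows no value can reach an absolute z-score of 3, so the flag only becomes meaningful once the merged dataset is reasonably large.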

github-actions · 2024-06-28T21:12:50Z

nbviewer URLs for impacted notebooks:

tiffanychu90 · 2024-07-01T18:18:25Z

bus_procurement_cost/dgs_data_cleaner.py

@@ -2,7 +2,7 @@
 import pandas as pd
 import shared_utils
 from calitp_data_analysis.sql import to_snakecase
-
+from bus_cost_utils import *


I would change the imports to be written either like import bus_cost_utils, using functions like bus_cost_utils.new_prop_finder(), or to import only specific functions (if you really only need a couple of things), like from bus_cost_utils import new_prop_finder.

Generally, being clear on where a function is imported from is crucial, because as soon as you have multiple utils functions, you can easily lose track of where something is. In a universe with a lot of functions, writing from pandas import * and from numpy import * would create clashes because they share some function names (array and unique, for example), and the numpy star import would also shadow Python's built-in sum!

If you can change your import statements, that would make functions / where they belong very clear!
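
A small sketch of the clash, using the standard-library math and cmath modules as stand-ins for two utils modules (this is an analogy only, not the project's code). Both modules define sqrt, so whichever import runs last silently wins. A star import does the same thing for every shared name at once.

```python
# Both modules define sqrt; the second import rebinds the name without warning.
from math import sqrt
from cmath import sqrt

shadowed = sqrt(4)  # this is cmath.sqrt, so the result is a complex number

# Importing the modules themselves keeps the origin visible at every call site.
import math
import cmath

real_root = math.sqrt(4)        # clearly the real-valued version
complex_root = cmath.sqrt(-4)   # clearly the complex-valued version
```

The same reasoning applies to bus_cost_utils: a call written as bus_cost_utils.new_prop_finder() tells the reader where to look without searching the notebook.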

tiffanychu90 · 2024-07-01T18:29:17Z

bus_procurement_cost/bus_cost_utils.py

+#        data[data[col1] == data[col1].min()][col_list]
+#                  )
+
+def outlier_flag(col):


def outlier_flag(df: pd.DataFrame, col: str) -> pd.Series:
   # This applies the lambda function for you already, and also works in the absolute value.
   # There are 2 ways to write this that do the same thing:
   return df.apply(lambda x: True if abs(x[col]) > 3 else False, axis=1)
   # OR something like (double check):
   # return df[col].apply(lambda x: True if abs(x) > 3 else False)
   # Also, if you don't like booleans, you can add `.astype(int)` and it'll change True/False to 1/0 (in that order).

In your notebook:

df_agg["new_is_cpb_outlier"] = outlier_flag(df_agg, "new_zscore_cost_per_bus")

The .apply takes a lambda function, which operates on a row. There are 2 ways to write it, and this all depends on what you need to access in the row. If you have just 1 column (z-score), you can write it either way. If you need to access 2 column values, you will have to write it like df.apply(lambda x: some condition, axis=1)

Here, we want to access 2 columns in the lambda condition (state and temperature)
Ex: df.apply(lambda x: 1 if ( (x.state == "CA" ) and (x.temperature < 80) ) else 0, axis=1)

The difference in syntax is that you place the .apply in a different place, and there's also the axis=1 (operate on row) that's present.
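
A runnable sketch of the two placements, assuming a small made-up DataFrame (the data and the zscore column name are hypothetical). With one column, both forms give the same result. With two columns, only the row-wise form has access to both values.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {
        "state": ["CA", "NV", "OR", "CA"],
        "temperature": [75, 95, 80, 90],
        "zscore": [0.5, -3.2, 3.4, 1.0],
    }
)

# One column, form 1: Series.apply passes each single value to the lambda.
by_value = df["zscore"].apply(lambda x: abs(x) > 3)

# One column, form 2: DataFrame.apply with axis=1 passes each row to the lambda.
by_row = df.apply(lambda row: abs(row["zscore"]) > 3, axis=1)

# Two columns: the lambda needs the whole row, so axis=1 is required.
mild_ca = df.apply(
    lambda row: 1 if (row["state"] == "CA") and (row["temperature"] < 80) else 0,
    axis=1,
)
```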

tiffanychu90 · 2024-07-01T18:59:15Z

Overall, great refactor! A lot more clear and simple to follow. Everything under the if __name__ == "__main__": to create your datasets makes a lot of sense now, and I can follow which subsets had which transformations.

Carry forward: clear imports, now that you have your own functions, import them when you use them. You'll soon have more utility functions than just 1 script, so you'll want to make it easy for you and others to find where those functions are called from within a notebook.
This is a clarification note that you can store (it'll take some time for it to sink in) on how to understand df.apply stuff.

Strictly speaking, the df.apply operates on a row. If a df looks like this and we want to flag high temperature rows, we might set up a function called flag_high_temps.

state	temperature
CA	90
NV	95
OR	80
TX	90

def flag_high_temps(col):
   return col >= 90

The apply is operating on each row. It's looking at the temperature column, but going row-wise, and checking whether 90 >= 90 (True), 95 >= 90 (True), 80 >= 90 (False), etc.

So here, I've kept the function exactly as it was, but simply relabeled the argument and indicated the type is integer.

def flag_high_temps(temperature_value: int):
   return temperature_value >= 90

Where col is typically used is in conjunction with a df. Now, the lambda statement when I use flag_high_temps(df, "temperature") says, look at df and for each row, find the column named temperature, and if that column value is >= 90, return True. If not, return False. Now, col can have the type hint of str, because "temperature" is the name of the column and that name is a string.

def flag_high_temps(df: pd.DataFrame, col: str):
   return df.apply(lambda row: True if row[col] >= 90 else False, axis=1)
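
Putting the pieces together on the example table above, here is a sketch showing that the value-level function and the DataFrame-level wrapper produce the same flags. The helper name flag_high_value and the new column names are made up for this illustration. The last line uses a plain vectorized comparison, which is a different technique from apply but gives the same answer for a simple threshold.

```python
import pandas as pd

# The example table from this comment.
df = pd.DataFrame(
    {
        "state": ["CA", "NV", "OR", "TX"],
        "temperature": [90, 95, 80, 90],
    }
)


def flag_high_value(temperature_value: int) -> bool:
    # Works on one value at a time.
    return temperature_value >= 90


def flag_high_temps(df: pd.DataFrame, col: str) -> pd.Series:
    # Works on the whole DataFrame: looks up `col` in each row.
    return df.apply(lambda row: row[col] >= 90, axis=1)


# Compute all three versions before attaching them as columns.
scalar_flags = df["temperature"].apply(flag_high_value)
rowwise_flags = flag_high_temps(df, "temperature")
vectorized_flags = df["temperature"] >= 90  # no apply needed for a simple comparison

df["is_high_scalar"] = scalar_flags
df["is_high_rowwise"] = rowwise_flags
df["is_high_vectorized"] = vectorized_flags
```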

tiffanychu90 and others added 30 commits June 13, 2024 20:31

add refactor concepts, notes

258d87d

started NB for all refactor work

50b56d7

renamed old NBs to seperate from current work

0326a43

started new bus_cost_utils.py to start dropping in shared functions. …

da829dd

…also started adding updated scripts to draft NB

testing new function to flag if cpb is outlier

de61eae

comparing outliers in new and old DFs

16fc0df

improved cpb aggregate function

f6a8867

tested new cpb_aggregate function against old version. bus count, tot…

359e707

…al cost and cpb number still match! also had to edit the nb script file to read in old/ folder in gcs

testng new ways to reduce variables in favor of pivot tables

0b239b9

comparing pivot tables against new cpb agg function. updated input an…

82d6181

…d outfile file names for scripts. testing calculating cost per bus on merged df to identify/remove outliers first, then trying the new agg function

ran updated scripts, everything exported to gcs with no errors. new m…

173694d

…ethod of finding outliers first is working. reran all the agg by __ functions and they all work and match

made sure charts and graphs are still workinh. started work on trimmi…

cd2c247

…ng down the summary section

more changes

519b07b

switching the weighted average caclulation for average cost per bus a…

1887b14

…nd for charts

more organization

7312850

started writing conclusion. created new function to min/max values of…

877d41f

… a column in a df

consolidated some of the summary cells down to 1 cell per section

8ece2d7

small edits

167d356

more organizing of cells and creating headings for better navigation

dc485c4

turned zeb projects list to a variable, updated all the proceeding va…

b9bc6c1

…riables, graphs and functions to the zeb projects variable. also more organizing

added bus size chart that excluded the not-specified responses

8b58b8e

final changes before overwriting initial scripts

2ea2948

overwrote fta cleaner script. double checked and ensured script is go…

9ba8be8

…od to go and is exporting to GCS

overwrote TIRCP cleaner script. ran with no errors, files saving to G…

5786212

…CS. GTG

overwrote dgs cleaner script. ran with no errors. wrote to GCS. GTG

d0b4691

added min max summary and outlier flag to utils file. cpb cleaner scr…

b44a240

…ipt ran no errors. files saved to GCS. GTG

started to copy over cells, functions, variables and tables to the fi…

50fa908

…nal NB. still need to get charts moved over

minor bug fixed for markdown to work in final nb

c2cedd7

moved charts over to final NB

2a4120e

moved min max function to NB. reorganized the charts and disabled the…

7be2f2c

… overall charts

csuyat-dot added 6 commits June 26, 2024 19:05

updating Makefile with additional commands, was able to run makefile …

d021015

…to execute the final NB and export a html

full run of Makefile. analysis nb now shows mainly ZEB metrics

13ac4ac

updated output file name for TIRCP cleaner to be consistent with othe…

a433c2c

…r files.

update readme

492cbc1

removed old, initial exploratory notebooks

95787fb

left notes on refacor_notes

8322540

csuyat-dot marked this pull request as ready for review June 28, 2024 21:22

csuyat-dot merged commit 29b1c93 into main Jun 28, 2024
3 checks passed

csuyat-dot deleted the bus_cost_refactor branch June 28, 2024 21:29

tiffanychu90 reviewed Jul 1, 2024
