Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bus cost refactor #1161

Merged
merged 36 commits into from
Jun 28, 2024
Merged

Bus cost refactor #1161

merged 36 commits into from
Jun 28, 2024

Conversation

csuyat-dot
Copy link
Contributor

Revisited the Bus procurement cost analysis to refactor/clean up all the scripts and final notebook.

Overall steps taken this round of refactor

  • created bus_cost_utils module to move all the common functions and variables.
  • adjusted cleaner scripts to reference moduel.
  • gave cleaned datasets consistent naming convention (raw, cleaned, bus only) and identical column names to merge on.
  • final, merged dataset contains columnsfor z-score and an outlier flag. 1 set was saved with outliers, another saved with out outliers.
  • used the merged dataset without outliers for the final analysis notebook to create all pivot tables, charts and variables.
  • streamlined final notebook to show mainly ZEB related metrics, used more charts and tables to display information and not be so narrative heavy.
  • deleted old, initial exploratory notebooks.
  • reorganized GCS folder by moving old initial exports to an /old folder.

tiffanychu90 and others added 30 commits June 13, 2024 20:31
…also started adding updated scripts to draft NB
…al cost and cpb number still match! also had to edit the nb script file to read in old/ folder in gcs
…d outfile file names for scripts. testing calculating cost per bus on merged df to identify/remove outliers first, then trying the new agg function
…ethod of finding outliers first is working. reran all the agg by __ functions and they all work and match
…riables, graphs and functions to the zeb projects variable. also more organizing
@csuyat-dot csuyat-dot marked this pull request as ready for review June 28, 2024 21:22
@csuyat-dot csuyat-dot merged commit 29b1c93 into main Jun 28, 2024
3 checks passed
@csuyat-dot csuyat-dot deleted the bus_cost_refactor branch June 28, 2024 21:29
@@ -2,7 +2,7 @@
import pandas as pd
import shared_utils
from calitp_data_analysis.sql import to_snakecase

from bus_cost_utils import *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change imports to be written either like import bus_cost_utils and use functions like bus_cost_utils.new_prop_finder() or only import specific functions (if you really only need a couple things) like from bus_cost_utils import new_prop_finder().

Generally, being clear on where a function is imported from is crucial, because as soon as you have multiple utils functions, you can easily lose track of where something is. In a universe with a lot of functions, writing from pandas import * and from numpy import * would create clashes because they might even share functions with the same name (sum, mean, etc)!`

If you can change your import statements, that would make functions / where they belong very clear!

# data[data[col1] == data[col1].min()][col_list]
# )

def outlier_flag(col):
Copy link
Member

@tiffanychu90 tiffanychu90 Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def outlier_flag(df: pd.DataFrame, col: str) -> pd.DataFrame:
   
   # This applies the lambda function for you already, also worked in the absolute value
   # There are 2 ways to write this, to do the same thing
   
   return df.apply(lambda x: True if abs(x[col]) > 3 
   else False, axis=1)
   OR something like (double check)
   df[col].apply(lambda x: True if abs(x) > 3 else False) 
   
   # Also, if you don't like booleans, you can do `.astype(int)` and it'll change True/False or 1/0 (in that order)

In your notebook:

df_agg["new_is_cpb_outlier"] = outlier_flag(df_agg, "new_zscore_cost_per_bus")

Copy link
Member

@tiffanychu90 tiffanychu90 Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .apply takes a lambda function, which operates on a row. There are 2 ways to write it, and this all depends on what you need to access in the row. If you have just 1 column (z-score), you can write it either way. If you need to access 2 column values, you will have to write it like df.apply(lambda x: some condition, axis=1)

Here, we want to access 2 columns in the lambda condition (state and temperature)
Ex: df.apply(lambda x: 1 if ( (x.state == "CA" ) and (x.temperature < 80) ) else 0, axis=1)

The difference in syntax is that you place the .apply in a different place, and there's also the axis=1 (operate on row) that's present.

@tiffanychu90
Copy link
Member

Overall, great refactor! A lot more clear and simple to follow. Everything under the if __name__ == "__main__": to create your datasets makes a lot of sense now, and I can follow which subsets had which transformations.

  • Carry forward: clear imports, now that you have your own functions, import them when you use them. You'll soon have more utility functions than just 1 script, so you'll want to make it easy for you and others to find where those functions are called from within a notebook.
  • This is a clarification note that you can store (it'll take some time for it to sink in) on how to understand df.apply stuff.

Strictly speaking, the df.apply operates on a row. If a df looks like this and we want to flag high temperature rows, we might set up a function called flag_high_temps.

state temperature
CA 90
NV 95
OR 80
TX 90
def flag_high_temps(col):
   return col >= 90

The apply is operating on each row. It's looking at the temperature column, but going row-wise, and checking whether 90 >= 90 (True), 95 >= 90 (True), (80 >= 90 (False)`, etc.

So here, I've kept the function exactly as it was, but simply relabeled the argument and indicated the type is integer.

def flag_high_temps(temperature_value: int):
   return temperature_value >= 90

Where col is typically used is in conjunction with a df. Now, the lambda statement when I use flag_high_temps(df, "temperature") says, look at df and for each row, find the column named temperature, and if that column value is >= 90, return True. If not, return False. Now, col can have the type hint of str, because "temperature" is the name of the column and that name is a string.

def flag_high_temps(df: pd.DataFrame, col: str):
   return df.apply(lambda row: True if row[col] >= 90 else False, axis=1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants