Bus cost refactor #1161
…also started adding updated scripts to draft NB
…al cost and cpb number still match! also had to edit the nb script file to read in old/ folder in gcs
…d outfile file names for scripts. testing calculating cost per bus on merged df to identify/remove outliers first, then trying the new agg function
…ethod of finding outliers first is working. reran all the agg by __ functions and they all work and match
…ng down the summary section
… a column in a df
…riables, graphs and functions to the zeb projects variable. also more organizing
…od to go and is exporting to GCS
…ipt ran no errors. files saved to GCS. GTG
…nal NB. still need to get charts moved over
…to execute the final NB and export a html
@@ -2,7 +2,7 @@
import pandas as pd
import shared_utils
from calitp_data_analysis.sql import to_snakecase

from bus_cost_utils import *
I would change the imports to be written either as import bus_cost_utils, calling functions like bus_cost_utils.new_prop_finder(), or, if you really only need a couple of things, as imports of specific functions, like from bus_cost_utils import new_prop_finder.
Generally, being clear about where a function is imported from is crucial: as soon as you have multiple utils modules, you can easily lose track of where something lives. In a universe with a lot of functions, writing from pandas import * and from numpy import * would create clashes, because the two might even share functions with the same name (sum, mean, etc.)!
If you can change your import statements, that would make it very clear which functions belong where!
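To make the name-clash point concrete, here is a minimal sketch. It uses the standard-library math and cmath modules as stand-ins (an assumption for illustration; they are not part of this PR) because both define functions with the same names, so a star import of both would silently keep whichever came last.

```python
import cmath
import math

# Public names defined by BOTH modules. After `from math import *` followed by
# `from cmath import *`, each of these would refer to the cmath version only.
shared_names = sorted(
    name
    for name in set(dir(math)) & set(dir(cmath))
    if not name.startswith("_")
)

# With module-qualified calls, the origin of each function is explicit.
real_root = math.sqrt(4)        # float square root
complex_root = cmath.sqrt(-4)   # complex square root
```

The same reasoning applies to bus_cost_utils: a call written as bus_cost_utils.new_prop_finder() tells the reader where the function lives without searching the imports.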
# data[data[col1] == data[col1].min()][col_list]
# )

def outlier_flag(col):
def outlier_flag(df: pd.DataFrame, col: str) -> pd.Series:
    # This applies the lambda function for you already, and also works in the absolute value
    # There are 2 ways to write this that do the same thing
    return df.apply(lambda x: True if abs(x[col]) > 3 else False, axis=1)
OR something like (double check):
    df[col].apply(lambda x: True if abs(x) > 3 else False)
# Also, if you don't like booleans, you can add `.astype(int)` and it'll change True/False to 1/0 (in that order)
In your notebook:
df_agg["new_is_cpb_outlier"] = outlier_flag(df_agg, "new_zscore_cost_per_bus")
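As a quick check that the two spellings agree, here is a small runnable sketch. The z-score values are made up for illustration, and the last variant (a plain vectorized comparison) is an alternative I am adding, not something taken from the PR.

```python
import pandas as pd


def outlier_flag(df: pd.DataFrame, col: str) -> pd.Series:
    """Flag rows whose z-score in `col` is more than 3 away from zero."""
    return df.apply(lambda x: abs(x[col]) > 3, axis=1)


# Toy data, invented for this example only.
df_agg = pd.DataFrame({"new_zscore_cost_per_bus": [0.2, -3.5, 1.1, 4.0]})

# Row-wise apply on the DataFrame.
row_wise = outlier_flag(df_agg, "new_zscore_cost_per_bus")

# Element-wise apply on the single column.
column_wise = df_agg["new_zscore_cost_per_bus"].apply(lambda x: abs(x) > 3)

# Vectorized comparison, no lambda needed (alternative, not from the PR).
vectorized = df_agg["new_zscore_cost_per_bus"].abs() > 3

# Booleans converted to 1/0.
as_int = row_wise.astype(int)
```

All three produce the same flags; the vectorized form is usually the shortest to read when only one column is involved.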
The .apply takes a lambda function. There are 2 ways to write it, and which one you use depends on what you need to access in the row. If you need just 1 column (the z-score), you can write it either way: df[col].apply(...) hands the lambda each value of that column, while df.apply(..., axis=1) hands it each row. If you need to access 2 column values, you will have to write it like df.apply(lambda x: some condition, axis=1).
Here, we want to access 2 columns in the lambda condition (state and temperature):
Ex: df.apply(lambda x: 1 if ((x.state == "CA") and (x.temperature < 80)) else 0, axis=1)
The difference in syntax is that you place the .apply in a different place, and the axis=1 (operate on each row) argument is present.
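The two-column example can be run on a toy table like this. The data and the column names mild_ca and mild_ca_vectorized are hypothetical, chosen only to illustrate the pattern; the boolean-mask version is an alternative spelling, not something the PR uses.

```python
import pandas as pd

# Toy data, invented for this example only.
df = pd.DataFrame(
    {
        "state": ["CA", "CA", "NV", "CA"],
        "temperature": [72, 95, 60, 79],
    }
)

# Row-wise apply: the lambda receives one row at a time (axis=1),
# so it can read both the state and temperature values.
df["mild_ca"] = df.apply(
    lambda x: 1 if (x.state == "CA") and (x.temperature < 80) else 0,
    axis=1,
)

# Equivalent vectorized form: combine two boolean masks with `&`.
df["mild_ca_vectorized"] = (
    (df["state"] == "CA") & (df["temperature"] < 80)
).astype(int)
```

Both columns end up identical; the mask version avoids the per-row Python call and tends to be faster on large tables.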
Overall, great refactor! A lot more clear and simple to follow. Everything under the
Strictly speaking, the
The apply is operating on each row. It's looking at the So here, I've kept the function exactly as it was, but simply relabeled the argument and indicated the type is integer.
Where
Revisited the Bus procurement cost analysis to refactor/clean up all the scripts and final notebook.
Overall steps taken this round of refactor
bus_cost_utils module to move all the common functions and variables.
/old folder.