Commit 6a3f8fd

notes from meeting w tiffany
1 parent 6edadcd commit 6a3f8fd

File tree

2 files changed: +306 −5 lines

2 files changed

+306
-5
lines changed

gtfs_digest/17_cardinal_dir_pipeline.ipynb

Lines changed: 262 additions & 0 deletions
@@ -2197,6 +2197,268 @@ (all added lines: seven new cells, appended after the existing `test.shape` cell)

```python
route_cols = [
    "schedule_gtfs_dataset_key",
    "route_id",
    "direction_id"
]
```

```python
import sys
sys.path.append("../gtfs_funnel")
import schedule_stats_by_route_direction
```

```python
route_dir_metrics = schedule_stats_by_route_direction.schedule_metrics_by_route_direction(
    test, analysis_date, route_cols)
```

```python
ROUTE_TYPOLOGIES = GTFS_DATA_DICT.schedule_tables.route_typologies
```

```python
route_typologies = pd.read_parquet(
    f"{SCHED_GCS}{ROUTE_TYPOLOGIES}_{analysis_date}.parquet",
    columns = route_cols + [
        "is_coverage", "is_downtown_local",
        "is_local", "is_rapid", "is_express", "is_rail"]
)
```

```python
route_dir_metrics2 = pd.merge(
    route_dir_metrics,
    route_typologies,
    on = route_cols,
    how = "left"
)
```

```python
route_dir_metrics2.head().drop(columns = ['geometry'])
```

Output of the last cell:

```
                  schedule_gtfs_dataset_key route_id  direction_id common_shape_id  \
0          36b8fbf12e4adc76b21651462b200860      569          1.00           p_859
1          36b8fbf12e4adc76b21651462b200860      569          1.00           p_859
2          36b8fbf12e4adc76b21651462b200860      569          1.00           p_859
3          36b8fbf12e4adc76b21651462b200860      569          0.00           p_867
4          36b8fbf12e4adc76b21651462b200860      569          0.00           p_867

   route_name  avg_scheduled_service_minutes  avg_stop_miles  n_trips  \
0  Sacramento                          94.00            2.61        2
1  Sacramento                          94.00            2.61        1
2  Sacramento                          94.00            2.61        1
3  Sacramento                          87.50            3.46        2
4  Sacramento                          87.50            3.46        2

  time_period  frequency  is_coverage  is_downtown_local  is_local  is_rapid  \
0     all_day       0.08         1.00               0.00      0.00      0.00
1     offpeak       0.06         1.00               0.00      0.00      0.00
2        peak       0.12         1.00               0.00      0.00      0.00
3     all_day       0.08         1.00               0.00      0.00      0.00
4        peak       0.25         1.00               0.00      0.00      0.00

   is_express  is_rail
0        0.00     0.00
1        0.00     0.00
2        0.00     0.00
3        0.00     0.00
4        0.00     0.00
```

gtfs_digest/_cardinal_direction_notes.md

Lines changed: 44 additions & 5 deletions
```diff
@@ -12,8 +12,6 @@ most upstream step (somewhere in gtfs_funnel)
 data processing (situate your steps somewhere, either in gtfs_funnel, though it may be elsewhere; you'd want to programmatically rename the file slightly, maybe just +_test, so you can grab the newly processed data easily through each stage)
 * What does "upstream step" mean?
 * What's the difference between all your work in `data-analyses/gtfs_funnel` and `gtfs_funnel/merge_data.py`?
-* What file do you use to run everything to make the `operator_profiles` and `operator_routes_map`?
-* So I have situated my step, but what am I supposed to programmatically rename?
 > every output needs a new suffix while you're testing, and then you point your input to that. (AH note: see the suffix sketch after this hunk)
 otherwise, you'd have to run all the dates (with schedule data, that's not a big deal, but 2 min here and there over 30-40 dates does add up).
 stop_times -> aggregate to route_dir -> schedule_data_for_date1 (AH note: this is how the dataset is constructed for one date)
@@ -27,16 +25,16 @@ stop_times -> aggregate to route_dir -> schedule_data_for_date1_test
 time-series = schedule_data_for_date1_test, schedule_data_for_date2_test, schedule_data_for_date3_test
 digest can pull from a test file with just 3 dates to see if it has everything.
 once you're happy with it, you have to run all the scripts that are affected (schedule_stats_by_route_direction, plus anything downstream that uses that intermediate output), all the way through to digest, for all the dates in rt_dates.y2023_dates and rt_dates.y2024_dates
-
 * Where do I find `aggregate to route_dir`?
-* What does `digest` mean?
 * `Test file`: is this like saving the outputs into GCS like 'may_2024_test.parquet'?
+* How do I know which scripts are affected by `schedule_stats_by_route_direction`?
 
 > yes, but before the meeting, can you go through and identify in a notebook or text file what you think is the "batch" from upstream to downstream?
 script 1, input file is data1, output file is data2
 script 2, input file is data2, output is data3
 and so forth, until you reach the digest, wherever you think the end script is (AH note: see the batch sketch at the end of these notes)
 * What is the "batch"?
+* What do you mean by "reach the digest"?
 * [Run everything using this Makefile](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_funnel/Makefile)
 * [schedule_stats_by_route_direction.py](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_funnel/schedule_stats_by_route_direction.py#L15)
 * `assemble_scheduled_trip_metrics`
```
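(AH note: a minimal sketch of the `_test` suffix workflow quoted above. The `SCHED_GCS` value, `SUFFIX` constant, `export_path` helper, and file name are hypothetical, just to show every output getting a suffix and the next step pointing its input at the suffixed file.)

```python
SCHED_GCS = "gs://bucket/schedule/"  # placeholder, not the real bucket path
SUFFIX = "_test"  # flip to "" once testing is done

def export_path(name: str, analysis_date: str) -> str:
    # every output gets the suffix while you're testing
    return f"{SCHED_GCS}{name}{SUFFIX}_{analysis_date}.parquet"

# ...and the next script downstream points its *input* at the same suffixed file:
print(export_path("schedule_route_dir", "2024-05-22"))
# -> gs://bucket/schedule/schedule_route_dir_test_2024-05-22.parquet
```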
```diff
@@ -56,4 +54,45 @@ and so forth, until you reach the digest, wherever you think the end script is
 * [Here](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_funnel/schedule_stats_by_route_direction.py#L118C19-L120) add `test` behind the name and [here](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_funnel/schedule_stats_by_route_direction.py#L122) change the `analysis_date_list` to only a few dates. (AH note: sketch after this hunk)
 * [Here](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_digest/merge_data.py#L25) add `_test`
 * [Edit here](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_digest/merge_data.py#L212) so I only run a few dates.
-
+* [Only run above here](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_digest/merge_data.py#L270-L272)
+* Once I'm happy with this, there's a GCS Public Bucket. That's the very, very late step, after I deploy everything.
```
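(AH note: a hypothetical shape of the "only a few dates" edit from the bullets above; `rt_dates.y2023_dates` and `rt_dates.y2024_dates` are the lists named earlier in these notes, and the `shared_utils` import path is an assumption.)

```python
from shared_utils import rt_dates  # assumption: rt_dates imports from shared_utils

# full run (all dates): analysis_date_list = rt_dates.y2023_dates + rt_dates.y2024_dates
# while testing, trim to a handful so each pass is fast:
analysis_date_list = rt_dates.y2024_dates[:3]
```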
```diff
+#### Terms
+* Batch: a combination of various scripts that are run together.
+  * Look at the MAKEFILE and compare it to Mermaid.
+* Upstream: the least processed.
+* Warehouse: could be unzipped files.
+  * For me, it's something downloaded from the mart.
+  * We process these unprocessed files with scripts, called by helper functions.
+* Downstream: the most final product.
+  * The most downstream is the digest: whatever is done in the GTFS Digest.
+* Digest: simply a compilation by dates; it collects the columns Tiffany wants.
+  * Select the columns they want.
+  * Not all the scripts in the Makefile need to be matched.
+* Funnel: when Tiffany is downloading stuff.
+  * Funneling the least processed tables through various scripts to transform them.
+  * There are 5 tables that are transformed repeatedly into different products.
+  * Everything that is in `gtfs_funnel`.
+* 3 big workstreams: RT Segment Speeds, Schedule, RT vs Schedule.
+  * Schedule is also processed in GTFS Funnel.
+  * Schedule data informs a lot of the interpretation of the vehicle positions.
+  * We don't know anything about a vehicle position on its own (what route, shape, or direction it is); that is all found in the Schedule data we connect with the GTFS key.
+  * Trips is the main table of the Schedule universe.
+* Helper functions just pull dataframes that are already downloaded.
+  * Downloading happens in another step.
+* Tiffany runs the MAKEFILE in gtfs_funnel.
+  * Then she goes to various other folders and runs everything (HQTA, RT Segment Speeds, etc.).
+  * Everything at the end goes into the GTFS Digest.
+* GTFS Digest is trying to represent everything through the grain of route-direction.
+  * Every month Tiffany runs one day.
```
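(AH note: one hypothetical way to write down the "batch" that the earlier "script 1 → script 2" quote asks for: each script with its input and output, upstream to downstream. Script and file names here are placeholders, not the real pipeline.)

```python
# Placeholder inventory of the "batch": (script, input file, output file), in run order.
batch = [
    ("script1.py", "data1", "data2"),
    ("script2.py", "data2", "data3"),
    ("merge_data.py", "data3", "digest_tables"),  # ends at the digest
]

for script, inp, out in batch:
    print(f"{script}: {inp} -> {out}")
```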
