Commit 6a3f8fd

notes from meeting w tiffany
1 parent 6edadcd commit 6a3f8fd

File tree

2 files changed: +306 −5 lines

2 files changed

+306
-5
lines changed

gtfs_digest/17_cardinal_dir_pipeline.ipynb

Lines changed: 262 additions & 0 deletions
@@ -2197,6 +2197,268 @@ (all added lines: seven new cells, appended after the existing `test.shape` cell)

```python
route_cols = [
    "schedule_gtfs_dataset_key",
    "route_id",
    "direction_id"
]
```

```python
import sys
sys.path.append("../gtfs_funnel")
import schedule_stats_by_route_direction
```

```python
route_dir_metrics = schedule_stats_by_route_direction.schedule_metrics_by_route_direction(
    test, analysis_date, route_cols)
```

```python
ROUTE_TYPOLOGIES = GTFS_DATA_DICT.schedule_tables.route_typologies
```

```python
route_typologies = pd.read_parquet(
    f"{SCHED_GCS}{ROUTE_TYPOLOGIES}_{analysis_date}.parquet",
    columns = route_cols + [
        "is_coverage", "is_downtown_local",
        "is_local", "is_rapid", "is_express", "is_rail"]
)
```

```python
route_dir_metrics2 = pd.merge(
    route_dir_metrics,
    route_typologies,
    on = route_cols,
    how = "left"
)
```

```python
route_dir_metrics2.head().drop(columns = ['geometry'])
```

Output of the last cell:

```
                  schedule_gtfs_dataset_key route_id  direction_id common_shape_id  \
0          36b8fbf12e4adc76b21651462b200860      569          1.00           p_859
1          36b8fbf12e4adc76b21651462b200860      569          1.00           p_859
2          36b8fbf12e4adc76b21651462b200860      569          1.00           p_859
3          36b8fbf12e4adc76b21651462b200860      569          0.00           p_867
4          36b8fbf12e4adc76b21651462b200860      569          0.00           p_867

   route_name  avg_scheduled_service_minutes  avg_stop_miles  n_trips  \
0  Sacramento                          94.00            2.61        2
1  Sacramento                          94.00            2.61        1
2  Sacramento                          94.00            2.61        1
3  Sacramento                          87.50            3.46        2
4  Sacramento                          87.50            3.46        2

  time_period  frequency  is_coverage  is_downtown_local  is_local  is_rapid  \
0     all_day       0.08         1.00               0.00      0.00      0.00
1     offpeak       0.06         1.00               0.00      0.00      0.00
2        peak       0.12         1.00               0.00      0.00      0.00
3     all_day       0.08         1.00               0.00      0.00      0.00
4        peak       0.25         1.00               0.00      0.00      0.00

   is_express  is_rail
0        0.00     0.00
1        0.00     0.00
2        0.00     0.00
3        0.00     0.00
4        0.00     0.00
```

gtfs_digest/_cardinal_direction_notes.md

Lines changed: 44 additions & 5 deletions
```diff
@@ -12,8 +12,6 @@ most upstream step (somewhere in gtfs_funnel)
 data processing (situate your steps somewhere, either in gtfs_funnel, though it may be elsewhere; you'd want to programmatically rename the file slightly, maybe just +_test, so you can grab the newly processed data easily through each stage)
 * What does "upstream step" mean?
 * What's the difference between all your work in `data-analyses/gtfs_funnel` and `gtfs_funnel/merge_data.py`?
-* What file do you use to run everything to make the `operator_profiles` and `operator_routes_map`?
-* So I have situated my step, but what am I supposed to programmatically rename?
 > every output needs a new suffix while you're testing, and then you point your input to that. (AH note: see the suffix sketch after this hunk)
 otherwise, you'd have to run all the dates (with schedule data, that's not a big deal, but 2 min here and there over 30-40 dates does add up).
 stop_times -> aggregate to route_dir -> schedule_data_for_date1 (AH note: this is how the dataset is constructed for one date)
@@ -27,16 +25,16 @@ stop_times -> aggregate to route_dir -> schedule_data_for_date1_test
 time-series = schedule_data_for_date1_test, schedule_data_for_date2_test, schedule_data_for_date3_test
 digest can pull from a test file with just 3 dates to see if it has everything.
 once you're happy with it, you have to run all the scripts that are affected (schedule_stats_by_route_direction, plus anything downstream that uses that intermediate output), all the way through to digest, for all the dates in rt_dates.y2023_dates and rt_dates.y2024_dates
-
 * Where do I find `aggregate to route_dir`?
-* What does `digest` mean?
 * `Test file`: is this like saving the outputs into GCS like 'may_2024_test.parquet'?
+* How do I know which scripts are affected by `schedule_stats_by_route_direction`?
 
 > yes, but before the meeting, can you go through and identify in a notebook or text file what you think is the "batch" from upstream to downstream?
 script 1, input file is data1, output file is data2
 script 2, input file is data2, output is data3
 and so forth, until you reach the digest, wherever you think the end script is (AH note: see the batch sketch at the end of these notes)
 * What is the "batch"?
+* What do you mean by "reach the digest"?
 * [Run everything using this Makefile](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_funnel/Makefile)
 * [schedule_stats_by_route_direction.py](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_funnel/schedule_stats_by_route_direction.py#L15)
 * `assemble_scheduled_trip_metrics`
```
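(AH note: a minimal sketch of the `_test` suffix workflow quoted above. The `SCHED_GCS` value, `SUFFIX` constant, `export_path` helper, and file name are hypothetical, just to show every output getting a suffix and the next step pointing its input at the suffixed file.)

```python
SCHED_GCS = "gs://bucket/schedule/"  # placeholder, not the real bucket path
SUFFIX = "_test"  # flip to "" once testing is done

def export_path(name: str, analysis_date: str) -> str:
    # every output gets the suffix while you're testing
    return f"{SCHED_GCS}{name}{SUFFIX}_{analysis_date}.parquet"

# ...and the next script downstream points its *input* at the same suffixed file:
print(export_path("schedule_route_dir", "2024-05-22"))
# -> gs://bucket/schedule/schedule_route_dir_test_2024-05-22.parquet
```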
```diff
@@ -56,4 +54,45 @@ and so forth, until you reach the digest, wherever you think the end script is
 * [Here](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_funnel/schedule_stats_by_route_direction.py#L118C19-L120) add `test` behind the name and [here](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_funnel/schedule_stats_by_route_direction.py#L122) change the `analysis_date_list` to only a few dates. (AH note: sketch after this hunk)
 * [Here](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_digest/merge_data.py#L25) add `_test`
 * [Edit here](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_digest/merge_data.py#L212) so I only run a few dates.
-
+* [Only run above here](https://github.com/cal-itp/data-analyses/blob/ah_gtfs_portfolio/gtfs_digest/merge_data.py#L270-L272)
+* Once I'm happy with this, there's a GCS Public Bucket. That's the very, very late step, after I deploy everything.
```
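(AH note: a hypothetical shape of the "only a few dates" edit from the bullets above; `rt_dates.y2023_dates` and `rt_dates.y2024_dates` are the lists named earlier in these notes, and the `shared_utils` import path is an assumption.)

```python
from shared_utils import rt_dates  # assumption: rt_dates imports from shared_utils

# full run (all dates): analysis_date_list = rt_dates.y2023_dates + rt_dates.y2024_dates
# while testing, trim to a handful so each pass is fast:
analysis_date_list = rt_dates.y2024_dates[:3]
```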
```diff
+#### Terms
+* Batch: a combination of various scripts that are run together.
+  * Look at the MAKEFILE and compare it to Mermaid.
+* Upstream: the least processed.
+* Warehouse: could be unzipped files.
+  * For me, it's something downloaded from the mart.
+  * We process these unprocessed files with scripts, called by helper functions.
+* Downstream: the most final product.
+  * The most downstream is the digest: whatever is done in the GTFS Digest.
+* Digest: simply a compilation by dates; it collects the columns Tiffany wants.
+  * Select the columns they want.
+  * Not all the scripts in the Makefile need to be matched.
+* Funnel: when Tiffany is downloading stuff.
+  * Funneling the least processed tables through various scripts to transform them.
+  * There are 5 tables that are transformed repeatedly into different products.
+  * Everything that is in `gtfs_funnel`.
+* 3 big workstreams: RT Segment Speeds, Schedule, RT vs Schedule.
+  * Schedule is also processed in GTFS Funnel.
+  * Schedule data informs a lot of the interpretation of the vehicle positions.
+  * We don't know anything about a vehicle position on its own (what route, shape, or direction it is); that is all found in the Schedule data we connect with the GTFS key.
+  * Trips is the main table of the Schedule universe.
+* Helper functions just pull dataframes that are already downloaded.
+  * Downloading happens in another step.
+* Tiffany runs the MAKEFILE in gtfs_funnel.
+  * Then she goes to various other folders and runs everything (HQTA, RT Segment Speeds, etc.).
+  * Everything at the end goes into the GTFS Digest.
+* GTFS Digest is trying to represent everything through the grain of route-direction.
+  * Every month Tiffany runs one day.
```
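(AH note: one hypothetical way to write down the "batch" that the earlier "script 1 → script 2" quote asks for: each script with its input and output, upstream to downstream. Script and file names here are placeholders, not the real pipeline.)

```python
# Placeholder inventory of the "batch": (script, input file, output file), in run order.
batch = [
    ("script1.py", "data1", "data2"),
    ("script2.py", "data2", "data3"),
    ("merge_data.py", "data3", "digest_tables"),  # ends at the digest
]

for script, inp, out in batch:
    print(f"{script}: {inp} -> {out}")
```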
