Interpolated segments for rt_segment_speeds #1084

edasmalchi · 2024-04-25T21:34:35Z

Adds interpolated segments (between stops where stop spacing >1km) to rt_segment_speeds pipeline.

Currently in draft form for review before moving to scripts.

Proposed Upstream Changes:

Should probably happen upstream in cut_stop_segments.py, related scripts...

length: float, geometry.length
next_stop_sequence: lead of stop_sequence; should include the final stop sequence (unavailable here, since this draft shifts values from the existing df...)
- alternatively, rename stop_sequence -> stop_sequence1 and add stop_sequence2 (consistent with existing stop_id1 and stop_id2)
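
To illustrate the next_stop_sequence proposal above, here is a minimal plain-Python sketch. It assumes the full ordered list of stop_sequence values for a trip is available upstream, which is what makes the final stop's value recoverable. The function name is hypothetical and is not part of cut_stop_segments.py; on a DataFrame the same lead would typically be a grouped shift(-1).

```python
def pair_stop_sequences(stop_sequences):
    """Pair each stop_sequence with the one that follows it (its "lead")."""
    ordered = sorted(stop_sequences)
    # One pair per segment; the final stop only ever appears as a "next" value.
    return list(zip(ordered[:-1], ordered[1:]))


# Made-up, non-consecutive stop sequences for a single trip.
trip_stop_sequences = [1, 2, 5, 9]
segments = pair_stop_sequences(trip_stop_sequences)
print(segments)
```

Because the pairs are built from the stops rather than from existing segment rows, the last segment keeps a real next_stop_sequence instead of a missing value.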

@tiffanychu90 let me know if those look doable and if you'd like to add those or have me give it a try.

This script (to be based on notebook)

Note that only about 6% of statewide segments are long enough to interpolate
4 new functions (split_distance, process_exploded, store_new_geoms, lookup_geom)
column changes:
1. stop_sequence increment proportional to segment distance within arbitrary stop sequence increment
2. segment_id postfix _(int) per segment to maintain uniqueness
"Expanding" rows based on their geometry while modifying only some columns, in order, is a little awkward. This uses gdf.explode(), but the post-processing requires that the gdf stay in order, which seemed hard to do with Dask.
Even without parallelization, this runs in ~12min for the entire state, which seems reasonable to me.
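
As a rough illustration of the splitting step described above, the sketch below computes where a long segment would be cut and assigns a suffixed segment_id to each piece. It is not the notebook's split_distance or process_exploded code: the function names, the handling of the remainder as a shorter final piece, leaving short segments unsuffixed, and the `_` separator are all assumptions.

```python
def cut_points(length_m, max_len_m=1000.0):
    """Distances along a segment at which to cut it into pieces <= max_len_m."""
    if length_m <= max_len_m:
        return [0.0, float(length_m)]
    n_full = int(length_m // max_len_m)
    cuts = [i * max_len_m for i in range(n_full + 1)]
    # Assumption: any remainder becomes one shorter final piece.
    if cuts[-1] < length_m:
        cuts.append(float(length_m))
    return cuts


def explode_segment(segment_id, length_m, max_len_m=1000.0):
    """Return (segment_id, start_m, end_m) rows, in order along the segment."""
    cuts = cut_points(length_m, max_len_m)
    if len(cuts) == 2:
        # Short segment: nothing to interpolate, id left unchanged (assumption).
        return [(segment_id, cuts[0], cuts[1])]
    return [
        (f"{segment_id}_{i}", start, end)
        for i, (start, end) in enumerate(zip(cuts[:-1], cuts[1:]))
    ]


for row in explode_segment("A__B", 2500):
    print(row)
```

Each (start_m, end_m) pair could then be turned into a geometry with something like shapely.ops.substring on the original LineString; that step is left out here to keep the sketch dependency-free.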

@tiffanychu90 curious to know your general thoughts/feedback. Also happy to find time to pair next week!

If this integrates well with the rest of the pipeline, the goal would be to replace the current stop segments product with this version that adds interpolated segments. That would allow us to retire much of the old speedmaps rt_delay code and fully align speedmaps + open data...

github-actions · 2024-04-25T21:36:14Z

nbviewer URLs for impacted notebooks:

rt_segment_speeds/30_interpolated_segments.ipynb

tiffanychu90

Yeah, I think all those suggestions look good and fit well with column naming in gtfs-segments. Overall, I think there's a good outline of the pieces needed for standardizing the pipeline. Thanks for checking it out!

I like the rough check of just <1,000 m and filter those to go through more postprocessing
When you say: column changes: stop_sequence increment proportional to segment distance within arbitrary stop sequence increment...this would show like stop_sequence=1.75 situated between stop_sequences 1 and 2, but maintain a numeric column?
- follow-up: this is still a TODO, with some notes in 31_interpolated_segments_with_new_stops.ipynb.
  - I think my idea of a numeric stop_sequence1 might be unnecessarily complicated.
  - I'm not sure if just counting the segment 0, 1, 2 will be sufficient downstream if you want to aggregate across trips.
  - Having stop_sequence=2 and next_stop_sequence=2 might not be enough either to distinguish between the segments, and is the same as the above option.
Can I play with how to expand the rows with gdf.explode()? The functions so far don't look like they could use this create segments function and this explode_segments function, but maybe they lend themselves to it once I can see the df itself and the groupings. At any rate, no dask is needed, since the segments are cut with pandas + gtfs_segments.
- follow-up finding: yes, the 2 functions in geography_utils get the same results, statewide, in under 3 min. They do the gdf.explode as you want.
I can clean up a notebook and check it in next week. It's like a toy example of what the dfs should look like at each stage, and I used it for road segments to see if it'll create errors in future steps like nearest_neighbor, stop_arrivals, and speeds.
- follow-up finding: In 32_nearest_neighbor_setup.ipynb, I tested the immediate connection to nearest neighbor. The subsequent steps are less likely to error once the setup is the same. It all works as of now, so fingers crossed for the remaining steps.

edasmalchi · 2024-04-29T22:45:45Z

Thanks for the review and the examples! Busy with SB125 today/tomorrow, but I'll dig into it more later this week.

When you say: column changes: stop_sequence increment proportional to segment distance within arbitrary stop sequence increment...this would show like stop_sequence=1.75 situated between stop_sequences 1 and 2, but maintain a numeric column?

Short answer: yes. Long answer: there are two things going on here. First, stop_sequence, per the spec, is a non-negative integer that "must increase along the trip but [need not] be consecutive". Second, the distances between stops more than 1000m apart also vary: could be 2000m, could be 20km.

So first of all, by going to a float we can continue to make it possible to sort on stop_sequence regardless of whether the original stop_sequence increases from 1 to 2 (an increment of 1), or 24 to 67 (an increment of 43). It also provides a flag of which segments are interpolated (those where either stop_sequence is a non-integer).

By proportional, I thought it would be cool if within that increment, stop_sequence increased by an amount proportional to distance (the number of 1000m interpolated segments). So if stop_sequence originally goes from 1 to 2, and there's a 10km gap between stops (10 interpolated segments), the new stop_sequence would be something like 1, 1.1, 1.2, 1.3, 1.4, ..., 2 (adding 1/10 each time).

If stop_sequence originally goes from 24 to 67 and there's, say, a 3km gap between stops, the new stop_sequence would be something like 24, 38.33, 52.67, 67 (adding 43/3 each time).

That way you can not only sort in the correct order, but roughly understand how far between the two stops an interpolated segment is, regardless of the actual stop_sequence increment or distance.
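
The arithmetic above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not code from the notebook: the function names and the rounding to 2 decimal places are assumptions.

```python
def interpolated_sequences(start_seq, end_seq, n_segments):
    """Return (stop_sequence, next_stop_sequence) pairs for n_segments pieces."""
    step = (end_seq - start_seq) / n_segments
    # Rounding to 2 decimals is an assumption made for readability.
    bounds = [round(start_seq + i * step, 2) for i in range(n_segments)]
    bounds.append(float(end_seq))
    return list(zip(bounds[:-1], bounds[1:]))


def is_interpolated(stop_sequence, next_stop_sequence):
    """Flag a segment as interpolated if either sequence value is a non-integer."""
    both_whole = (
        float(stop_sequence).is_integer()
        and float(next_stop_sequence).is_integer()
    )
    return not both_whole


# 3km gap between stops with stop_sequence 24 and 67: step of 43/3.
for seq, next_seq in interpolated_sequences(24, 67, 3):
    print(seq, next_seq, is_interpolated(seq, next_seq))
```

One caveat with using non-integer values as the flag: if the increment divides evenly (say 24 to 27 over 3 segments), every value is a whole number and the check above would not mark those pieces as interpolated.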

Couple good examples in the test data:

tiffanychu90 · 2024-05-01T23:02:35Z

@edasmalchi: Ok! Taking everything into consideration:

I've tried keeping a float stop_sequence1 that will capture the increments proportionally based on distance.
I see what you mean about stop_sequence not being consecutive across trips. Using stop_pair (the concatenation stop_id1__stop_id2) would capture which stops flank the segment. I think stop_pair + stop_sequence + stop_sequence1 would be enough to merge segments with speeds and correctly grab the row.
Where I'm at on my branch right now: stepping through the speed calculations all looks fine, but I want to double-check how to group when averaging, and whether/how to handle the same segment_id.
- segment_id is constructed as stop_pair + segment_sequence (the suffix increases to track how many 1km segments there are).
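
A small sketch of why the three-part key matters when merging segments with speeds. Everything here is illustrative: the rows, the speed_mph column, and the helper names are made up, and the real pipeline would do this as a DataFrame merge rather than a dict lookup.

```python
def make_stop_pair(stop_id1, stop_id2):
    """Concatenate the flanking stop ids, as in stop_id1__stop_id2."""
    return f"{stop_id1}__{stop_id2}"


def merge_key(row):
    return (row["stop_pair"], row["stop_sequence"], row["stop_sequence1"])


pair = make_stop_pair("A", "B")

# Three interpolated pieces between the same two stops (made-up values).
segments = [
    {"stop_pair": pair, "stop_sequence": 24, "stop_sequence1": 24.0},
    {"stop_pair": pair, "stop_sequence": 24, "stop_sequence1": 38.33},
    {"stop_pair": pair, "stop_sequence": 24, "stop_sequence1": 52.67},
]

# Hypothetical speeds keyed the same way.
speeds = [
    {"stop_pair": pair, "stop_sequence": 24, "stop_sequence1": 24.0, "speed_mph": 18.0},
    {"stop_pair": pair, "stop_sequence": 24, "stop_sequence1": 38.33, "speed_mph": 31.0},
    {"stop_pair": pair, "stop_sequence": 24, "stop_sequence1": 52.67, "speed_mph": 27.0},
]

speed_by_key = {merge_key(r): r["speed_mph"] for r in speeds}
merged = [{**seg, "speed_mph": speed_by_key.get(merge_key(seg))} for seg in segments]

for row in merged:
    print(row)
```

With only stop_pair and stop_sequence, all three pieces would share one key; the float stop_sequence1 is what separates them. Exact matching on floats works here because both sides use identical values, which a real merge would need to guarantee (for example by rounding consistently).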

I think we can both merge PRs in, and you can find the files I left for doing a full run for Mar 2024 for all operators.

edasmalchi · 2024-05-01T23:14:03Z

Thanks so much @tiffanychu90!! Merging both, and I'll look at the files from your run later this week 🙂

edasmalchi added 11 commits March 27, 2024 23:38

working concept

892befb

clunky but fast dask preprocess

749dac9

start working on abstraction

7317d95

wip

108f065

working apply accumulate!

2040b28

test apply distinct

d45422e

cleanup, working framework pending post-explode formatting

2ad1e13

fix arrowize

7b3e3e3

tidy nb

62b2133

get down to 1 impure function

9f75d86

store result on gcs

483a41d

edasmalchi requested a review from tiffanychu90 April 25, 2024 21:34

edasmalchi requested a review from KatrinaMKaiser April 25, 2024 21:39

tiffanychu90 approved these changes Apr 26, 2024

tiffanychu90 mentioned this pull request Apr 29, 2024

add 2 notebooks testing out interpolated segments with nearest neighbor #1086

Merged

clarify temporary script

866aa6a

edasmalchi marked this pull request as ready for review May 1, 2024 23:11

edasmalchi merged commit 27a7695 into main May 1, 2024
2 checks passed

edasmalchi deleted the interpolated-segments branch May 1, 2024 23:14

tiffanychu90 mentioned this pull request May 10, 2024

Research Request - Speedmap segments for GTFS analytics monthly pipeline #1109

Closed

4 tasks
