Interpolated segments for rt_segment_speeds #1084

Merged
merged 12 commits into from
May 1, 2024


edasmalchi
Member

Adds interpolated segments (between stops where stop spacing >1km) to rt_segment_speeds pipeline.

Currently in draft form for review before moving to scripts.

Proposed Upstream Changes:

Should probably happen upstream in cut_stop_segments.py, related scripts...

  • length: float, geometry.length
  • next_stop_sequence: lead of stop_sequence, should include final stop seq (final stop seq unavailable here since shifting from existing df...)
    • alternatively, rename stop_sequence -> stop_sequence1 and add stop_sequence2 (consistent with existing stop_id1 and stop_id2)
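A minimal sketch of the proposed next_stop_sequence column, written with plain Python rows rather than the pipeline's actual dataframes. The `trip_id` grouping key and the function name are assumptions for illustration, not taken from cut_stop_segments.py. It shows why the lead has to be computed from the full stop list: the final stop of each trip has no following row, so its value is left empty here.

```python
from itertools import groupby
from operator import itemgetter


def add_next_stop_sequence(rows):
    """Attach the lead of stop_sequence within each trip (hypothetical helper).

    rows: list of dicts with at least "trip_id" and "stop_sequence".
    The last stop of each trip gets next_stop_sequence=None.
    """
    ordered = sorted(rows, key=itemgetter("trip_id", "stop_sequence"))
    out = []
    for _, group in groupby(ordered, key=itemgetter("trip_id")):
        stops = list(group)
        # Pair each stop with the one after it; pad the end with None.
        for current, following in zip(stops, stops[1:] + [None]):
            nxt = following["stop_sequence"] if following else None
            out.append({**current, "next_stop_sequence": nxt})
    return out


rows = [
    {"trip_id": "a", "stop_sequence": 24},
    {"trip_id": "a", "stop_sequence": 1},
    {"trip_id": "a", "stop_sequence": 67},
]
result = add_next_stop_sequence(rows)
```

With pandas, the same idea would likely be a grouped shift of stop_sequence by one row, computed before the final stop is dropped from the segments table.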

@tiffanychu90 let me know if those look doable and if you'd like to add those or have me give it a try.

This script (to be based on notebook)

  • Note that only about 6% of statewide segments are long enough to interpolate
  • 4 new functions (split_distance, process_exploded, store_new_geoms, lookup_geom)
  • column changes:
    1. stop_sequence increment proportional to segment distance within arbitrary stop sequence increment
    2. segment_id postfix _(int) per segment to maintain uniqueness
  • "expanding" rows based on their geometry while modifying only some cols in order is a little awkward. This uses gdf.explode(), but the post-processing requires that the gdf remain in order, which seemed hard to do with Dask
  • even without parallelization, runs ~12min for entire state which seems reasonable to me
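As a rough illustration of the splitting and segment_id changes listed above, here is a sketch that works on distances only, without geometries. The function names are hypothetical and are not the PR's split_distance or process_exploded. It assumes a cut every 1,000 m with a shorter final piece, and a zero-based integer suffix on every piece; the actual script may divide segments or number suffixes differently.

```python
import math


def split_offsets(length_m, max_len_m=1000.0):
    """Return (start, end) offsets in meters for each piece of a segment.

    Assumption: segments at or under max_len_m are left whole; longer ones
    are cut every max_len_m with a shorter remainder at the end.
    """
    length_m = float(length_m)
    if length_m <= max_len_m:
        return [(0.0, length_m)]
    n_pieces = math.ceil(length_m / max_len_m)
    return [
        (i * max_len_m, min((i + 1) * max_len_m, length_m))
        for i in range(n_pieces)
    ]


def suffixed_segment_ids(segment_id, n_pieces):
    """Append _(int) to a segment_id so each piece stays unique."""
    return [f"{segment_id}_{i}" for i in range(n_pieces)]


pieces = split_offsets(2500)
ids = suffixed_segment_ids("stopA__stopB", len(pieces))
```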

@tiffanychu90 curious to know your general thoughts/feedback. Also happy to find time to pair next week!

If this integrates well with the rest of the pipeline the goal would be to replace the current stop segments product with this version adding interpolated segments, allowing us to retire much of the old speedmaps rt_delay code and fully align speedmaps + open data...

@edasmalchi edasmalchi requested a review from tiffanychu90 April 25, 2024 21:34
Member

@tiffanychu90 tiffanychu90 left a comment


Yeah, I think all those suggestions look good and fit well with column naming in gtfs-segments. Overall, I think there's a good outline of the pieces needed for standardizing the pipeline. Thanks for checking it out!

  • I like the rough check of just >1,000 m and filtering those to go through more postprocessing
  • When you say: column changes: stop_sequence increment proportional to segment distance within arbitrary stop sequence increment...this would show like stop_sequence=1.75 situated between stop_sequences 1 and 2, but maintain a numeric column?
    • follow-up: this is still a TODO, with some notes in 31_interpolated_segments_with_new_stops.ipynb.
      • I think my idea of a numeric stop_sequence1 might be unnecessarily complicated.
      • I'm not sure if just counting the segment 0, 1, 2 will be sufficient downstream if you want to aggregate across trips.
      • Having stop_sequence=2 and next_stop_sequence=2 might not be enough either to distinguish between the segments, and is the same as the above option.
  • Can I play with how to expand the rows with gdf.explode()? The functions so far don't look like they could use the create segments and explode_segments functions, but maybe they lend themselves to it once I can see the df itself and the groupings. At any rate, no dask needed, since the segments are cut with pandas + gtfs_segments.
    • follow-up finding: yes, the 2 functions in geography_utils get the same results, statewide, in under 3 min. They do the gdf.explode as you want.
  • I can clean up a notebook and check it in next week. It's like a toy example of what the dfs should look like at each stage, and I used it for road segments to see if it'll create errors in future steps like nearest_neighbor, stop_arrivals, and speeds.
    • follow-up finding: In 32_nearest_neighbor_setup.ipynb, I tested the immediate connection to nearest neighbor. The subsequent steps are less likely to error once the setup is the same. It all works as of now, so fingers crossed for the remaining steps.

@edasmalchi
Member Author

edasmalchi commented Apr 29, 2024

Thanks for the review and the examples! Busy with SB125 today/tomorrow, but I'll dig into it more later this week.

When you say: column changes: stop_sequence increment proportional to segment distance within arbitrary stop sequence increment...this would show like stop_sequence=1.75 situated between stop_sequences 1 and 2, but maintain a numeric column?

Short answer: yes. Long answer: there are two things going on here. First, stop_sequence, per the spec, is a non-negative integer that "must increase along the trip but [need not] be consecutive". Second, the distances between stops more than 1000m apart also vary: could be 2000m, could be 20km.

So first of all, by going to a float we can continue to make it possible to sort on stop_sequence regardless of whether the original stop_sequence increases from 1 to 2 (an increment of 1), or 24 to 67 (an increment of 43). It also provides a flag of which segments are interpolated (those where either stop_sequence is a non-integer).

By proportional, I thought it would be cool if within that increment, stop_sequence increased by an amount proportional to distance (the number of 1000m interpolated segments). So if stop_sequence originally goes from 1 to 2, and there's a 10km gap between stops (10 interpolated segments), the new stop_sequence would be something like 1, 1.1, 1.2, 1.3, 1.4, ..., 2 (adding 1/10 each time).

If stop_sequence originally goes from 24 to 67 and there's, say, a 3km gap between stops, the new stop_sequence would be something like 24, 38.33, 52.67, 67 (adding 43/3 each time).

That way you can not only sort in the correct order, but roughly understand how far between the two stops an interpolated segment is, regardless of the actual stop_sequence increment or distance.
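The arithmetic above can be sketched in a few lines. This is an illustration of the idea only, not the code in the PR; the function names and the rounding to two decimals are assumptions.

```python
def interpolate_stop_sequence(start_seq, end_seq, n_segments):
    """Return stop_sequence values from start_seq to end_seq inclusive,
    stepping by (end_seq - start_seq) / n_segments (hypothetical helper)."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    step = (end_seq - start_seq) / n_segments
    return [round(start_seq + i * step, 2) for i in range(n_segments + 1)]


def is_interpolated(stop_sequence):
    """A non-integer stop_sequence marks an interpolated point."""
    return not float(stop_sequence).is_integer()


ten_km = interpolate_stop_sequence(1, 2, 10)
three_km = interpolate_stop_sequence(24, 67, 3)
# three_km is [24.0, 38.33, 52.67, 67.0]
```

Because the values stay strictly increasing between the two original integers, sorting on the float column keeps segments in travel order for any original increment.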

Couple good examples in the test data:
[Screenshot 2024-04-29 155002]
[Screenshot 2024-04-29 155116]

@tiffanychu90
Member

@edasmalchi: Ok! Taking everything into consideration:

  • I've tried keeping a float stop_sequence1 that will capture the increments proportionally based on distance.
  • I see what you mean about stop_sequence not being consecutive across trips. Using stop_pair (the concatenated stop_id1__stop_id2) would capture which stops flank the segment. I think stop_pair + stop_sequence + stop_sequence1 would be enough to merge segments with speeds and correctly grab the row.
  • Where I'm at on my branch right now: I think stepping through the speed calculations is all fine, but I would want to double check how to group when averaging, and whether/how to handle the same segment_id
    • segment_id is constructed as stop_pair + segment_sequence (the suffix here does increase to track the number of 1km segments there are).

I think we can both merge PRs in, and you can find the files I left for doing a full run for Mar 2024 for all operators.

@edasmalchi edasmalchi marked this pull request as ready for review May 1, 2024 23:11
@edasmalchi
Member Author

Thanks so much @tiffanychu90!! Merging both, and I'll look at the files from your run later this week 🙂

@edasmalchi edasmalchi merged commit 27a7695 into main May 1, 2024
2 checks passed
@edasmalchi edasmalchi deleted the interpolated-segments branch May 1, 2024 23:14