A program that reduces the number of Manhattan roads in New Jork City while optimizing various characteristics, such as the total length of routes. It uses heuristic algorythm, so the solutions are not necessarily globally optimal.
There are two ways to run the program. The simpler way (in terms of the number
of commands that need to be entered) is to execute the make
command in the
project's root directory:
$ make [mode]
The mode
can be either cli
or gui
. In case no mode
is given, the
program will run in cli mode. This command (when run the first time) downloads
the datasets, preprocess them, and runs the program, including building the
cache. It may take significant amount of time, but the subsequent runs will be
much faster. The downside of this way of running the program is that there is
no way of specifying the algorithm's parameters, such as iteration count,
weights, etc.
Another way is to execute the following commands:
$ ./getdata.sh
$ python -m venv .venv
$ . .venv/bin/activate # platform dependent
$ pip install -r requirements.txt
$ python -m manhattan [options]
The third command is dependent on the operating system and the system shell.
For more information, visit
https://docs.python.org/3/library/venv.html#how-venvs-work.
A list of available options with their descriptions is available by executing
the python -m manhattan --help
command.
The following datasets are used by the program:
- CSCL: A dataset containing information about New York City's road infrastrukture, such as location and shape of each street, traffic directions, and road types. Although this dataset covers the road infrastrukture of all of New York City, only roads located in the Manhattan district are considered (the rest is filtered out in the preprocessing step).
- YTTD: A dataset generated from taxi trips, including information on e.g. the pick-up and drop-off locations and the length of each route. The pick-up and drop-off locations are given as taxi zone identifiers. The records do not contain information about the exact routes of the trips, i.e. what roads were traversed during a given trip.
- TZ and TZL: Datasets containing information on taxi zones. These zones are disjoint and allow to unambiguously determine the belonging of any geographic point to each of them. Not every geographic point lying within New York City boundries belongs to any of the zones, e.g. bridges connecting Manhattan with other districts of New York are not covered by them.
The data extraction process involves extracting from the datasets only the information that is needed to generate solutions. All the resulting data is saved in the CSV format.
In case of the CSCL dataset, data on the shape of each road, its width, and the traffic direction are extracted. In addition, records for roads that are not within Manhattan, have not yet been build, are impassable, or are of a type other than street, highway, bridge, tunnel, driveway, or U-turn are removed.
From the YTTD dataset, routes whose start or end location is outside the Manhattan borough are filtered out, as well as those where the two locations (specified by taxi zone IDs) are the same. Data such as tip amount or payment type are also filtered out.
Taxi zone identifiers, names, and geometries are extracted from the TZ and TZL datasets. Data on zones outside the Manhattan borough are ignored.
Based on the CSCL, TZ, and TZL datasets, a directed graph is generated, representing the road infrastructure of Manhattan. The vertices in this graph represent points of the roads and contain information about the road identifier, geographic coordinates, and the taxi zone to which the point belongs (if any). When two or more points from the CSCL dataset have the same coordinates, they are mapped to the same vertex, in effect creating a representation of an intersection. Such vertices are assigned to multiple roads simultaneously. Edges represent road segments. An edge belongs to the road to which both of its vertices belong.
![]() |
---|
Visualization of the graph that is a model of the Manhattan road infrastructure. |
The graph generated in the previous step contains a redundant number of vertices, which is due, among other things, to the fact that the CSCL dataset maps the shape of the roads and not just the connections between them, which is unnecessary for this program. In order to reduce the time and memory complexity without significantly affecting the results, the number of vertices is reduced in a way that meets the following requirements:
- The length of any length-optimal taxi route does not change by more than half the sum of the length of the start road (i.e., the road containing the edge connecting the start vertex to the next vertex of the route) and the end road (i.e., the road containing the edge connecting the second-to-last route's vertex to the end vertex).
- The sequence of taxi zones, which include consecutive segments (i.e., sequences of consecutive vertices from a specific route) of any length-optimal route, does not change. This means that a taxi following the length-optimal route between vertices A and B, after the reduction process (assuming that the vertices were not removed during the process), passes through the same zones in the same order as if it had followed the length-optimal route between the same vertices before the reduction process. If either the starting or ending vertex has been removed, the new starting or ending vertex becomes the vertex closest to the removed vertex among those belonging to the same zone.
- The sequence of roads containing successive segments of any of the length-optimal routes does not change. This means that a taxi following the optimal route between vertices A and B after the reduction process passes through the same roads in the same order as if it had followed the optimal route between the same vertices before the reduction process.
- Removed vertices and their nearest neighbours always belong to the same road (and only to it) and the same zone, which guarantees that if the start or end vertex of a certain route has been removed, the updated route still starts and ends on the same roads. This prevents the removal of all vertices of a certain road and the removal of vertices that constitute intersections. This applies only to those vertices that can be part of the optimal route.
As a result of this process, the number of vertices decreases from 20617 to 10516 (a decrease of 99.05%). In the above description of the requirements, by length-optimal route is meant all optimal routes that can exist, not just the routes from the data set used.
![]() |
---|
Visualization of a fragment of the infrastructure model graph showing roads having a shape that is not a straight line. |
The collapsing process involves replacing groups of vertices with single vertices and updating edges accordingly, without violating complexity reduction requirements. As a result of this process, the total length of all edges does not change (not including inaccuracies due to the limited precision of floating point numbers). The vertex groups that can be collapsed are called collapsable groups. For a vertex group to be collapsable, the following criteria must be met:
- There must exist a valid route that contains all the vertices belonging to a given group, while not additionally containing any vertex that does not belong to that group. This ensures that, logically, a given group represents a continuous road section.
- All vertices in the group must have exactly two neighbours. This condition guarantees that none of the vertices in the group is an intersection representation.
- All vertices in the group must belong to the same road and taxi zone.
- All vertices in the group must have the same number of edges directed to neighbouring vertices (which do not have to belong to a given group), and this number can be either one or two.
- If each vertex in the group has one edge directed to any of its neighbouring vertices, then the neighbouring vertex cannot have an edge directed to that vertex. If condition 1 is additionally satisfied, this ensures that the group, along with its outer neighbours, represents a one-way road segment.
- If each vertex in the group has two edges facing neighbouring vertices, then each of the neighbours of any of these vertices must be connected by exactly one edge facing that vertex. If condition 1 is additionally satisfied, this ensures that the group, along with its outer neighbours, represents a two-way road segment.
Criteria 1, 2, and 3 ensure that vertice collapsion does not violate complexity reduction conditions, while criteria 4, 5, and 6 ensure that the collapsed vertices and their outer neighbours form a valid road. A valid road is a sequence of vertices in which the count and the directionality of the edges between each pair of directly consecutive vertices is the same, which means that a road can be either bidirectional or unidirectional, with the direction remaining the same throughout the whole segment in the case of a unidirectional road.
The group is collapsed in such a way that all vertices in the group along with the edges to which they are connected (regardless of the directionality of these edges) are removed, after which a new vertex and edges of the same total length are inserted, connected to the new vertex and those vertices outside the group that were adjacent to any of the removed vertices, without changing the number or directionality of the edges of the vertices outside the group.
![]() |
---|
Visualization of the effect of the vertex collapsion process on a fragment of the infrastructure model. |
In this step, isolated groups of vertices are removed. Isolated group is a group of vertices in which no vertex is connected to any vertex outside the group and the cardinality of the group does not exceed a certain fixed value. In the program, 128 was taken as the maximum count of an isolated group. Edges whose vertices belong to a given group are also removed in this step.
Redundancy in the number of vertices is also caused by dendrites, i.e., unisolated groups of connected vertices in which no vertex can be part of any length-optimal taxi route.
A group cannot be part of an optimal route if none of the vertices belong to any taxi zone (which ensures that none of the vertices in the group is a start or end location) and one of the following criteria is met:
- A single inner vertex (i.e., a vertex belonging to a given group) is connected to exactly one outer vertex (i.e., a vertex not belonging to a given group). All other inner vertices are connected only to other inner vertices. The outer vertex does not have to belong to the taxi zone. This criterion guarantees that if there is a certain taxi route to which a certain subset of vertices from a given group belong, the route has a start or end location that is one of the vertices of the group, or the route is not length-optimal (because the outer vertex would occur twice in the route). A group that meets the given criterion is an alpha group. The figure below shows an alpha dendrite. The inner vertices and edges are shown in black, while the rest are shown in grey.
- Many inner vertices are connected to outer vertices, with all edges being incoming edges, i.e. edges directed to the inner vertex. This guarantees that if there is a certain taxi route that includes a certain subset of vertices from a given group, the end location of the route must be one of the vertices from the group. The group that meets the criterion is the beta group. The figure below shows a beta dendrite. The inner vertices and edges are shown in black, while the rest are shown in grey.
- Many inner vertices are connected to outer vertices, with all edges being outgoing edges, i.e. edges directed to the outer vertex. This guarantees that if there is a certain taxi route to which a subset of the vertices in a given group belong, the starting location of that route must be one of the vertices in the group. The group that satisfies the given criterion is the gamma group. The figure below shows a gamma dendrite. The inner vertices and edges are shown in black, while the rest are shown in grey.
Dendrites are identified and removed in a greedy manner. Adding any external vertex to a group classified as a dendrite would make that group no longer a dendrite.
Based on the created infrastructure model and data from the YTTD dataset, interpolation of taxi routes is carried out, which involves determining a specific route. The starting and ending vertices are selected so that they are in the starting and ending taxi zones, respectively, and so that the length of the shortest route between them is as close as possible to that in the dataset. The result of the interpolation is the shortest route between the start and end vertices determined by the Dijkstra algorithm.
The quality of a particular solution is the degree to which its cost is close to zero. The cost of a solution is calculated as follows:
where
-
$L$ - total road length in miles; -
$A$ - total road area in square miles; -
$D$ - average route length in miles; -
$R$ - average road load; -
$L_N$ ,$A_N$ ,$D_N$ ,$R_N$ - normalized values of the parameters$L$ ,$A$ ,$D$ , and$R$ respectively; -
$w_{L}$ ,$w_{A}$ ,$w_{D}$ ,$w_{R}$ - parameter weights, respectively:$L_N$ ,$A_N$ ,$D_N$ and$R_N$ .
Road load is the ratio of the number of routes containing any vertex of a given road to the number of all roads.
The values of the weights are user-defined, and their sum must be
The described normalization in mathematical notation has the following form:
where:
-
$l_i$ - length of the $i$th road; -
$a_i$ - area of the $i$th road; -
$d_i$ - length of the $i$th route; -
$r_i$ - load of the $i$th road; -
$L_0$ ,$A_0$ ,$D_0$ ,$R_0$ - initial values of the parameters$L$ ,$A$ ,$D$ , and$R$ respectively; and
Road network optimization involves iteratively reducing the number of roads at the highest possible drop in cost. In addition, the optimization proceeds in a way that reduces the likelihood of eliminating all accesses to any of the buildings. This is accomplished by not including in the selection of roads for removal those roads that contained a vertex in common with any of the roads removed in previous iterations.
Each iteration of the algorithm consists of the following steps:
- A random selection of a certain user-determined number of roads. The selection is carried out in a way that excludes the appearance of duplicates. Roads containing a vertex that is part of any of the roads removed in previous iterations are not taken into account.
- Cost calculation of removing each of the selected roads separately. This requires re-interpolating taxi routes that contain a section of the road which removal cost is being calculated. In the case that either the starting or ending vertex is part of that road, a new starting or ending vertex is determined, respectively, that belongs to the same taxi zone and is closest to the vertex being replaced. Once the end vertices are determined, the shortest (in terms of length) route between them is determined, replacing the previous route. If it becomes impossible to update any of the routes after removing a particular road, the calculation of the cost of its removal is skipped and the road is removed from the pool of considered roads (it will not be considered in subsequent iterations). The operations performed in this step are temporary, i.e. they do not modify the original structures such as the graph or taxi trips.
- Removal of the road that gives the least-cost solution. All vertices that belong to a given road and only that road are removed. The vertices that represent the intersection of a given road with another are not removed. In addition, those edges whose at least one vertex is removed are eliminated.
- Taxi trips update that replaces the original routes with those determined in step 2 for the road removed in the previous step.
- Removal of roads containing a vertex that belonged to the removed road from the pool of considered roads.