- Introduction
- Stored Data
- Preprocessing
- Computing Quickest/Shortest Path
- Answering Queries
- Efficiency and Feasibility
- References
- Conclusion
My version of Google Maps is designed keeping two most basic queries in mind. First, the most important query, which is the shortest path between two places and a major portion of the article is based on how exactly the program will calculate the shortest part for various types of pathfinding queries.
The second important query would to be find something like, "a restaurants" or a "shopping mall" within x kilometers of the current location.
Graphs are undoubtedly the best way to store data of such kind, specially when the most important and the most frequent query is "finding the shortest path between two nodes".
The only problem is, the graph size would be too huge and too complex if we implemented it on the scale of a globe.
To test the feasibility of the map on such a scale, I decided to manually create a graph for a comparitivately smaller section(not too small though), and the results were not surprising.
I took a section of map of middle India, and marked all the "visible" locations, and tried to draw a graph. I was only quarterly done with the process when I realised that it such a structure is not feasible, as it would be much more complex when extended to the scane of globe.
I was only quarterly done with the process when I realised that it such a structure is not feasible, as it would be much more complex when extended to the scale of globe. In the current example, even though a lot of nodes are not visible(because if we use only a single graph, then all small and big places should be counted as a node), still the graph is coming out to be complex.
I had to solve this problem. I knew I had to use the graph, for obvious reasons, but I needed to figure out a way to control the number of nodes, or rather divide the single graph with many nodes, into multiple graphs with lesser number of nodes, to control the size of the network, and prevent it from getting more complex.
When I was sure of dividing the network, the only question left was how?
I needed to maintain the hierarchy of the network as much as I needed the graphs.
So, I came up with an idea of using a hybrid data structure which would be a combination of a Tree and a Graph. The nature of the data structure and it's implementation in the Map Scene will be made clear by the image below.
So let's see what the data structure is, and more about the data attributes it would store.
The given tree would store the data in the hierarchical order of the divisions of the world viz.
- World
- Continents
- Countries
- States
- Cities
- Towns
- Local Places / Businesses
In each level, except the last level, each node would contain a graph containing all the immediate daughter nodes, as nodes of the graph. So the graph will act as one data attribute, and other data attributes associated with the node will include
- Name of the place (String)
- Description (String)
- Central coordinates (List)
- Parents (Dictionary)
- Category/Level (Int)
Let us take the example of "State of Uttar Pradesh" to understand what exactly
Attribute | Value |
---|---|
Name | Uttar Pradesh |
Description | Uttar Pradesh is a state in northern India ... |
Central coordinates | [26.8467 , 80.9462] |
Parents | {Country : India , Continent:Asia} |
Category | State |
This format will be followed for all the places, including continents, countries, states, cities, and towns. Inside towns, there will be a listing of local businesses, which will have a different set of data attributes like "Type of Business" , "Contact Info", "Rating", "Address" etc. Also, those listing will not contain any graph, as they lie last in the tree, and thus have no children.
Now, let us take a look at how these hierarchical graphs would look like.
Here is a small example of the "Asia" continent, and how a graph containing countries in Asia would look like
PS : Yes, some countries have not be included in this graph, this is just to give a basic idea of how the graph would look like.
Anyways, continent level is only added for the sake of uniformity, as a graph between countries makes little to no sense, because almost every country is connect to other via flights, so there is no concept of shortest path, as there is a direct path, and it will be shortest everytime.
Here is a graph indicating different states of India
Here also, some states have not been included and some edges are missing, but you get the idea, right?
Going down another step lower, let's take a look at state level and city level graph. Since we could consider New Delhi as both a state and a city, I am going to display a single graph for both.
This process is also pretty much similar to what we have been doing in the case of country and continents.
PS : Yes, this one looks ugly because of the Metro Network :P
IMPORTANT NOTE: All the maps which we have seen till now, join one place with another directly, without taking care of the roads connecting them.
But that is just for visualization purposes. The distances between two
places(of the same type) will not be direct but will be in accordance to
the road map of India or whichever country we are talking about.
For example, the distance between Uttar Pradesh and
Madhya Pradesh will be the distance of the road NH27 between them and
not the ariel distance. It is shown in the diagram only for demonstration purposes.
To get a more clear idea about getting routes using road maps, let us take a look at the town level graph
This is how the acutal graph creation will take place.
All the road intersections and bifurcations will serve as "Nodes" and their connections will be considered for the graph.
A smart road detection algorithm, along with the usage of the Road map will make it possible. We will be making use of different types of roads to establish routes between different types of places. For example, using national highways to create routes between states, and inter/intra state cities and using state highways for creating routes between cities and towns.
Let us estimate the total size of the database we would be creating. The data we have will be of two types.
Firstly, the visual portion of the map will be a very high resolution image which could to zoomed in (to view even the tiny buildings) and could be zoomed out (to view the entire world at once)
Secondly, the data which we would be storing would be in the structure we just talked about.
- About the image part, here is a calculation made in this article, which approximates the size of the image to be around 1500PB if it is to accomodate for all zoom levels and every other thing.
- About the data structure part, let us calculate:
The calculation will be done in two parts. First, the data stored in the top hierarchy levels, which would contain the graph, as their size would be more or less same. Second, the size of data of all business listings.
For that, we need to approximate the total number of nodes in the tree, then the approx size of one node, and then multiply them. (All the data given below is approximate, and we have tried to take the upper level of approximation, so that we may overestimate, but not underestimate)
1 World (1)
7 Continents (17 = 7)
28 Countries per continent (1728 = 196)
30 states per country (172830 = 5880)
15 cities per state (17283025 = 88200)
50 towns/villages per city (17283015*50 = 4410000)
Total nodes - 4410000 + 88200 + 5880 + 196 + 7 + 1 = 4504284 nodes
Let us assume 100 vertices and 1000 edges in each node(generally it will be much less, only in cases of towns, where we take each road intersection as a vertex), then approximately each node will occupy around 2kb of space, at max.
So total space will be 2*4504284 = 9008568 Kb, which is around 8.59 GB
The number of such nodes will be very high, but each will consume much less data, as they do not have to store the graph, and only details of business.
If we assume around 300 businesses, per town/village, we will have 1323000000 businesses and places all over the world. If each place consumes 0.2Kb of data, then total data would amount to 252 GBs
So total data to be stored in this Data Structure would be around 260GBs at max.
Where will we store them?
Well 260GB is not much data to be stored for an application of the scale of Google Maps, so storing the data would not be a problem.
The creation of the entire data structure will be a preprocessing task. This data will be updated whenever changes are detected, and will be fetched for answering queries.
Once we have the structure preprocessed, we also need to compute the road based distances between two connected nodes in the graph. We also need to calculate the euclidean distance between the two places as we will be using it as the heuristic function for the shortest path calculation.
The second part will be pretty much simple, because we will already be having the road length data from the road map, and for the euclidean distance, we would have the geographical coordinates of both the places, and calculating the distance would be a cakewalk.
So let us talk about the first part. First and foremost, we need to obtain the data, and the organize it according to the hierarchy.
Let's say that we have a combined version of all the images obtained from the satellites. We still need to mark boundries, create divisions and name places.
Since this is a manual task, we would either do it on our own, or outsource the task. For local places and small businesses, Google uses My Business Listing and relies on User's inputs for unexplored places, we would also go down the same line.
In simple words, we would do the initial data creation task manually, and then use crowdsourced data for the expansion.
Alternatively, we could also look for some open-source database which would contain all the data we need, and then use it.
When we think of path finding algorithms, the first thing which comes on our minds(or on the google search results) is Dijkstra's algorithm, and it makes sense, because Dijkstra's is one of the fast and most efficient path finding algorithm we have.
Well, we have already processed most of the possible graphs, and reduced their sizes as much as possible, so now the pathfinding algorithm would not consume much time in the ideal scenarios, even if we skip using any heuristic.
But, we can further improve the computation time, by using a slightly modified version of Dijkstra's which employs the use of a heuristic function along with the cost function.
Yes, I am talking about A* algorithm.
Here is a 2 minute youtube video which helps you visualize how much A* is faster and more efficient than Dijkstras.
So now the question arises, which heuristic function?
Next we look for the heuristic function, or rather an admissible heuristic, so that the results are not only fast, but also optimal.
A heuristic function is considered optimal if it never overestimates the cost of reaching the goal.
Now the most obvious choice infront of us is the Euclidean Distance and we will be using it for two main reasons:
- It is an admissible heuristic, as the euclidean distance between two nodes will also be less than or equal to the exact path between two places.
- It is an easy to precompute heuristic, as we already would have the coordinates of the two nodes, and it would be easy to calculate the euclidean distance.
So while our cost function previously was f(n) = g(n) where g(n) was the actual cost of reaching the destination from a particular node, we change it to
f(n) = g(n) + h(n)
where g(n) remains the same, and h(n) is the estimated cost of reaching the goal, which also happens to be the euclidean distance.
This cost function is actually pretty basic, and would still need a lot more modifications to solve the problem when applied tor real world. For instance, we still need to incorporate a factor t(n) for the traffic, which might change the results. Other factors to include might probably be like, speed limits, road conditions, types of roads etc, which would be experimental, and that data could only be obtained, when actual users travel through some path. So it is a topic for another day.
Well, in the current scenario? NO ! Because so far we only have created a static model, which would just find the best possible route, and show it in the map.
But, we can always add a simple feature, which would detect if the suggested route is being taken. As soon as the route is changed, another function call to reroute would be triggered, and the program would start searching for the new route.
Now, let's talk about the rerouting algorithm. The change of route can only take place at a node(because, there is always a node present when there is a road intersection or bifurcation). So now, the next node in the change path would be considered for the reroute, and another graph containing the nodes till the destination would be created.
Then we would search for the shortest path, in the new graph, if we find one in the same direction the user is going, we show it in the map, otherwise, we ask the user to take a reverse turn, and use the previous path, as it is still the shortest one available.
If the user still continues to drive on the changed path, then the process would continue, till the path user is taking, becomes the one with shortest distance.
SO with this algorithm, YES. Rerouting would not take much time.
We have stored the data, and we have figured out which pathfinding algorithm to use. We also have a heuristic function set up, for fast results. Now we need to analyse how the algorithm works, for different types of queries.
I will divide these types of queries into two parts
- Between two places of same level, with the same parent, - For example, between two cities lying in the same state, or between two towns lying in the same city.
These type of queries would be the easiest to handle, as we would already be having a created graph for them. We just need to specify the source and the destination, and the A* alhorithm would do the job for us. - Between two places of same level, with different parents - For example, between two cities lying in different states, like between Kanpur(in UP) and Jaipur(in Rajasthan).
The time taken for such queries would be a little more as compared to the previous one, because we would not be having an inter state graph. So here is how such queries would be answered. A new graph would be created(taking the data which is already stored in the database) between the two places, considered all the road intersections, of major roads of that level (within a range, shown by the circle in this case).
Then the algorithm would be applied on the new map, to find the shortest path.
Handling such queries would be an easy job. We could easily get the current user's coordinates, which can be used to find all the businesses lying in a particular range, then the business type could be matched with the search parameter, and those matching with the query, would be served.
To add a simple feature, we could also sort the businesses by distance (after fetching them) and then display in order of nearest distance from the user.
Theoretically, it should lay out a decent performance, unless there is some other roadblock, which I am not able to think of at the moment.
Is it feasible? Apparantly yes, but one can not really say, because things at such scale would need a lot more optimizations and improvements.
- Graph data structure and its applications
- Tree data structre
- Dijkstra's vs A* algorithm
- Admissible heuristic functions
- Google Maps, Canva and MS Paint for designing images and graphics
Google Maps is an extremely efficient application, and a lot of planning and brainstorming has been put into it to create what it is today. Replicating it, is not an easy task, and definately not a single person task.
Here is a small effort to understand the process, and trying to do it the way I feel it works.