This is a sample project for exploring different ways of persisting geospatial events for long-term storage.
It is also an excuse to test various technologies for handling a massive amount of data, including different programming languages as well as cloud services.
- An organization would manage a fleet of vehicles that would report their current GPS position every 5 seconds.
- The vehicles would remain in the area of a specific city and would typically be assigned to a district of the city.
- The data should be archived for 7 years.
- The data should be queryable in order to find which cars were moving in a specific period, within a specific area of the city.
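As a rough illustration, a single reported position could be modeled along these lines (a minimal sketch; the field names are assumptions, not the project's actual schema):

// Hypothetical shape of one GPS report, emitted every 5 seconds per vehicle.
interface VehiclePositionEvent {
  vehicleId: string;   // unique vehicle identifier
  timestamp: string;   // ISO 8601, e.g. "2024-01-01T05:05:00Z"
  lat: number;         // latitude in decimal degrees
  lon: number;         // longitude in decimal degrees
  district?: string;   // city district the vehicle is assigned to
}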
Execute the estimate.py script in the event-estimator folder to compute the estimates.
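As a rough, purely hypothetical illustration of the volume involved: a fleet of 1,000 vehicles reporting every 5 seconds produces 86,400 / 5 × 1,000 = 17,280,000 events per day, i.e. roughly 6.3 billion events per year and about 44 billion events over the 7-year retention period.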
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| Topic | Consumer Group | Event Type | Produced by | Consumed by | Usage |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| generation | generators | start-generation | viewer | generator | The viewer will initiate a generation |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| generation.broadcast | | stop-generation | viewer | generator | The viewer will cancel any active |
| | | | | | generation before starting a new one. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| generation.agent.* |generation-agents | generate-partition | generator | generator | The generator will partition its work |
| | | | | | with its agents. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| requests.collector |requests-collector| clear-vehicles-data | generator | collector | The generator will ask the collector to |
| | | | | | clear all the files before starting the |
| | | | | | generation. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| inbox.generator.*       | generators       | generate-partition-stats   | generator      | generator       | The generator agents report the stats of |
|                         |                  |                            |                |                 | their generated partitions.              |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| inbox.generator.* | generators | clear-vehicles-data-result | collector | generator | The collector will report on the |
| | | | | | completion of the destruction. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| commands.move.*         |                  | move                       | generator      | collector       | The collector will aggregate the move    |
| | | | | | commands and persist them as a chunk. |
| | | | +-----------------+------------------------------------------+
| | | | | viewer | The viewer will display the move of |
| | | | | | each vehicle. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| commands.flush.* | | flush | generator | collector | At the end of the generation, a flush |
| | | | | | command is sent to force the collectors |
| | | | | | to write the accumulated data. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| events | | aggregate-period-created | collector | viewer | Every time a chunk of data is persisted |
| | | | | | by the collector, some stats on the chunk|
| | | | | | will be sent to the viewer. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| events | | vehicle-generation-started | generator | viewer | Clear the map in the viewer when a new |
| | | | | | generation begins. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| query.vehicles | vehicle-finder | vehicle-query | viewer | finder | A client is querying the persisted data. |
| | | | | | The finder will filter the chunks based |
| | | | | | on the time period and the geohashes of |
| | | | | | the polygon filter. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
|query.vehicles.partitions| vehicle-finder- | vehicle-query-partition | finder | finder | The finder is delegating the file |
| | partitions | | | | processing to the cluster of finders. |
| | | | | | The response will be sent using the |
| | | | | | vehicle-query-partition-result-stats evt |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| inbox.viewer.<UID> | | vehicle-query-result | finder | viewer | While parsing the chunks, the finder |
| | | | | | will send all the move commands that |
| | | | | | match the criteria. The viewer will then |
| | | | | | be able to replay them. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| inbox.viewer.<UID> | | vehicle-query-result-stats | finder | viewer | Once the query is complete, the finder |
| | | | | | will send some stats to the viewer, to |
| | | | | | measure the performance and processing |
| | | | | | that was required. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| inbox.finder.<UID> | | vehicle-query- | finder | finder | This is a partial result sent back to |
| | | partition-result-stats | | | the finder that delegated the |
| | | | | | processing. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
Note that when a consumer uses a "consumer group" name, it means that the message will be handled only once by a member of the group. This is a regular work queue with competing consumers, which is different from the default pub/sub case.
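As a sketch of that distinction with the nats.js client (the subject and payload below mirror the table above, but the code itself is illustrative, not the project's actual implementation), a queue subscription makes members of the same group compete for messages:

// Sketch only: competing consumers on the query.vehicles topic using a NATS queue group.
import { connect, JSONCodec } from "nats";

const jc = JSONCodec();
const nc = await connect({ servers: "localhost:4222" });

// Every finder instance subscribes with the same queue group name,
// so each vehicle-query message is handled by exactly one of them.
const sub = nc.subscribe("query.vehicles", { queue: "vehicle-finder" });
(async () => {
  for await (const msg of sub) {
    const query = jc.decode(msg.data);
    console.log("handling vehicle-query", query);
  }
})();

// Without a queue group (plain pub/sub), every subscriber would receive the message.
nc.publish("query.vehicles", jc.encode({ type: "vehicle-query", polygon: [] }));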
sequenceDiagram
viewer ->>+ generator: topic:generation/type:start-generation
activate generator-agent
generator ->> generator-agent: topic:generation.agent.*/type:generate-partition
loop generate move events
activate collector
generator-agent ->> collector: topic:commands.move.*/type:move
loop aggregate events
collector ->> collector: bufferize, then flush
collector -->> viewer: topic:events/type:aggregate-period-created
end
deactivate collector
end
generator-agent ->> generator: topic:inbox.generator.*/type:generate-partition-stats
deactivate generator-agent
note right of generator: force a flush at the end of the generation
generator ->>+ collector: topic:commands.flush.*/type:flush
collector -->>- viewer: topic:events/type:aggregate-period-created
generator ->>- viewer: topic:inbox.viewer.*/type:generation-stats
sequenceDiagram
viewer ->>+ finder: topic:query.vehicles/type:vehicle-query
loop read event files
loop for each matching position
finder ->> viewer: topic:inbox.viewer.*/type:vehicle-query-result
end
end
finder ->>- viewer: topic:inbox.viewer.*/type:vehicle-query-result-stats
sequenceDiagram
viewer ->>+ finder: topic:query.vehicles/type:vehicle-query
par for each matching event file
finder ->>+ finder-agent: topic:query.vehicles.partitions/type:vehicle-query-partition
loop for each matching position
finder-agent ->> viewer: topic:inbox.viewer.*/type:vehicle-query-result
end
finder-agent ->>- finder: topic:inbox.finder.*/type:vehicle-query-partition-result-stats
end
finder ->>- viewer: topic:inbox.viewer.*/type:vehicle-query-result-stats
- web UI for viewing realtime data or query results
- distributed services using a message broker
- durable storage of events
- distributed event generation with multiple instances for higher throughput
- multiple data formats (parquet, csv, json, arrow)
- multiple storage providers (filesystem, S3...)
- flexible aggregation of events
  - max window capacity
  - concurrent time windows
  - multiple data partitioning strategies (geohash, vehicle id...)
- flexible search
  - serialized or parallelized
  - record limit
  - timeout
  - ttl
  - time filters (time range)
  - geoloc filters (polygons) using GeoJSON (see the sketch after this list)
  - data filters (vehicle type)
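To illustrate the geohash partitioning and GeoJSON polygon filtering ideas, here is a minimal, hypothetical sketch using the ngeohash and turf libraries (the precision, coordinates, and polygon are assumptions, not the project's actual values):

// Sketch only: derive a geohash partition key for a position and
// check whether the position falls inside a GeoJSON polygon filter.
import ngeohash from "ngeohash";
import { point, polygon } from "@turf/helpers";
import booleanPointInPolygon from "@turf/boolean-point-in-polygon";

// Partition key: a 5-character geohash groups nearby positions together.
const partitionKey = ngeohash.encode(45.5019, -73.5674, 5); // "f25dv"

// Polygon filter expressed as GeoJSON (coordinates are [lon, lat]).
const searchArea = polygon([[
  [-73.58, 45.49],
  [-73.55, 45.49],
  [-73.55, 45.52],
  [-73.58, 45.52],
  [-73.58, 45.49], // the ring must be closed
]]);

const matches = booleanPointInPolygon(point([-73.5674, 45.5019]), searchArea);
console.log(partitionKey, matches); // "f25dv" true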
+------------+
Event Generator-------| Event Hub |------Event Viewer
| (NATS) |
+------------+
| |
+-----+ +-----+
| |
Event collector Event finder
| | |
+----------+ writes reads
| | |
+-------------+ +-----------------------+
| Event Store | | Event Aggregate Store |
| (In memory) | | (File system or S3) |
+-------------+ +-----------------------+
- Event Hub: NATS
- Event Generator: TypeScript NodeJS
- Event Viewer: Nuxt server, TypeScript, PixiJS
- Event Collector: TypeScript NodeJS, https://pola.rs/, Parquet format, S3, geohash
- Event Store: In memory, DuckDB
- Event Finder: TypeScript NodeJS, geohash, turf
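For the collector side, here is a minimal sketch (not the project's actual code; the column names and output file name are assumptions) of persisting a buffered chunk of move events as Parquet with nodejs-polars:

// Sketch only: buffer move events in memory, then flush them as one Parquet chunk.
import pl from "nodejs-polars";

// Hypothetical buffered move events for one time window / geohash partition.
const moves = [
  { vehicleId: "v-001", timestamp: "2024-01-01T05:05:00Z", lat: 45.5019, lon: -73.5674, geohash: "f25dv" },
  { vehicleId: "v-002", timestamp: "2024-01-01T05:05:05Z", lat: 45.5031, lon: -73.5612, geohash: "f25dv" },
];

const df = pl.DataFrame({
  vehicleId: moves.map((m) => m.vehicleId),
  timestamp: moves.map((m) => m.timestamp),
  lat: moves.map((m) => m.lat),
  lon: moves.map((m) => m.lon),
  geohash: moves.map((m) => m.geohash),
});

// Writing one file per time window and geohash keeps later queries selective.
df.writeParquet("moves-chunk.parquet");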
Inspired by https://learn.microsoft.com/en-us/azure/stream-analytics/event-hubs-parquet-capture-tutorial
+-----> Event Viewer
| + NuxtJS
Event Generator ==> Event Hub -------------+ + PowerBI
+ Azure Event Hubs |
+----> Event Collector -------> Event Aggregate Store <------ Event finder
+ Azure Stream Analytics + Azure Blob + Azure Synapse
See https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview and https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-enable-through-portal and https://learn.microsoft.com/en-us/azure/stream-analytics/event-hubs-parquet-capture-tutorial
You should create a Blob storage account with the following attributes:
- Azure Blob Storage or Azure Data Lake Storage Gen 2
- Enable hierarchical namespace = true
Then you can create your new Azure Synapse Analytics resource and use the previously created Storage Account.
See also the tutorial that demonstrates all the steps: https://www.youtube.com/watch?v=WMSF_ScBKDY
- Go to your Azure Synapse portal.
- Select the "Develop" section.
- Click the "+" button and select "SQL script".
- Name it "configure_db".
- Paste the following script and execute it:
CREATE DATABASE vehicles;
GO;
USE vehicles;
-- You should generate a new password
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<REDACTED>';
CREATE DATABASE SCOPED CREDENTIAL VehicleDataCredential
WITH
IDENTITY = 'SHARED ACCESS SIGNATURE',
-- you should copy the SAS token configured for your Storage Account, in the "Shared access signature"
SECRET = 'sv=2022-11-02&ss=b&srt=co&sp=rlf&se=2024-12-31T22:42:44Z&st=2024-12-15T14:42:44Z&spr=https&sig=<REDACTED>';
CREATE EXTERNAL DATA SOURCE VehicleDataEvents WITH (
-- you should replace morganvehicledata with the name of your Storage Account.
LOCATION = 'wasbs://events@morganvehicledata.blob.core.windows.net',
CREDENTIAL = VehicleDataCredential
);
GO;
The strategy is to use the right BULK filter in order to select only the files that contain the data, based on the time range and the geohash list.
- Go to your Azure Synapse portal.
- Select the "Develop" section.
- Click the "+" button and select "SQL script".
- Name it "events_per_file".
- Paste the following script and execute it:
USE vehicles;
SELECT ev.filename() as filename, COUNT(*) as event_count
FROM
OPENROWSET(
BULK '2024-01-01-*-*-*-*.parquet',
DATA_SOURCE = 'VehicleDataEvents',
FORMAT='PARQUET'
) AS ev
WHERE
ev.filepath(1) in ('05', '06', '07')
AND ev.filepath(3) in ('f257v', 'f25k6', 'f25se', 'f25ss')
AND ev.timestamp >= '2024-01-01T05:05:00'
AND ev.timestamp < '2024-01-01T07:05:00'
GROUP BY
ev.filename()
ORDER BY
1
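In these queries, ev.filename() returns the name of the underlying file for each row, and filepath(n) returns the value matched by the nth '*' wildcard in the BULK path pattern; here the first wildcard appears to carry the hour and the third one the geohash, so entire files can be skipped before the timestamp predicate is even evaluated.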
- Go to your Azure Synapse portal.
- Select the "Develop" section.
- Click the "+" button and select "SQL script".
- Name it "get_events".
- Paste the following script and execute it:
USE vehicles;
SELECT ev.*
FROM
OPENROWSET(
BULK '2024-01-01-*-*-*-*.parquet',
DATA_SOURCE = 'VehicleDataEvents',
FORMAT='PARQUET'
) AS ev
WHERE
ev.filepath(1) in ('05', '06', '07')
AND ev.filepath(3) in ('f257v', 'f25k6', 'f25se', 'f25ss')
AND ev.timestamp >= '2024-01-01T05:05:00'
AND ev.timestamp < '2024-01-01T07:05:00'
ORDER BY
ev.timestamp
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-parquet-files https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-specific-files https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/tutorial-data-analyst
Use a service principal and/or workload identities instead of SAS tokens.
- Event Hub: Azure Event Hubs
- Event Generator: TypeScript NodeJS
- Event Viewer: Nuxt server, TypeScript, PixiJS
- Event Collector: Azure Stream Analytics
- Event Store: Azure Event Hubs
- Event Finder: Azure Synapse Analytics
+-----> Event Viewer
| + NuxtJS
Event Generator ==> Event Hub -------------+
+ AWS Eventbridge |
+----> Event Collector -------> Event Aggregate Store <------ Event finder
+ Amazon Data Firehose + AWS S3 + Amazon Athena
- Event Hub: Amazon EventBridge
- Event Generator: TypeScript NodeJS
- Event Viewer: Nuxt server, TypeScript, PixiJS
- Event Collector: Amazon Data Firehose
- Event Store: Amazon EventBridge
- Event Finder: Amazon Athena
- JSON
- CSV
- Arrow
- Parquet
Create a script that computes the cost projections for the data that will be aggregated, stored, and queried.
- Use https://github.com/cloudevents/spec
- Support auto reconnect when the server is not available
- Support multiple messaging systems such as RabbitMQ, AWS SQS or Azure Event Hubs
- Support event persistence
- Support retry in message handling
- Use long-term storage to persist events until they can be aggregated (instead of just using memory)
- Implement consistent hashing for partitioning geohashes across the instances (see the sketch after this list)
- Rewrite in Rust for speedup
- Use a mapping widget to display the vehicles?
- Try other Web frameworks?
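As a hedged illustration of the consistent hashing idea mentioned above (a generic sketch, not the project's design; the instance names, hash choice and replica count are assumptions), a small hash ring can map geohash partitions to instances so that adding or removing an instance only remaps a fraction of the keys:

// Sketch only: a minimal consistent hash ring mapping geohash partitions to instances.
import { createHash } from "node:crypto";

class HashRing {
  // Sorted list of (position, instance) points on the ring.
  private ring: { pos: number; node: string }[] = [];

  constructor(nodes: string[], private replicas = 50) {
    for (const node of nodes) this.add(node);
  }

  private hash(key: string): number {
    // Use the first 8 hex chars of a SHA-1 digest as a 32-bit ring position.
    return parseInt(createHash("sha1").update(key).digest("hex").slice(0, 8), 16);
  }

  add(node: string): void {
    // Virtual nodes spread each instance around the ring for better balance.
    for (let i = 0; i < this.replicas; i++) {
      this.ring.push({ pos: this.hash(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.pos - b.pos);
  }

  remove(node: string): void {
    this.ring = this.ring.filter((p) => p.node !== node);
  }

  // Find the first ring point at or after the key's position (wrapping around).
  lookup(key: string): string {
    const pos = this.hash(key);
    const point = this.ring.find((p) => p.pos >= pos) ?? this.ring[0];
    return point.node;
  }
}

// Hypothetical usage: route move events by geohash to a collector instance.
const ring = new HashRing(["collector-0", "collector-1", "collector-2"]);
console.log(ring.lookup("f25dv")); // stays mostly stable when instances join or leave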
TODO: add a CLI that would listen to stats events and display aggregated stats.
TODO
TODO
Note that you can also use Docker for local development if you prefer. Then jump to the next "Deployment/Docker" section.
Python3 for running some scripts.
python3 -m venv .venv
.venv/bin/pip3 install -r scripts/requirements.txt
NodeJS LTS should be installed.
NATS server should be running. See README for the instructions.
bash scripts/nodejs/cleaup.sh
bash scripts/nodejs/install.sh
You can use the dev mode with hot reload, but it is much slower than the production build when there are a lot of vehicles:
cd ./event/viewer/web/nuxt
npm run dev
open http://localhost:3000/
Or you can build a production release and run it (this is much more performant):
cd ./event/viewer/web/nuxt
npm run build
npm run preview
open http://localhost:3000/
.venv/bin/python3 scripts/nodejs/start.py
- docker
- docker-compose
- kubectl
- nodejs
bash scripts/nodejs/docker-build.sh
The following script will start a docker-compose stack. Note that we run a single instance of each component in this mode.
bash scripts/nodejs/docker-run.sh
cd deployment/kubernetes
npm i
npm run synth
Select the right Kubernetes cluster (either with the KUBECONFIG env var, or with a kubectl context).
kubectl apply -f dist/*
Wait for everything to be created.
Browse to http://vehicle-fleet-viewer.kube.lab.ile.montreal.qc.ca/
If you want to clean up the Kubernetes resources:
kubectl delete -f dist/0001-vehicles.k8s.yaml
kubectl delete -f dist/0000-vehicles-ns.k8s.yaml
https://medium.com/@igorvgorbenko/geospatial-data-analysis-in-clickhouse-polygons-geohashes-and-h3-indexing-2f55ff100fbe#:~:text=H3%20Indexing,-H3%20indexing&text=Similar%20to%20geohashes%2C%20a%20longer,occupy%20a%20fixed%208%20bytes.
https://medium.com/data-engineering-chariot/aggregating-files-in-your-data-lake-part-1-ed115b95425c https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html https://deepak6446.medium.com/why-did-we-move-from-mongodb-to-athena-with-parquet-297b61ddf299
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-real-time-fraud-detection https://www.red-gate.com/simple-talk/cloud/azure/query-blob-storage-sql-using-azure-synapse/
https://hivekit.io/blog/how-weve-saved-98-percent-in-cloud-costs-by-writing-our-own-database/