This is a sample project for exploring different ways of persisting geospatial events for long-term storage.
It is also an excuse to test various technologies for handling a massive amount of data, including different programming languages as well as cloud services.
- An organization would manage a fleet of vehicles that would report their current GPS position every 5 seconds.
- The vehicles would remain in the area of a specific city and would typically be assigned to a district of the city.
- The data should be archived for 7 years.
- The data should be queryable in order to find which cars were moving in a specific period, within a specific area of the city.
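As a rough illustration, a single reported position could be modeled along these lines (a minimal sketch; the field names are assumptions, not the project's actual schema):

// Hypothetical shape of one GPS report, emitted every 5 seconds per vehicle.
interface VehiclePositionEvent {
  vehicleId: string;   // unique vehicle identifier
  timestamp: string;   // ISO 8601, e.g. "2024-01-01T05:05:00Z"
  lat: number;         // latitude in decimal degrees
  lon: number;         // longitude in decimal degrees
  district?: string;   // city district the vehicle is assigned to
}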
Execute the estimate.py script in the event-estimator folder to compute the estimates.
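As a rough, purely hypothetical illustration of the volume involved: a fleet of 1,000 vehicles reporting every 5 seconds produces 86,400 / 5 × 1,000 = 17,280,000 events per day, i.e. roughly 6.3 billion events per year and about 44 billion events over the 7-year retention period.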
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| Topic | Consumer Group | Event Type | Produced by | Consumed by | Usage |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| generation | generators | start-generation | viewer | generator | The viewer will initiate a generation |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| generation.broadcast | | stop-generation | viewer | generator | The viewer will cancel any active |
| | | | | | generation before starting a new one. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| generation.agent.* |generation-agents | generate-partition | generator | generator | The generator will partition its work |
| | | | | | with its agents. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| requests.collector |requests-collector| clear-vehicles-data | generator | collector | The generator will ask the collector to |
| | | | | | clear all the files before starting the |
| | | | | | generation. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| inbox.generator.*       | generators       | generate-partition-stats   | generator      | generator       | The generator agents report the stats of |
|                         |                  |                            |                |                 | their generated partitions.              |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| inbox.generator.* | generators | clear-vehicles-data-result | collector | generator | The collector will report on the |
| | | | | | completion of the destruction. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| commands.move.*         |                  | move                       | generator      | collector       | The collector will aggregate the move    |
| | | | | | commands and persist them as a chunk. |
| | | | +-----------------+------------------------------------------+
| | | | | viewer | The viewer will display the move of |
| | | | | | each vehicle. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| commands.flush.* | | flush | generator | collector | At the end of the generation, a flush |
| | | | | | command is sent to force the collectors |
| | | | | | to write the accumulated data. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| events | | aggregate-period-created | collector | viewer | Every time a chunk of data is persisted |
| | | | | | by the collector, some stats on the chunk|
| | | | | | will be sent to the viewer. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| events | | vehicle-generation-started | generator | viewer | Clear the map in the viewer when a new |
| | | | | | generation begins. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| query.vehicles | vehicle-finder | vehicle-query | viewer | finder | A client is querying the persisted data. |
| | | | | | The finder will filter the chunks based |
| | | | | | on the time period and the geohashes of |
| | | | | | the polygon filter. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
|query.vehicles.partitions| vehicle-finder- | vehicle-query-partition | finder | finder | The finder is delegating the file |
| | partitions | | | | processing to the cluster of finders. |
| | | | | | The response will be sent using the |
| | | | | | vehicle-query-partition-result-stats evt |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| inbox.viewer.<UID> | | vehicle-query-result | finder | viewer | While parsing the chunks, the finder |
| | | | | | will send all the move commands that |
| | | | | | match the criteria. The viewer will then |
| | | | | | be able to replay them. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| inbox.viewer.<UID> | | vehicle-query-result-stats | finder | viewer | Once the query is complete, the finder |
| | | | | | will send some stats to the viewer, to |
| | | | | | measure the performance and processing |
| | | | | | that was required. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
| inbox.finder.<UID> | | vehicle-query- | finder | finder | This is a partial result sent back to |
| | | partition-result-stats | | | the finder that delegated the |
| | | | | | processing. |
+-------------------------+------------------+----------------------------+----------------+-----------------+------------------------------------------+
Note that when a consumer uses a "consumer group" name, it means that the message will be handled only once by a member of the group. This is a regular work queue with competing consumers, which is different from the default pub/sub case.
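As a sketch of that distinction with the nats.js client (the subject and payload below mirror the table above, but the code itself is illustrative, not the project's actual implementation), a queue subscription makes members of the same group compete for messages:

// Sketch only: competing consumers on the query.vehicles topic using a NATS queue group.
import { connect, JSONCodec } from "nats";

const jc = JSONCodec();
const nc = await connect({ servers: "localhost:4222" });

// Every finder instance subscribes with the same queue group name,
// so each vehicle-query message is handled by exactly one of them.
const sub = nc.subscribe("query.vehicles", { queue: "vehicle-finder" });
(async () => {
  for await (const msg of sub) {
    const query = jc.decode(msg.data);
    console.log("handling vehicle-query", query);
  }
})();

// Without a queue group (plain pub/sub), every subscriber would receive the message.
nc.publish("query.vehicles", jc.encode({ type: "vehicle-query", polygon: [] }));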
sequenceDiagram
viewer ->>+ generator: topic:generation/type:start-generation
activate generator-agent
generator ->> generator-agent: topic:generation.agent.*/type:generate-partition
loop generate move events
activate collector
generator-agent ->> collector: topic:commands.move.*/type:move
loop aggregate events
collector ->> collector: bufferize, then flush
collector -->> viewer: topic:events/type:aggregate-period-created
end
deactivate collector
end
generator-agent ->> generator: topic:inbox.generator.*/type:generate-partition-stats
deactivate generator-agent
note right of generator: force a flush at the end of the generation
generator ->>+ collector: topic:commands.flush.*/type:flush
collector -->>- viewer: topic:events/type:aggregate-period-created
generator ->>- viewer: topic:inbox.viewer.*/type:generation-stats
sequenceDiagram
viewer ->>+ finder: topic:query.vehicles/type:vehicle-query
loop read event files
loop for each matching position
finder ->> viewer: topic:inbox.viewer.*/type:vehicle-query-result
end
end
finder ->>- viewer: topic:inbox.viewer.*/type:vehicle-query-result-stats
sequenceDiagram
viewer ->>+ finder: topic:query.vehicles/type:vehicle-query
par for each matching event file
finder ->>+ finder-agent: topic:query.vehicles.partitions/type:vehicle-query-partition
loop for each matching position
finder-agent ->> viewer: topic:inbox.viewer.*/type:vehicle-query-result
end
finder-agent ->>- finder: topic:inbox.finder.*/type:vehicle-query-partition-result-stats
end
finder ->>- viewer: topic:inbox.viewer.*/type:vehicle-query-result-stats
- web UI for viewing realtime data or query results
- distributed services using a message broker
- durable storage of events
- distributed event generation with multiple instances for higher throughput
- multiple data formats (parquet, csv, json, arrow)
- multiple storage providers (filesystem, S3...)
- flexible aggregation of events
  - max window capacity
  - concurrent time windows
  - multiple data partitioning strategies (geohash, vehicle id...)
- flexible search
  - serialized or parallelized
  - record limit
  - timeout
  - ttl
  - time filters (time range)
  - geoloc filters (polygons) using GeoJSON (see the sketch after this list)
  - data filters (vehicle type)
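To illustrate the geohash partitioning and GeoJSON polygon filtering ideas, here is a minimal, hypothetical sketch using the ngeohash and turf libraries (the precision, coordinates, and polygon are assumptions, not the project's actual values):

// Sketch only: derive a geohash partition key for a position and
// check whether the position falls inside a GeoJSON polygon filter.
import ngeohash from "ngeohash";
import { point, polygon } from "@turf/helpers";
import booleanPointInPolygon from "@turf/boolean-point-in-polygon";

// Partition key: a 5-character geohash groups nearby positions together.
const partitionKey = ngeohash.encode(45.5019, -73.5674, 5); // "f25dv"

// Polygon filter expressed as GeoJSON (coordinates are [lon, lat]).
const searchArea = polygon([[
  [-73.58, 45.49],
  [-73.55, 45.49],
  [-73.55, 45.52],
  [-73.58, 45.52],
  [-73.58, 45.49], // the ring must be closed
]]);

const matches = booleanPointInPolygon(point([-73.5674, 45.5019]), searchArea);
console.log(partitionKey, matches); // "f25dv" true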
+------------+
Event Generator-------| Event Hub |------Event Viewer
| (NATS) |
+------------+
| |
+-----+ +-----+
| |
Event collector Event finder
| | |
+----------+ writes reads
| | |
+-------------+ +-----------------------+
| Event Store | | Event Aggregate Store |
| (In memory) | | (File system or S3) |
+-------------+ +-----------------------+
- Event Hub: NATS
- Event Generator: TypeScript NodeJS
- Event Viewer: Nuxt server, TypeScript, PixiJS
- Event Collector: TypeScript NodeJS, https://pola.rs/, Parquet format, S3, geohash
- Event Store: In memory, DuckDB
- Event Finder: TypeScript NodeJS, geohash, turf
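For the collector side, here is a minimal sketch (not the project's actual code; the column names and output file name are assumptions) of persisting a buffered chunk of move events as Parquet with nodejs-polars:

// Sketch only: buffer move events in memory, then flush them as one Parquet chunk.
import pl from "nodejs-polars";

// Hypothetical buffered move events for one time window / geohash partition.
const moves = [
  { vehicleId: "v-001", timestamp: "2024-01-01T05:05:00Z", lat: 45.5019, lon: -73.5674, geohash: "f25dv" },
  { vehicleId: "v-002", timestamp: "2024-01-01T05:05:05Z", lat: 45.5031, lon: -73.5612, geohash: "f25dv" },
];

const df = pl.DataFrame({
  vehicleId: moves.map((m) => m.vehicleId),
  timestamp: moves.map((m) => m.timestamp),
  lat: moves.map((m) => m.lat),
  lon: moves.map((m) => m.lon),
  geohash: moves.map((m) => m.geohash),
});

// Writing one file per time window and geohash keeps later queries selective.
df.writeParquet("moves-chunk.parquet");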
Inspired by https://learn.microsoft.com/en-us/azure/stream-analytics/event-hubs-parquet-capture-tutorial
+-----> Event Viewer
| + NuxtJS
Event Generator ==> Event Hub -------------+ + PowerBI
+ Azure Event Hubs |
+----> Event Collector -------> Event Aggregate Store <------ Event finder
+ Azure Stream Analytics + Azure Blob + Azure Synapse
See https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-overview and https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-capture-enable-through-portal and https://learn.microsoft.com/en-us/azure/stream-analytics/event-hubs-parquet-capture-tutorial
You should create a Blob storage account with the following attributes:
- Azure Blob Storage or Azure Data Lake Storage Gen 2
- Enable hierarchical namespace = true
Then you can create your new Azure Synapse Analytics resource and use the previously created Storage Account.
See also the tutorial that demonstrates all the steps: https://www.youtube.com/watch?v=WMSF_ScBKDY
- Go to your Azure Synapse portal.
- Select the "Develop" section.
- Click the "+" button and select "SQL script".
- Name it "configure_db".
- Paste the following script and execute it:
CREATE DATABASE vehicles;
GO;
USE vehicles;
-- You should generate a new password
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<REDACTED>';
CREATE DATABASE SCOPED CREDENTIAL VehicleDataCredential
WITH
IDENTITY = 'SHARED ACCESS SIGNATURE',
-- you should copy the SAS token configured for your Storage Account, in the "Shared access signature"
SECRET = 'sv=2022-11-02&ss=b&srt=co&sp=rlf&se=2024-12-31T22:42:44Z&st=2024-12-15T14:42:44Z&spr=https&sig=<REDACTED>';
CREATE EXTERNAL DATA SOURCE VehicleDataEvents WITH (
-- you should replace morganvehicledata with the name of your Storage Account.
LOCATION = 'wasbs://events@morganvehicledata.blob.core.windows.net',
CREDENTIAL = VehicleDataCredential
);
GO;
The strategy is to use the right BULK filter in order to select only the files that contain the data, based on the time range and the geohash list.
- Go to your Azure Synapse portal.
- Select the "Develop" section.
- Click the "+" button and select "SQL script".
- Name it "events_per_file".
- Paste the following script and execute it:
USE vehicles;
SELECT ev.filename() as filename, COUNT(*) as event_count
FROM
OPENROWSET(
BULK '2024-01-01-*-*-*-*.parquet',
DATA_SOURCE = 'VehicleDataEvents',
FORMAT='PARQUET'
) AS ev
WHERE
ev.filepath(1) in ('05', '06', '07')
AND ev.filepath(3) in ('f257v', 'f25k6', 'f25se', 'f25ss')
AND ev.timestamp >= '2024-01-01T05:05:00'
AND ev.timestamp < '2024-01-01T07:05:00'
GROUP BY
ev.filename()
ORDER BY
1
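In these queries, ev.filename() returns the name of the underlying file for each row, and filepath(n) returns the value matched by the nth '*' wildcard in the BULK path pattern; here the first wildcard appears to carry the hour and the third one the geohash, so entire files can be skipped before the timestamp predicate is even evaluated.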
- Go to your Azure Synapse portal.
- Select the "Develop" section.
- Click the "+" button and select "SQL script".
- Name it "get_events".
- Paste the following script and execute it:
USE vehicles;
SELECT ev.*
FROM
OPENROWSET(
BULK '2024-01-01-*-*-*-*.parquet',
DATA_SOURCE = 'VehicleDataEvents',
FORMAT='PARQUET'
) AS ev
WHERE
ev.filepath(1) in ('05', '06', '07')
AND ev.filepath(3) in ('f257v', 'f25k6', 'f25se', 'f25ss')
AND ev.timestamp >= '2024-01-01T05:05:00'
AND ev.timestamp < '2024-01-01T07:05:00'
ORDER BY
ev.timestamp
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-parquet-files https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-specific-files https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/tutorial-data-analyst
Use a service principal and/or workload identities instead of SAS tokens.
- Event Hub: Azure Event Hubs
- Event Generator: TypeScript NodeJS
- Event Viewer: Nuxt server, TypeScript, PixiJS
- Event Collector: Azure Stream Analytics
- Event Store: Azure Event Hubs
- Event Finder: Azure Synapse Analytics
+-----> Event Viewer
| + NuxtJS
Event Generator ==> Event Hub -------------+
+ AWS Eventbridge |
+----> Event Collector -------> Event Aggregate Store <------ Event finder
+ Amazon Data Firehose + AWS S3 + Amazon Athena
- Event Hub: Amazon EventBridge
- Event Generator: TypeScript NodeJS
- Event Viewer: Nuxt server, TypeScript, PixiJS
- Event Collector: Amazon Data Firehose
- Event Store: Amazon EventBridge
- Event Finder: Amazon Athena
- JSON
- CSV
- Arrow
- Parquet
Create a script that computes the cost projections for the data that will be aggregated, stored, and queried.
- Use https://github.com/cloudevents/spec
- Support auto reconnect when the server is not available
- Support multiple messaging systems such as RabbitMQ, AWS SQS or Azure Event Hubs
- Support event persistence
- Support retry in message handling
- Use long-term storage to persist events until they can be aggregated (instead of just using memory)
- Implement consistent hashing for partitioning geohashes across the instances (see the sketch after this list)
- Rewrite in Rust for speedup
- Use a mapping widget to display the vehicles?
- Try other Web frameworks?
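As a hedged illustration of the consistent hashing idea mentioned above (a generic sketch, not the project's design; the instance names, hash choice and replica count are assumptions), a small hash ring can map geohash partitions to instances so that adding or removing an instance only remaps a fraction of the keys:

// Sketch only: a minimal consistent hash ring mapping geohash partitions to instances.
import { createHash } from "node:crypto";

class HashRing {
  // Sorted list of (position, instance) points on the ring.
  private ring: { pos: number; node: string }[] = [];

  constructor(nodes: string[], private replicas = 50) {
    for (const node of nodes) this.add(node);
  }

  private hash(key: string): number {
    // Use the first 8 hex chars of a SHA-1 digest as a 32-bit ring position.
    return parseInt(createHash("sha1").update(key).digest("hex").slice(0, 8), 16);
  }

  add(node: string): void {
    // Virtual nodes spread each instance around the ring for better balance.
    for (let i = 0; i < this.replicas; i++) {
      this.ring.push({ pos: this.hash(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.pos - b.pos);
  }

  remove(node: string): void {
    this.ring = this.ring.filter((p) => p.node !== node);
  }

  // Find the first ring point at or after the key's position (wrapping around).
  lookup(key: string): string {
    const pos = this.hash(key);
    const point = this.ring.find((p) => p.pos >= pos) ?? this.ring[0];
    return point.node;
  }
}

// Hypothetical usage: route move events by geohash to a collector instance.
const ring = new HashRing(["collector-0", "collector-1", "collector-2"]);
console.log(ring.lookup("f25dv")); // stays mostly stable when instances join or leave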
TODO: add a CLI that would listen to stats events and display aggregated stats.
TODO
TODO
Note that you can also use Docker for local development if you prefer. Then jump to the next "Deployment/Docker" section.
Python3 for running some scripts.
python3 -m venv .venv
.venv/bin/pip3 install -r scripts/requirements.txt
NodeJS LTS should be installed.
NATS server should be running. See README for the instructions.
bash scripts/nodejs/cleaup.sh
bash scripts/nodejs/install.sh
You can use the dev mode with hot reload, but it is much slower than the production build when there are a lot of vehicles:
cd ./event/viewer/web/nuxt
npm run dev
open http://localhost:3000/
Or you can build a production release and run it (this is much more performant):
cd ./event/viewer/web/nuxt
npm run build
npm run preview
open http://localhost:3000/
.venv/bin/python3 scripts/nodejs/start.py
- docker
- docker-compose
- kubectl
- nodejs
bash scripts/nodejs/docker-build.sh
The following script will start a docker-compose stack. Note that we run a single instance of each component in this mode.
bash scripts/nodejs/docker-run.sh
cd deployment/kubernetes
npm i
npm run synth
Select the right Kubernetes cluster (either with the KUBECONFIG env var, or with a kubectl context).
kubectl apply -f dist/*
Wait for everything to be created.
Browse to http://vehicle-fleet-viewer.kube.lab.ile.montreal.qc.ca/
If you want to clean up the Kubernetes resources:
kubectl delete -f dist/0001-vehicles.k8s.yaml
kubectl delete -f dist/0000-vehicles-ns.k8s.yaml
https://medium.com/@igorvgorbenko/geospatial-data-analysis-in-clickhouse-polygons-geohashes-and-h3-indexing-2f55ff100fbe#:~:text=H3%20Indexing,-H3%20indexing&text=Similar%20to%20geohashes%2C%20a%20longer,occupy%20a%20fixed%208%20bytes.
https://medium.com/data-engineering-chariot/aggregating-files-in-your-data-lake-part-1-ed115b95425c https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html https://deepak6446.medium.com/why-did-we-move-from-mongodb-to-athena-with-parquet-297b61ddf299
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-real-time-fraud-detection https://www.red-gate.com/simple-talk/cloud/azure/query-blob-storage-sql-using-azure-synapse/
https://hivekit.io/blog/how-weve-saved-98-percent-in-cloud-costs-by-writing-our-own-database/