This repo provides scripts to download, process, and analyze data for billions of taxi and for-hire vehicle (Uber, Lyft, etc.) trips originating in New York City since 2009. The data is stored in a PostgreSQL database, with PostGIS used for spatial calculations.
Statistics through June 30, 2019:
- 2.45 billion total trips
- 1.65 billion taxi
- 800 million for-hire vehicle
- 279 GB of raw data
- Database takes up 378 GB on disk with minimal indexes
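As a quick sanity check on the figures above, the short sketch below derives a few ratios from them. It uses only the numbers quoted in this README and assumes 1 GB = 10^9 bytes; the variable names are illustrative and do not come from the repo's scripts.

```python
# Figures quoted in the statistics list above (trips in millions, sizes in GB).
taxi_trips_m = 1650
fhv_trips_m = 800
total_trips_m = 2450
raw_gb = 279
db_gb = 378

# The two trip categories should add up to the stated total.
assert taxi_trips_m + fhv_trips_m == total_trips_m

# Share of for-hire vehicle trips in the full dataset.
fhv_share = fhv_trips_m / total_trips_m

# How much larger the database is than the raw files.
db_overhead = db_gb / raw_gb

# Rough on-disk footprint per trip, assuming 1 GB = 10**9 bytes.
bytes_per_trip = db_gb * 1e9 / (total_trips_m * 1e6)

print(f"FHV share of trips: {fhv_share:.1%}")
print(f"Database size relative to raw data: {db_overhead:.2f}x")
print(f"Approximate bytes per trip on disk: {bytes_per_trip:.0f}")
```

These ratios are only as precise as the rounded figures they are derived from.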
Create an instance of Postgres in Azure
Download raw data
python Setup/
Modify paths in the script and run it to load the CSV data into the DB. Then populate the database:
python Setup/ --host="<server-name>" --port=5432 --user="<admin-username>" --dbname="<database-name>" --password="<admin-password>" --sslmode="require"
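For reference, here is a minimal sketch of how flags like the ones above could be turned into a libpq connection string. It is an illustration only: `parse_args` and `build_dsn` are hypothetical helpers, and the repo's own setup script may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the connection flags shown in the command above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Connection settings sketch")
    parser.add_argument("--host", required=True)
    parser.add_argument("--port", type=int, default=5432)
    parser.add_argument("--user", required=True)
    parser.add_argument("--dbname", required=True)
    parser.add_argument("--password", required=True)
    parser.add_argument("--sslmode", default="require")
    return parser.parse_args(argv)


def build_dsn(host, port, user, dbname, password, sslmode):
    """Build a libpq keyword/value connection string (hypothetical helper)."""
    params = {
        "host": host,
        "port": port,
        "user": user,
        "dbname": dbname,
        "password": password,
        "sslmode": sslmode,
    }
    parts = []
    for key, value in params.items():
        # libpq values may be single-quoted; backslashes and single quotes
        # inside the value are escaped with a backslash.
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        parts.append(f"{key}='{escaped}'")
    return " ".join(parts)
```

Calling `build_dsn(**vars(parse_args()))` would yield a string that libpq-based clients accept. Quoting every value keeps passwords containing spaces or quotes intact.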
Analysis
Additional Postgres and R scripts for analysis are in the analysis/ folder
Install Docker
To run the server:
docker run -d --name ht_pg_server -v ht_dbdata:/var/lib/postgresql/data -p 54320:5432 postgres:11
Check the logs to see if it is running:
docker logs -f ht_pg_server
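If you would rather script this check than watch the logs, the sketch below polls the published port until it accepts TCP connections. It assumes the container was started with the -p 54320:5432 mapping shown above; `wait_for_port` is a hypothetical helper, not part of this repo. An open port is only a rough signal, since Postgres may still be initializing when the port first opens.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError if nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 54320)` checks the host side of the port mapping and gives up after 30 seconds by default.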
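Create the database (the official postgres image already creates a default database named postgres, so this command reports that the database exists and can be skipped in that case):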
docker exec -it ht_pg_server psql -U postgres -c "create database postgres"
Download raw data
python Setup/
Modify paths in the script and run it to load the CSV data into the DB. Then populate the database:
python Setup/ --host="localhost" --port=54320 --user="<admin-username>" --dbname="postgres" --password="<admin-password>" --sslmode="allow"
Analysis
Additional Postgres and R scripts for analysis are in the analysis/ folder, or you can write your own!