A Daily Californian analysis of crime in the UC Berkeley area.
This repository contains tools for parsing and visualizing UCPD daily report logs from 2010 to 2015. Much of the code and methodology can be adapted to fit other data sources.
Clone the repo and install the requirements:

```
pip install -r requirements.txt
npm install
```
Set the following environment variables:

- `DB_NAME`: name of a PostGIS database
- `DB_USER`: username with access to said database
If you'd like to deploy to S3 using django-bakery, set these as well:

- `AWS_BUCKET_NAME`
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
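Here's a minimal sketch of how a Django settings module might read these, assuming a standard GeoDjango/PostGIS setup (this repo's actual settings may be organized differently):

```python
# settings.py (sketch): pull the required variables from the environment.
# Assumes the standard GeoDjango PostGIS backend; adjust to taste.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.contrib.gis.db.backends.postgis",
        "NAME": os.environ["DB_NAME"],
        "USER": os.environ["DB_USER"],
    }
}

# Only needed if you publish to S3 with django-bakery.
AWS_BUCKET_NAME = os.environ.get("AWS_BUCKET_NAME")
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")
```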
To get started from scratch, run `python manage.py load`, which will call the following subcommands (a sketch of the umbrella command follows the list):

- `load_bins`, to import hexagonal bins from a shapefile in `data`
- `load_ucpd`, to load historical UCPD crime data
- `classify`, to collapse incident information into one of three categories: violent, property or quality-of-life
- `locate`, to merge location information with the address database to assign each incident a latitude and longitude
- `assign_bin`, to locate each incident within a bin
- `compute_stats`, to compute some basic statistics about crime across bins, across categories and over time
- `pack`, to serialize incident-level information using Tamper
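For orientation, an umbrella command like `load` can chain the subcommands with Django's `call_command`; this is a sketch of the pattern, not necessarily how the command in this repo is written:

```python
# management/commands/load.py (sketch): run the full pipeline in order.
from django.core.management import call_command
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Run the full data-loading pipeline from scratch."

    def handle(self, *args, **options):
        for name in (
            "load_bins",      # import hexagonal bins from the shapefile
            "load_ucpd",      # load historical UCPD crime data
            "classify",       # collapse incidents into V/P/Q categories
            "locate",         # geocode incidents against the address database
            "assign_bin",     # place each incident in a hexagonal bin
            "compute_stats",  # per-bin, per-category, over-time statistics
            "pack",           # serialize incident-level data with Tamper
        ):
            self.stdout.write("Running %s..." % name)
            call_command(name)
```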
Incident-level reports come from a Public Records Act (PRA) request filed with the UC Police Department. They cover January 2010 to September 2015. The raw data files are stored in `data/ucpd`.
Hexagonal bins were generated in QGIS. The shapefile is stored in `data/bins`.
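Importing a shapefile like this is a natural fit for GeoDjango's `LayerMapping`. A sketch, with a hypothetical `Bin` model, field mapping and filename (check the shapefile's actual attributes before using):

```python
# load_bins (sketch): import the hexagonal bins with GeoDjango's LayerMapping.
import os

from django.contrib.gis.utils import LayerMapping
from django.core.management.base import BaseCommand

from crime.models import Bin  # hypothetical model

# Hypothetical mapping from model fields to shapefile attributes.
bin_mapping = {"bin_id": "BIN_ID", "geom": "POLYGON"}


class Command(BaseCommand):
    def handle(self, *args, **options):
        shp = os.path.join("data", "bins", "bins.shp")  # filename assumed
        LayerMapping(Bin, shp, bin_mapping).save(strict=True, verbose=True)
```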
A simple spreadsheet maps the codes in the raw data to category codes: `V` for violent crimes, `P` for property crimes and `Q` for quality-of-life crimes. `N` is reserved for crimes that we aren't interested in analyzing or displaying.
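A minimal sketch of the collapsing step, assuming the spreadsheet is exported as a CSV with `code` and `category` columns (both names are assumptions):

```python
# classify (sketch): collapse raw incident codes into V/P/Q, dropping N.
import csv


def load_classification(path):
    with open(path, newline="") as f:
        return {row["code"]: row["category"] for row in csv.DictReader(f)}


def classify(incidents, mapping):
    for incident in incidents:
        category = mapping.get(incident["code"], "N")
        if category != "N":  # N marks crimes we don't analyze or display
            incident["category"] = category
            yield incident
```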
A spreadsheet maps the addresses in the raw data to geocoded points, which were manually corrected and checked. The address database lives in a Google spreadsheet here.
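The merge itself can be a simple dictionary join. A sketch, assuming the spreadsheet is exported as a CSV with `address`, `lat` and `lng` columns (all three names are assumptions):

```python
# locate (sketch): join incidents to the geocoded address database.
import csv


def load_addresses(path):
    with open(path, newline="") as f:
        return {
            row["address"]: (float(row["lat"]), float(row["lng"]))
            for row in csv.DictReader(f)
        }


def locate(incidents, addresses):
    for incident in incidents:
        point = addresses.get(incident["address"])
        if point is not None:  # skip addresses we couldn't geocode
            incident["lat"], incident["lng"] = point
            yield incident
```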
Tamper is a New York Times library for efficient serialization of categorical data. We use Tamper instead of raw JSON in order to experiment with sending all incidents to the user's browser, then using PourOver to quickly sort and filter that data client-side.

Because Tamper packs categorical attributes, we can't efficiently send raw coordinates for each individual incident. Instead, we assign each incident to a bin and send only its bin ID. With small enough bins, this gives a fairly detailed look at the spatial distribution of crime while keeping the data file remarkably light (41KB, in this case).
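A back-of-the-envelope comparison of the two payloads (this is not Tamper's wire format, and the bin count here is made up):

```python
# Why bin IDs pack so much smaller than coordinates: a rough size comparison.
import json
import math

incidents = [{"lat": 37.8719, "lng": -122.2585, "bin": 142, "category": "P"}] * 10000

# Raw JSON with full coordinates for every incident:
as_json = len(json.dumps([[i["lat"], i["lng"], i["category"]] for i in incidents]))

# Categorical encoding: with, say, 300 bins and 3 categories, each incident
# needs only ceil(log2(300)) + ceil(log2(3)) = 9 + 2 = 11 bits.
bits_per_incident = math.ceil(math.log2(300)) + math.ceil(math.log2(3))
packed = len(incidents) * bits_per_incident // 8

print("%d bytes as JSON vs ~%d bytes bit-packed" % (as_json, packed))
```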
While this is more of an experiment than a necessity at this scale (~10,000 incidents), it's an interesting model for scaling up to hundreds of thousands of incidents, something we've tried with historical data from the city police department.
Build the site out as flat files by running `python manage.py build`. If you've set the appropriate environment variables, publish to S3 with `python manage.py publish`.
Thanks django-bakery!
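For reference, the django-bakery side of the settings looks something like this sketch; `BUILD_DIR` and `BAKERY_VIEWS` are real django-bakery settings, but the view path below is a hypothetical placeholder:

```python
# settings.py (sketch): the django-bakery configuration.
import os

BASE_DIR = os.path.dirname(os.path.abspath(__file__))

BUILD_DIR = os.path.join(BASE_DIR, "build")  # where `build` writes flat files
BAKERY_VIEWS = ("crime.views.IndexView",)    # hypothetical buildable view

# `publish` syncs BUILD_DIR to this bucket using the AWS_* variables above.
AWS_BUCKET_NAME = os.environ.get("AWS_BUCKET_NAME")
```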
We want to try scaling up this binning methodology to bigger datasets. That would involve creating a new shapefile and coming up with new address and classification dictionaries, but the rest of the loading, binning and serialization code should work.
We tried a few PourOver filters beyond our basic classification (violent, property, quality-of-life), but none proved interesting for this particular dataset. For categorical variables, though, this approach enables very fast visualizations of geospatial data without running a server.