In this project, we apply time series analysis to the NYC crime dataset and the Chicago crime dataset. The time series methods applied are ARIMA and seasonal ARIMA.
.
├── README.md
├── ana_code
│   ├── analysis_chicago_all.ipynb
│   ├── analysis_chicago_type1.ipynb
│   ├── analysis_chicago_type2.ipynb
│   ├── analysis_merged_all.ipynb
│   ├── analysis_merged_type1.ipynb
│   ├── analysis_merged_type2.ipynb
│   ├── analysis_nyc_all.ipynb
│   ├── analysis_nyc_type1.ipynb
│   └── analysis_nyc_type2.ipynb
├── data
│   ├── battery_occurrence_per_day.csv
│   ├── crime_occurrence_per_day.csv
│   ├── nypd_all.csv
│   ├── nypd_assault.csv
│   ├── nypd_larceny.csv
│   └── theft_occurrence_per_day.csv
├── data_ingest
│   ├── tz2076
│   │   └── ingest.sh
│   └── yx2021
│       └── ingest.sh
├── etl_code
│   ├── tz2076
│   │   └── cleaning.scala
│   └── yx2021
│       ├── pipeline
│       │   ├── etl1.scala
│       │   └── etl2.scala
│       └── shell
│           ├── etl1.scala
│           ├── etl2.scala
│           └── loading.scala
├── output
│   ├── chicago_all_pred.jpg
│   ├── chicago_type1_pred.jpg
│   ├── chicago_type2_pred.jpg
│   ├── merged_all_pred.jpg
│   ├── merged_type1_pred.jpg
│   ├── merged_type2_pred.jpg
│   ├── nyc_all_pred.jpg
│   ├── nyc_type1_pred.jpg
│   └── nyc_type2_pred.jpg
└── profiling_code
    ├── tz2076
    │   └── profiling.scala
    └── yx2021
        └── profiling.scala
Data sources: NYC Crime Data and Chicago Crime Data.
- Download the two datasets from their source websites.
- Upload them to NYU HPC (High-Performance Computing).
- Put the data onto HDFS for big data processing with the command `hdfs dfs -put local_file_path hdfs_file_path`, which can be found in the folder `data_ingest`.
The input data is located at `hdfs://nyu-dataproc-m/user/tz2076_nyu_edu/final_project/data/nypd_raw_data.csv`.
The code reads in the data that I provide and performs the data cleaning job through Spark.
Below are the columns that I selected for use in further work:
|-- id: integer (nullable = true)
|-- date: date (nullable = true)
|-- typeId: integer (nullable = true)
|-- typeDesc: string (nullable = true)
This dataframe is saved to be used in the profiling process and is located at `hdfs://nyu-dataproc-m/user/tz2076_nyu_edu/final_project/data/nypd_cleaned.csv`.
After that, I aggregate information from the data.
- In the extraction stage, I filter out two extra dataframes which contain only LARCENY crimes or only ASSAULT crimes, using regular-expression-based methods.
- Using the original dataframe (which contains all crime types) and these two dataframes, I aggregate the number of crimes on each day with `groupby("date")` separately (a PySpark sketch of these steps appears at the end of this section).
- Save those data separately in the directory `final_project/data`.
Here are the files of processed data:
|-- `nyc_all.csv`
|-- `nyc_assault.csv`
|-- `nyc_larceny.csv`
These three aggregated dataframes contain these two columns:
|-- date: date (nullable = true)
|-- count: integer (nullable = true)
All of this work can be done by running `spark-shell --deploy-mode client -i cleaning.scala` on the HPC.
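For illustration, here is a minimal PySpark sketch of the extraction and aggregation steps described above. The project's actual implementation is the Scala file `etl_code/tz2076/cleaning.scala`; the input path, output paths, and regular expressions below are assumptions made only for this example, and the column names follow the schema listed earlier.

```python
# Illustrative PySpark sketch; the project itself implements this in Scala (cleaning.scala).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nypd_aggregation").getOrCreate()

# Cleaned dataframe with the schema shown above: id, date, typeId, typeDesc (path is an assumption).
df = spark.read.csv("final_project/data/nypd_cleaned.csv", header=True, inferSchema=True)

# Extraction: keep only LARCENY or only ASSAULT records via a regex on the type description.
larceny = df.filter(F.col("typeDesc").rlike("LARCENY"))
assault = df.filter(F.col("typeDesc").rlike("ASSAULT"))

# Aggregation: count crimes on each day, producing the (date, count) schema shown below.
def daily_counts(frame):
    return frame.groupBy("date").count().orderBy("date")

daily_counts(df).write.mode("overwrite").csv("final_project/data/nyc_all.csv", header=True)
daily_counts(larceny).write.mode("overwrite").csv("final_project/data/nyc_larceny.csv", header=True)
daily_counts(assault).write.mode("overwrite").csv("final_project/data/nyc_assault.csv", header=True)
```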
This stage explores the intermediate data `nypd_cleaned.csv` from the cleaning stage and generates some summary information about the dataset.
- Cast `id` and `typeId` into INTEGER.
- Cast `date` into a date object with the format `MM/dd/yyyy`.
- Find which week each date falls in, and store the information as `week_number`.
- Get the distinct values in `date` and `typeId`.
- Check the mean value of `week_number`.
- Group the records by `typeId` and `typeDesc` and aggregate their counts.
All of this work can be done by running `spark-shell --deploy-mode client -i profiling.scala` on the HPC. A condensed PySpark sketch of these profiling steps is shown below.
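The following is a rough PySpark sketch of the same profiling steps; the project implements them in Scala in `profiling_code/tz2076/profiling.scala`. The input path is an assumption, and the column names follow the cleaning schema above.

```python
# Illustrative PySpark sketch of the profiling steps; the project implements them in Scala.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nypd_profiling").getOrCreate()
df = spark.read.csv("final_project/data/nypd_cleaned.csv", header=True)

# Cast id and typeId to integers, and parse the date with the MM/dd/yyyy format.
df = (df.withColumn("id", F.col("id").cast("int"))
        .withColumn("typeId", F.col("typeId").cast("int"))
        .withColumn("date", F.to_date(F.col("date"), "MM/dd/yyyy")))

# Derive the week each record falls in and store it as week_number.
df = df.withColumn("week_number", F.weekofyear(F.col("date")))

# Distinct values and a simple summary statistic.
df.select("date").distinct().show()
df.select("typeId").distinct().show()
df.select(F.mean("week_number")).show()

# Record counts per crime type.
df.groupBy("typeId", "typeDesc").count().show()
```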
Use the file `profiling_code/yx2021/profiling.scala` to profile the dataset. Steps include:
- Show dataset schema.
- Count distinct values.
- Find all distinct values for categorical variables.
Use the files in `etl_code/yx2021/` to do the data cleaning and transformation. The files in the `pipeline` folder and the `shell` folder provide the same functionality. Each file in the `pipeline` folder can be executed as a whole Scala file with the command `spark-shell --deploy-mode client -i FILENAME.scala`. The files in the `shell` folder, on the other hand, were written to be run line by line in an interactive Scala shell, so that the user can get a sense of how the dataset is processed. Cleaning and transformation steps include:
`etl1.scala` (a condensed sketch of these steps appears at the end of this section)
- Drop unnecessary columns.
- Convert the date field into a date object.
- Drop rows that contain NaN values.
- Write the cleaned dataset to `crime_type_data.csv`.

`etl2.scala`
- Aggregate the data by date to get the crime occurrences on each day.
- Filter the data on crime type "theft" and then aggregate by date to get `theft_occurrence_per_day.csv`.
- Filter the data on crime type "battery" and then aggregate by date to get `battery_occurrence_per_day.csv`.
`loading.scala`
The code in this file is used to open a Scala shell and load the raw data `Chi_Crimes_2001_to_Present.csv` into a Scala dataframe.
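For reference, here is a rough PySpark sketch of the `etl1.scala` cleaning steps (the actual code is Scala). The selected column names and the date format below are assumptions about the raw Chicago file, made only for illustration.

```python
# Illustrative PySpark sketch of the etl1 cleaning steps; the project implements them in Scala.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("chicago_etl1").getOrCreate()
raw = spark.read.csv("Chi_Crimes_2001_to_Present.csv", header=True)

# Keep only the columns needed downstream (hypothetical selection), then
# convert the date field into a date object and drop rows with missing values.
cleaned = (raw.select("ID", "Date", "Primary Type")
              .withColumn("Date", F.to_date(F.col("Date"), "MM/dd/yyyy hh:mm:ss a"))
              .na.drop())

# Write the cleaned dataset for the aggregation step in etl2.
cleaned.write.mode("overwrite").csv("crime_type_data.csv", header=True)
```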
- Install the Google Cloud CLI (`gcloud`) on your local machine (see the gcloud installation tutorial).
- In your terminal, run `gcloud auth login`. A web page will open asking you to log in with your authorized Google account.
- On the HPC, in your home directory, run the command `jupyter-notebook`. It will print a URL for you to copy and paste into a new browser tab.
- From the output produced by jupyter-notebook, obtain the port number the notebook is running on. Run the command below in your terminal, replacing both occurrences of PORT with that port number:
gcloud compute ssh nyu-dataproc-m --project hpc-dataproc-19b8 --zone us-central1-f -- -N -L PORT:localhost:PORT
For example, if the notebook is running on port 8888, the command should be:
gcloud compute ssh nyu-dataproc-m --project hpc-dataproc-19b8 --zone us-central1-f -- -N -L 8888:localhost:8888
- Copy and paste the URL you obtained in step 3 into a new browser tab; you should now see the Jupyter notebook interface.
In total we have 3 (cities) * 3 (crime types) = 9 notebooks. Cities include Chicago, NYC, and a merge of both. Crime types include all crime types as a whole, type 1 (theft/larceny), and type 2 (battery/assault). We used type 1 and type 2 because NYC and Chicago encode crime types differently; based on our research on the crime type encoding systems, we treat theft and larceny as type 1 and battery and assault as type 2.
All notebooks follow the same analysis pipeline, which includes the steps below (a condensed sketch follows the list):
- Import the required packages and resolve dependencies.
- Load the datasets, initialize dataset-specific variables, and set up the data frame (convert to date objects and set them as the index).
- Visualize crime frequency at different scales (daily, weekly, monthly). All following steps are performed on the monthly data, because it shows the clearest pattern.
- Check whether the dataset is stationary with `adfuller_test()`, the augmented Dickey-Fuller test.
- Perform differencing with a shift of 12 months (compute the difference between each observation and the corresponding observation from the same month of the previous year) to remove the macro trend component from the data.
- Run `adfuller_test()` again on the differenced data and check whether it is stationary.
- Create `ACF` and `PACF` plots to find potential lag values with high autocorrelation and partial autocorrelation.
- Split the dataset into training and test sets.
- Train the seasonal ARIMA model and optimize it with parameter tuning.
- Show the model summary.
- Use the model to predict on the test set and visualize the true values against the predicted values.
- Output the plots into the `output` folder.
- Evaluate each model on the test set by RMSE (root mean square error).
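Below is a condensed Python sketch of the core modeling steps from the list above (monthly aggregation, stationarity check, 12-month seasonal differencing, seasonal ARIMA fit, and RMSE evaluation). The input file, column names, train/test split, and SARIMA orders are illustrative assumptions; the notebooks in `ana_code` contain the full pipeline and the tuned parameters.

```python
# Condensed sketch of the notebook pipeline; file name, column names, and orders are illustrative.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Load daily counts and resample to monthly totals (monthly data shows the clearest pattern).
df = pd.read_csv("data/crime_occurrence_per_day.csv", parse_dates=["date"], index_col="date")
monthly = df["count"].resample("M").sum()

# Augmented Dickey-Fuller test: a p-value <= 0.05 suggests the series is stationary.
print("ADF p-value (raw):", adfuller(monthly)[1])

# Seasonal differencing with a shift of 12 months: y'(t) = y(t) - y(t - 12).
seasonal_diff = monthly.diff(12).dropna()
print("ADF p-value (12-month differenced):", adfuller(seasonal_diff)[1])

# Train/test split (last 24 months held out), then fit a seasonal ARIMA model.
train, test = monthly[:-24], monthly[-24:]
model = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)

# Predict over the test horizon and evaluate with RMSE.
pred = model.predict(start=test.index[0], end=test.index[-1])
rmse = np.sqrt(np.mean((test - pred) ** 2))
print("RMSE:", rmse)
```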
The models we trained in this project could be used to predict future crime peaks. Specifically, our models work best at predicting type 1 crimes. In addition, the models trained on a single city perform better than the one trained on the merged dataset.
If you are interested, follow the instructions above to reproduce the results, feel free to make modifications, and open an issue or pull request if you find any interesting insights!