Brief Overview:
This project analyses socio-economic and crime data from various datasets to group US states based on their characteristics and identify factors that significantly predict assault rates. It was undertaken as part of a technical assessment for a job application at a Singapore public agency.
The goal was to demonstrate analytical and statistical skills, including clustering and regression analysis, in the context of real-world data challenges. Insights from this project can aid in understanding socio-economic patterns and crime trends across the US.
Objectives:
- Group US states based on socio-economic and crime characteristics using clustering analysis.
- Identify significant predictors of assault rates through regression models.
- Provide data-driven insights to better understand socio-economic disparities and crime rates.
Table of Contents:
- Introduction
- Installation
- Usage
- Data
- Methodology
- Results
- Project Structure
- License
- Contact Information
Prerequisites:
- R version 4.0+
- RStudio
Dependencies: Install the required R libraries by running the following in R:
install.packages(c("tidyverse", "cluster", "factoextra", "broom", "car", "ggcorrplot", "usmap"))
Instructions:
- Access States-Grouping_Assault-Prediction.md to view the pre-run analysis.
or
- Download the folder, States-Grouping_Assault-Prediction.
- Open States-Grouping_Assault-Prediction.Rmd in RStudio.
- Knit the file to generate an HTML report. Alternatively, run the code chunks sequentially to reproduce the analysis.
Data Sources:
- USArrest.csv: Contains crime data, including assault rates, for US states.
- USstatex77.csv: Includes socio-economic indicators for US states.
Data Description:
- USArrest.csv: Columns include Murder, Assault, UrbanPop, and Rape.
- USstatex77.csv: Contains indicators like Population, Income, Illiteracy, and Life Expectancy.
Data Processing:
- Merged multiple datasets on the State column.
- Normalised numeric variables for clustering.
- Handled missing data and multicollinearity issues for regression analysis.
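The processing steps above can be sketched roughly as follows. This is a minimal illustration and not the code from the Rmd file: it assumes the two CSV files mirror R's built-in USArrests and state.x77 datasets, uses base R only, and every object name (arrests, socio, states, scaled, vif) is hypothetical.

```r
# Stand-ins for the project's CSV files (assumption: USArrest.csv and
# USstatex77.csv mirror R's built-in USArrests and state.x77 datasets).
arrests <- USArrests
arrests$State <- rownames(arrests)

socio <- as.data.frame(state.x77)
names(socio) <- make.names(names(socio))  # e.g. "Life Exp" becomes "Life.Exp"
socio$State <- rownames(socio)

# Merge on the State column. Both tables carry a Murder column, so the
# suffixes keep the two versions apart.
states <- merge(arrests, socio, by = "State", suffixes = c(".arrest", ".x77"))

# Missing data: count NAs per column, then keep complete rows only.
na_counts <- colSums(is.na(states))
states <- states[complete.cases(states), ]

# Normalise numeric variables (mean 0, standard deviation 1) for clustering.
num_cols <- names(states)[sapply(states, is.numeric)]
scaled <- scale(states[num_cols])
rownames(scaled) <- states$State

# Multicollinearity: variance inflation factors for the candidate predictors,
# computed as the diagonal of the inverse predictor correlation matrix.
predictors <- states[setdiff(num_cols, "Assault")]
vif <- setNames(diag(solve(cor(predictors))), names(predictors))
round(sort(vif, decreasing = TRUE), 2)
```

For a model with only numeric main effects, this manual calculation gives the same values as car::vif. Predictors with a large VIF are candidates for removal before fitting the regression.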
Techniques Used:
- Clustering Analysis: Grouped states using K-means clustering with 4 clusters.
- Regression Analysis: Identified significant predictors of assault rates using multiple linear regression.
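A hedged sketch of both techniques is shown here. It assumes the built-in USArrests and state.x77 datasets as stand-ins for the CSV files, and the choice of variables in dat and in the model formula is illustrative only. The actual variable selection lives in the Rmd file.

```r
# Assumption: built-in datasets stand in for the project's CSV files.
socio <- as.data.frame(state.x77)[rownames(USArrests), ]
names(socio) <- make.names(names(socio))  # "Life Exp" becomes "Life.Exp"
dat <- cbind(USArrests, socio[c("Income", "Illiteracy", "Life.Exp")])

# K-means with 4 clusters on standardised variables. The algorithm starts
# from random centres, so fix the seed and try several starts.
set.seed(42)
km <- kmeans(scale(dat), centers = 4, nstart = 25)
table(km$cluster)  # number of states per cluster

# Multiple linear regression of assault rate on a hypothetical predictor set.
fit <- lm(Assault ~ UrbanPop + Rape + Life.Exp + Income + Illiteracy,
          data = dat)
coefs <- summary(fit)$coefficients

# Keep only the terms whose p-value is below 0.05.
coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
```

Standardising before K-means matters because the algorithm uses Euclidean distance, so a variable on a large scale such as Income would otherwise dominate the grouping.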
Tools:
- Programming Languages: R
- Libraries: tidyverse, cluster, factoextra, ggcorrplot, car, usmap
Findings:
- Cluster Profiles: States were grouped into clusters based on crime rates, income, urbanisation, and education.
- Regression Insights: Higher urban population, higher rape rates, and lower life expectancy were associated with higher assault rates.
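Cluster profiles of the kind described above are typically read off a table of per-cluster means. The sketch below shows one way to build such a table. It assumes the built-in USArrests dataset as a stand-in for the project data, and the names profiles and n_states are hypothetical.

```r
# Assumption: USArrests stands in for the merged project data.
set.seed(42)
km <- kmeans(scale(USArrests), centers = 4, nstart = 25)

# Mean of each original (unscaled) variable within each cluster, so the
# profile can be read in the variables' own units.
profiles <- aggregate(USArrests, by = list(cluster = km$cluster), FUN = mean)

# Cluster sizes, in the same ascending cluster order as the profile rows.
profiles$n_states <- as.vector(table(km$cluster))
profiles

# States belonging to each cluster.
split(rownames(USArrests), km$cluster)
```

Cluster labels from K-means are arbitrary, so the numbering can differ between runs or seeds even when the grouping itself is the same.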
Visualisations:
- Cluster Visualisation: Principal Component Analysis (PCA) plot showing clusters.
- Correlation Matrix: Displays relationships among variables.
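Both visualisations can be approximated in base R, as sketched below. The project lists factoextra and ggcorrplot among its libraries. This example avoids them to stay dependency-free and assumes the built-in USArrests dataset as a stand-in for the project data, so it will not reproduce the report's figures exactly.

```r
# Assumption: USArrests stands in for the merged project data.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]                          # first two principal components
var_explained <- pca$sdev^2 / sum(pca$sdev^2)   # share of variance per component

set.seed(42)
km <- kmeans(scale(USArrests), centers = 4, nstart = 25)

# PCA scatter plot with points coloured by cluster membership.
plot(scores, col = km$cluster, pch = 19,
     xlab = sprintf("PC1 (%.0f%% of variance)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.0f%% of variance)", 100 * var_explained[2]),
     main = "K-means clusters on the first two principal components")
text(scores, labels = rownames(USArrests), pos = 3, cex = 0.6)

# Correlation matrix of the variables, rounded for reading.
corr <- cor(USArrests)
round(corr, 2)
```

When run with Rscript instead of an interactive session, the plot is written to Rplots.pdf in the working directory.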
Interpretation:
- Policies targeting urban population management and crime reduction can potentially mitigate assault rates.
Directory Tree:
States-Grouping_Assault-Prediction/
│
├── README.md # Project overview and instructions
├── States-Grouping_Assault-Prediction.md # Main analysis report
├── States-Grouping_Assault-Prediction.Rmd # Main analysis script
├── US_Arrest_Data/ # Data files (e.g., USArrest.csv, USstatex77.csv)
└── States-Grouping_Assault-Prediction_files/figure-gfm/ # Visuals used in the report
Key Files:
- States-Grouping_Assault-Prediction.md: The main Markdown report with analysis, viewable directly on GitHub.
- States-Grouping_Assault-Prediction.Rmd: The main R Markdown file containing the analysis.
- US_Arrest_Data/: Folder containing the input datasets.
- States-Grouping_Assault-Prediction_files/figure-gfm/: Folder containing the figures generated for the report.
This project is intended for submission as part of a technical assessment. The content and code are not intended for public distribution, reproduction, or commercial use without explicit permission from the Author.
Author: Ou Yang Yu