A Linear Regression Model of US Crime Rate Prediction

Mengfan Luo, Yushan Wang, Jing Lyu, Yiqun Jin, Mingkuan Xu

Abstract

In this project, we aim to build regression models based on a set of demographic variables to estimate county-level crime rates. After an exploratory analysis of the variables on their distributions and correlations, we derived more meaningful variables by manipulating the existing ones, removed outlier values, and implemented several variable selection methods. Using the selected variables, we trained a linear regression model, elaborated on the model by adding several interactive terms, and did cross-validations. All 3 resulting models have achieved a good estimation of the training set. We selected the third model (adjusted R-square: 0.542, RMSE: 16.46, RMSPE: 11.90) as the final model, as it has the best prediction on testing set. Further studies can be done on correcting the dataset using external data sources, as well as using more sophisticated non-linear models.

Introduction

Over the last three decades, crime has become a major public concern in the US arousing massive political discussion and public expenditure[1]. Crime rates in major cities experienced a general rise from the 1960s to 1990s, with two peaks observed in 1980 and in early 1990s[2]. Despite extensive attention across the nation, factors influencing crime trends were not yet made clear[1]. In this project, we examined crime rate and potential factors that affect the crime rate in “County Demoraphic Information” (CDI), and constructed multiple linear regression model to predict crime rate.

Methods

Data Description

We analyzed data from the “County Demographic Information” (CDI) data set, which contains characteristics of 440 counties in the United States collected from 1990-1992. The primary goal of this investigation is to develop insight relevant to predicting the crime rate in counties.

Data Preprocessing Transform vairables in order to extract interpretable information.

Exploratory Analysis Calculate the pairwise correlations between variables and ist all correlations between the crime rate (our interest) and other variables.

Statistical Methods

Training/Testing Set Split: Randomly split the dataset into training (90%) and testing sets (10%) to avoid overfitting.

Remove Outliers and High Leverage Points: Use percentile to detect potential outliers and high leverage points. Remove rows containing the smallest and largest 0.2% for each variable given the size of the dataset.

Variable Selection: Select key variables using stepwise regression and criteria based procedure and conduct interaction analysis,

Model Construction: Build linear regression models using selected variables, interaction terms (p-value < 0.2). Attempt model transformations.

Cross Validation: Use 5-fold cross validation on each model to estimate model performances.

Model Assessment: Assess the models and choose a final model used for future prediction. Compare the models based on three criteria: the R-square values, the root mean square error (RMSE), and the root mean square prediction error (RMSPE).

R-square value represents the proportion of the variance that can be explained by the regression model.
RMSE measures the differences between the actual values and the predicted values in the training dataset.
RMSPE estimates the prediction errors on new data outside the training dataset.

Results

Descriptive Analysis

After importing the csv file containing the County Demographic Information (CDI) data, we noticed that crimes, physicians, and hospital beds are given as numbers, while other info are given as proportions. We therefore computed the number of crimes, physicians, and hospital beds per 1000 people. Population density could be a key factor to crime rate. Thus, we also derived a new variable, density, which is population divided by area.

After drawing boxplots to show the distribution of the variables, we identified several extreme values in each of the variables. These values can be treated as potential outliers to be removed in further analysis. For example, the distribution of crime rate:

From the pairwise plots, we did not recognize any relationship between each variable. After taking a closer look of each variable, we calculated pairwise correlations and listed all the correlations between the crime rate (our interest) and other variables. Correlation analysis suggested that the derived variable, density, is a more meaningful variable compared to area and population, with a stronger association to the crime rate.

After a preliminary analysis of the data, we identified several variables that might be relevant to the crime rate as listed:

Variable	Meaning
area	Land area
density	Population Density
pop	Esimate 1990 population
pop18	Percent of population aged 18-34
pop65	Percent of population aged 65+
docs_rate_1000	Number of active physicians per 1000 people
beds_rate_1000	Number of hospital beds per 1000 people
crime_rate_1000	Number of serious crimes per 1000 people
hsgrad	Percent high school graduates
bagrad	Percent bachelor’s degrees
poverty	Percent below poverty level
unemp	Percent unemployment
pcincome	Per capita income
totalinc	Total personal income
region	Geographic region (1=Northeast, 2=North Central, 3=South, 4=West)

Potential Variables Relevant to the Crime Rate

Traning/Testing Split

We randomly sampled 10% from the dataset as a testing set (n = 44) and put the rest into a training set (n = 396).

Data Cleaning

We removed 19 rows with outliers and high leverage points from a total of 396 training data.

Variables Selection

Based on stepwise procedure, we selected the following variables:

backward	stepwise
pop	pop
pop18	pop18
bagrad	bagrad
poverty	poverty
pcincome	pcincome
totalinc	totalinc
region2	region2
region3	region3
region4	region4
beds_rate_1000	beds_rate_1000

According to the output of criteria based procedure, we determined that the number of variables should be above 12 because C_p ≤ p. Based on this analysis, we found that unemp and density could also be selected.

In addition, We removed totalinc, because it can be replaced with totalinc = pcincome * pop.

Interaction Analysis

Interaction term 1: poverty + income

According to the Census Bureau, the number of people below the official government poverty level was 33.6 million in 1990, representing 13.5 percent of the national population [4]. Thus, we used this criteria to divide poverty into two categories: higher than national poverty rate and lower than national poverty rate. In figure 6, we observed an intersection of the two lines, suggesting that the association between crime rate and per capita income is modified by poverty status.

Interaction term 2: pcincome + bagrad

According to the Census Bureau, the percentage of people 25 years old or older with bachelor’s degrees was 20.8% in 1990 [4]. Thus, we used this criteria to divide bagrad into two categories: higher than national bagrad and lower than national bargrad. In figure 7, we observed that the association between crime rate and per capita income is modified by percent bachelor’s degrees status.

Model Construction

By adding interative terms to the model built with selected variables, we resulted in the following three models:

Model 1 was built using the following terms: pop, pop18, bagrad, poverty, unemp, pcincome, pcincome*pop, region, beds_rate_1000, density.

Model 2 was built using the following terms: pop, pop18, bagrad, poverty, unemp, pcincome, pcincome*pop, region, beds_rate_1000, poverty*pcincome

Model 3 was built using the following terms: pop, pop18, bagrad, poverty, unemp, pcincome, pcincome*pop, region, beds_rate_1000, density, pcincome*bagrad

Model Diagnosis and Transformation

Using Variable Inflation Factor(VIF), we determined if each model had a high collinearity. The above 3 models all have passed the collinearity test.

We then drew diagnostic plots for each model to see how the residuals behaved. In all three models, residuals followed a normal distribution (μ = 0) with no influential points.

In addition, we drew boxcox plots to see if each model needed transformation. The peak of all three boxcox plots fell between 0.5 and

As such, we tried $\sqrt{y}$ transformation for each model. Since the residuals were more unevenly distributed in all three transformed models, we kept the untransformed model. Detailed plots can be seen in main.Rmd.

Here is a summary table of for our three models:

Cross Validation

We performed cross validation on each model and got the RMSE average. The results were shown in the table below:

Model Assessment

We performed cross validation on each model and got the RMSE average and R-square values. We further assessed the models in the testing set by RMSPE:

Model	R Square	RMSE	RMSPE
1	0.534	16.60	12.06
2	0.561	16.20	12.32
3	0.542	16.46	11.90

Conclusion and Discussion

According to table 5, evaluated in terms of accuracy in estimating the training set, model 2 has the best performance with its leading R square and the smallest RMSE, followed by model 3 and model 1. In terms of predicting values outside the training set, model 3 has the best performance with the smallest RMSPE on the testing set. Such an inconsistency of model performance may be explained by model overfitting of the training data. Above all, although model 2 have achieved a good estimation of the training set, we will choose model 3 as the final model, given its fair RMSE and R-square values, plus excellent testing set performance. By using model 3, we can predict the crime rate by the following parameters.

Overall, our project has several strengths. First, we did feature engineering before training the model by transforming variables using our domain knowledge to the ones more relevant to the predicted variable. For example, while area and population are not directly related to the crime rate, population density (population/area) can be more relevant. Second, we did analysis of correlation and collinearity to reduce potential bias. Third, we did interactive analysis and involved multiple interactive terms in our model, which to some extents represented the possible non-linear relations between the parameters and the predicted value. Finally, we separated a testing set from the dataset at the very beginning, on which we evaluated the performance of several models, addressing the potential predicting errors caused by model overfitting.

Meanwhile, the project also has its limitations. Essentially, we identified several wrong data points in the original dataset: for example, the population of Los Angles, an outstandingly large number, is not consistent with the number found on Wikipedia. Given that some of the data are mistaken, further works can be done on correcting the dataset using external data sources. Furthermore, our regression model considered only linear and interactive terms, while some parameters could be better fitted using polynomials or exponentials. More sophisticated models can be used in the future to better estimate the data.

Reference and Documentation

[1] Committee on Law and Justice, et al. Understanding Crime Trends: Workshop Report. Edited by Arthur S. Goldberger and Richard Rosenfeld, National Academies Press, 2009. Accessed 11 December 2021.

[2] Rosenfeld, R., Vogel, M. & McCuddy, T. Crime and Inflation in U. S. Cities. J Quant Criminol 35, 195–210 (2019). https://doi.org/10.1007/s10940-018-9377-x

[3] U.S. Department of Commerce Economics and Statistics Administration, Bureau of the Census. Current Population Reports. Poverty in the United States: 1990. Series P-60, No.175

[4] Bureau of the Census. We asked… You told us. Census Questionnaire Content, 1990 CQC-13

[5] Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Vol. 1. No. 10. New York: Springer series in statistics, 2001.

We also referenced several online materials during our discussion:

Question: Do we still need a test set when using k-fold cross-validation? Source: https://stats.stackexchange.com/questions/225949/do-we-need-a-test-set-when-using-k-fold-cross-validation https://datascience.stackexchange.com/questions/80310/is-a-test-set-necessary-after-cross-validation-on-training-set
Question: How to achieve build test set & predict Source: https://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/ https://campus.datacamp.com/courses/machine-learning-with-caret-in-r/regression-models-fitting-them-and-evaluating-their-performance?ex=8
Question: How to evaluate continuous by continuous interactions Source: Continuous by Continuous Interactions, Joel S Steele http://web.pdx.edu/~joel8/resources/ConceptualPresentationResources/ContinuousByContinousInteractions_walkthrough_v2.pdf

Name		Name	Last commit message	Last commit date
Latest commit History 205 Commits
Lyu-Jing_files/figure-gfm		Lyu-Jing_files/figure-gfm
MingkuanXu_files/figure-gfm		MingkuanXu_files/figure-gfm
YiqunJin_files/figure-gfm		YiqunJin_files/figure-gfm
Yushan_Wang_files/figure-gfm		Yushan_Wang_files/figure-gfm
data		data
image		image
main_files/figure-gfm		main_files/figure-gfm
report_files/figure-gfm		report_files/figure-gfm
report_github_files/figure-gfm		report_github_files/figure-gfm
.gitignore		.gitignore
Lyu Jing.Rmd		Lyu Jing.Rmd
Lyu-Jing.md		Lyu-Jing.md
MingkuanXu.Rmd		MingkuanXu.Rmd
MingkuanXu.md		MingkuanXu.md
README.md		README.md
YiqunJin.Rmd		YiqunJin.Rmd
YiqunJin.md		YiqunJin.md
Yushan_Wang.Rmd		Yushan_Wang.Rmd
Yushan_Wang.md		Yushan_Wang.md
Yushan_clean.Rmd		Yushan_clean.Rmd
final-project.Rproj		final-project.Rproj
main.Rmd		main.Rmd
main.html		main.html
main.log		main.log
main.md		main.md
main.pdf		main.pdf
main.tex		main.tex
mengfanluo.Rmd		mengfanluo.Rmd
mengfanluo.html		mengfanluo.html
mengfanluo_copy.Rmd		mengfanluo_copy.Rmd
p8130_final_report.pdf		p8130_final_report.pdf
report.Rmd		report.Rmd
report.md		report.md
report_github.Rmd		report_github.Rmd
report_github.log		report_github.log
report_github.md		report_github.md
report_github.pdf		report_github.pdf
report_v2.Rmd		report_v2.Rmd
report_v2.log		report_v2.log
report_v2.pdf		report_v2.pdf
summary_model.Rmd		summary_model.Rmd
table.html		table.html
table4.png		table4.png
temp.html		temp.html
test model.Rmd		test model.Rmd
test.html		test.html
test_legaldesc.pdf		test_legaldesc.pdf
webshot.png		webshot.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A Linear Regression Model of US Crime Rate Prediction

Abstract

Introduction

Methods

Data Description

Statistical Methods

Results

Descriptive Analysis

Traning/Testing Split

Data Cleaning

Variables Selection

Interaction Analysis

Model Construction

Model Diagnosis and Transformation

Cross Validation

Model Assessment

Conclusion and Discussion

Reference and Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

A Linear Regression Model of US Crime Rate Prediction

Abstract

Introduction

Methods

Data Description

Statistical Methods

Results

Descriptive Analysis

Traning/Testing Split

Data Cleaning

Variables Selection

Interaction Analysis

Model Construction

Model Diagnosis and Transformation

Cross Validation

Model Assessment

Conclusion and Discussion

Reference and Documentation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages