The aim here is to see whether there are any associations between the reported aspects of street crime, such as month of year, location, crime type, etc. This will be done in PySpark due to the size of the data, but it will still be possible to execute on a local cluster.
The data can be downloaded from here: https://data.police.uk/data/.
The date range for this data is December 2010 - July 2019, and all constabularies in England & Wales were selected (we will be excluding the British Transport Police and the Police Service of Northern Ireland).
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules in a database using measures of "interestingness". This rule-based approach also generates new rules as it analyses more data; the ultimate goal, assuming a large enough dataset, is to help a machine mimic the feature-extraction and abstract-association capabilities of the human brain on new, uncategorised data.
We will be looking for rules with a high level of confidence.
Confidence is an indication of how often a rule has been found to be true, and it can be interpreted as an estimate of the conditional probability of the consequent given the antecedent.
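Formally, using the standard definitions for a rule X ⇒ Y over a set of transactions (both measures are reported by the model later in this notebook):

$$\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} = P(Y \mid X), \qquad \mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}$$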
import glob
import os
import pandas as pd
import matplotlib.pyplot as plt
import calendar
import seaborn as sns
%load_ext autoreload
%autoreload 2
Running Spark locally, using 7 of the 8 available cores.
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession
spark = SparkSession.builder\
    .master("local[7]")\
    .appName("Crime Associations")\
    .config("spark.executor.memory", "6g")\
    .config("spark.memory.fraction", 0.7)\
    .getOrCreate()
sc = spark.sparkContext
# Set up a SQL Context
sqlCtx = SQLContext(sc)
#sc.stop()
from p01_load import load_data
The police data comes as a set of CSV files, with a folder for each month-year. Within each folder there is a CSV file per constabulary. We will concatenate these into a single DataFrame.
path = glob.glob(os.getcwd() + "/all_data/*/*-street.csv")
police_data_df = load_data(file_locations=path, sqlcontext=sqlCtx)
Loading CSV files to sqlcontext...
Load Complete
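The p01_load module isn't reproduced in this post, so here is a minimal sketch of what load_data presumably does, inferred from the call above; the function body is an assumption, only the signature and the log messages come from the notebook.

# Hypothetical sketch of load_data (not the actual p01_load implementation)
def load_data_sketch(file_locations, sqlcontext):
    print('Loading CSV files to sqlcontext...')
    # DataFrameReader.csv accepts a list of paths, so every monthly
    # "*-street.csv" file is read and unioned into one DataFrame
    df = sqlcontext.read.csv(file_locations, header=True, inferSchema=True)
    print('Load Complete')
    return df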
police_data_df.select(police_data_df.columns[1:]).show()
+-------+--------------------+--------------------+---------+---------+--------------------+---------+--------------------+--------------------+---------------------+-------+
| Month| Reported by| Falls within|Longitude| Latitude| Location|LSOA code| LSOA name| Crime type|Last outcome category|Context|
+-------+--------------------+--------------------+---------+---------+--------------------+---------+--------------------+--------------------+---------------------+-------+
|2012-08|Metropolitan Poli...|Metropolitan Poli...|-0.508053|50.809718|On or near Claigm...|E01031464| Arun 007F| Violent crime| Under investigation| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| -1.01393|51.899297|On or near St Mic...|E01017673| Aylesbury Vale 010C| Other crime| Under investigation| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.964612|52.045416|On or near Barnes...|E01029896| Babergh 004E| Violent crime| Under investigation| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.140634|51.583427|On or near Rams G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.140634|51.583427|On or near Rams G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.145888|51.593835|On or near Provid...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.141143|51.590873|On or near Furze ...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.140634|51.583427|On or near Rams G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.140634|51.583427|On or near Rams G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.140035|51.589112|On or near Beansl...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.137065|51.583672|On or near Police...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.137065|51.583672|On or near Police...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.135866|51.587336|On or near Gibbfi...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...| null| null|
+-------+--------------------+--------------------+---------+---------+--------------------+---------+--------------------+--------------------+---------------------+-------+
only showing top 20 rows
Each dataset contains the following columns:
police_data_df.printSchema()
root
|-- Crime ID: string (nullable = true)
|-- Month: string (nullable = true)
|-- Reported by: string (nullable = true)
|-- Falls within: string (nullable = true)
|-- Longitude: double (nullable = true)
|-- Latitude: double (nullable = true)
|-- Location: string (nullable = true)
|-- LSOA code: string (nullable = true)
|-- LSOA name: string (nullable = true)
|-- Crime type: string (nullable = true)
|-- Last outcome category: string (nullable = true)
|-- Context: string (nullable = true)
The data dictionary is as follows:
dictionary = pd.read_csv('data_dictionary.csv')
pd.set_option('display.max_colwidth', None)  # show full column contents
for elem in dictionary.to_records(index=False):
    print(elem[0] + ": " + elem[1])
Reported by: The force that provided the data about the crime.
Falls within: At present, also the force that provided the data about the crime. This is currently being looked into and is likely to change in the near future.
Longitude and Latitude: The anonymised coordinates of the crime. See Location Anonymisation for more information.
LSOA code and LSOA name: References to the Lower Layer Super Output Area that the anonymised point falls into, according to the LSOA boundaries provided by the Office for National Statistics.
Crime type: One of the crime types listed in the Police.UK FAQ.
Last outcome category: A reference to whichever of the outcomes associated with the crime occurred most recently. For example, this crime's 'Last outcome category' would be 'Formal action is not in the public interest'.
Context: A field provided for forces to provide additional human-readable data about individual crimes. Currently, for newly added CSVs, this is always empty.
NOTE: LSOA (Lower Layer Super Output Area)
From NHS Data Dictionary (https://www.datadictionary.nhs.uk/data_dictionary/nhs_business_definitions/l/lower_layer_super_output_area_de.asp?shownav=1)
"A Lower Layer Super Output Area (LSOA) is a GEOGRAPHIC AREA. Lower Layer Super Output Areas are a geographic hierarchy designed to improve the reporting of small area statistics in England and Wales. Lower Layer Super Output Areas are built from groups of contiguous Output Areas and have been automatically generated to be as consistent in population size as possible, and typically contain from four to six Output Areas. The Minimum population is 1000 and the mean is 1500. There is a Lower Layer Super Output Area for each POSTCODE in England and Wales"
How many rows do we have?
num_rows = police_data_df.count()
num_rows
52835178
from p02_clean import clean_months, clean_location, clean_non_england
# The Month column in the data is actually Year-Month; here we split it on the "-" delimiter and create Year and Month_of_Year columns
police_data_clean = clean_months(police_data_df)
# Now let's create Location and Town/City columns
police_data_clean = clean_location(police_data_clean)
police_data_clean = clean_non_england(police_data_clean)
Cleaning Year and Month Columns...
Creating Month_of_Year and Year columns...
Cleaning Complete
Cleaning Location and Town and City...
Cleaning Complete
Removing non England and Wales entries
Removal Complete
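The p02_clean helpers aren't shown either. Below is a rough sketch of the kind of transformations they presumably apply, based on the cleaned schema that follows and the item sets shown later; the exact parsing logic (and any extra tidying of the Location column) is an assumption.

import calendar
import pyspark.sql.functions as F

# Hypothetical sketches, not the actual p02_clean implementations
def clean_months_sketch(df):
    # "2012-08" -> Year = 2012, Month_of_Year = "Aug"
    month_abbr = F.udf(lambda m: calendar.month_abbr[int(m)])
    return (df
            .withColumn("Year", F.split("Month", "-").getItem(0).cast("int"))
            .withColumn("Month_of_Year", month_abbr(F.split("Month", "-").getItem(1))))

def clean_location_sketch(df):
    # "LSOA name" values look like "Barking and Dagenham 016A";
    # stripping the trailing LSOA code leaves a Town/City-style name
    return df.withColumn("Town_City", F.regexp_replace("LSOA name", r"\s\d{3}[A-Z]$", ""))

def clean_non_england_sketch(df):
    # English LSOA codes start with "E01" and Welsh ones with "W01"
    return df.filter(F.col("LSOA code").rlike("^(E|W)01"))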
police_data_clean.printSchema()
root
|-- Crime ID: string (nullable = true)
|-- Month: string (nullable = true)
|-- Reported by: string (nullable = true)
|-- Falls within: string (nullable = true)
|-- Longitude: double (nullable = true)
|-- Latitude: double (nullable = true)
|-- Location: string (nullable = true)
|-- LSOA code: string (nullable = true)
|-- LSOA name: string (nullable = true)
|-- Crime type: string (nullable = true)
|-- Last outcome category: string (nullable = true)
|-- Context: string (nullable = true)
|-- Year: integer (nullable = true)
|-- Month_of_Year: string (nullable = true)
|-- Town_City: string (nullable = true)
import pyspark.sql.functions as F
import p03_eda as eda
Are there any cases where the constabulary that reported the crime is different from the constabulary area it falls within?
police_data_clean\
    .where(F.col("Reported by") != F.col("Falls within"))\
    .count()
0
plt.figure(figsize=(20, 5))
crime_over_time_plot = eda.plot_crime_time_series(police_data_clean, read=True)
plt.show()
Setting Month to a categorical variable...
Setting complete
Converting to Series object...
Conversion complete
Creating plot object
Complete.. Plotting...
The number of reported crimes appears roughly stationary with some periodicity, although we do not have complete years at either end of this dataset. It looks like there is a pattern to the level of street crime reported each month!
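The figure itself isn't reproduced here. For reference, this is a sketch of the sort of aggregation plot_crime_time_series presumably performs; the real helper lives in p03_eda.py and its exact implementation is an assumption.

# Hypothetical sketch of the monthly time-series aggregation
monthly_counts = (police_data_clean
                  .groupBy("Month")
                  .count()
                  .toPandas()
                  .sort_values("Month")
                  .set_index("Month"))
monthly_counts["count"].plot(figsize=(20, 5))
plt.ylabel("Reported street crimes")
plt.show()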
plt.figure(figsize=(10, 10))
outcome_category_plot = eda.plot_crime_type_and_category_counts(police_data_clean, read=True)
plt.show()
Converting to Series
Collapsing Multi Index
Plotting...
It seems the most common crime type to outcome association is an anti-social behaviour incident with no recorded outcome.
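Again the figure isn't shown, but the underlying aggregation is straightforward. A hedged sketch of what plot_crime_type_and_category_counts presumably counts (the helper itself is not listed here):

# Hypothetical sketch of the crime type vs. last outcome aggregation
type_outcome = (police_data_clean
                .groupBy("Crime type", "Last outcome category")
                .count()
                .orderBy(F.desc("count"))
                .limit(20)
                .toPandas())
type_outcome["pair"] = (type_outcome["Crime type"] + " / "
                        + type_outcome["Last outcome category"].fillna("No outcome recorded"))
type_outcome.set_index("pair")["count"].plot(kind="barh", figsize=(10, 10))
plt.show()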
plt.figure(figsize=(10, 5))
crime_type_plot = eda.plot_crime_counts(police_data_clean, read=True)
plt.show()
Converting to Series object...
Plotting...
Anti-social behaviour makes up about 35% of recorded street crime in England and Wales, which is perhaps to be expected... It is concerning that violence and sexual offences is the second most common category.
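The 35% figure can be checked directly on the Spark DataFrame; a quick, hypothetical verification (not part of the original notebook):

# Hypothetical check of each crime type's share of all cleaned records
total = police_data_clean.count()
(police_data_clean
 .groupBy("Crime type")
 .count()
 .withColumn("share_pct", F.round(F.col("count") * 100.0 / total, 1))
 .orderBy(F.desc("count"))
 .show(5, truncate=False))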
plt.figure(figsize=(10,5))
crime_town_city_counts = eda.plot_crime_town_city_counts(police_data_clean, read=True)
plt.xticks(rotation=90)
plt.show()
Converting to Series
Plotting...
Let's look at what the feature engineering code is actually doing.
%%bash
cat p04_feature_engineer.py
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

clear_duplicates = F.udf(lambda x: list(set(x)), ArrayType(StringType()))


def get_modelling_data(df):
    select_cols = ["Falls within", "Town_City", "Crime type", "Last outcome category", "Month_of_Year"]
    # Remove the crimes with no Crime ID and no Last outcome category,
    # then select the features of interest
    print('Filtering data with no Crime ID and no outcome category..')
    police_data_modelling = df\
        .filter(df["Crime ID"].isNotNull() & df["Last outcome category"].isNotNull())\
        .select(select_cols)
    print('Filtering complete')
    return police_data_modelling


def make_item_sets(df):
    # The FP-Growth algorithm (like association rules) needs the items to be
    # concatenated into a list/array of "transactions".
    print('Making item sets...')
    print('Collapsing data to list of transactions')
    police_item_set = df.withColumn("items_temp", F.array(df["Falls within"],
                                                          df["Town_City"],
                                                          df["Crime type"],
                                                          df["Last outcome category"],
                                                          df["Month_of_Year"]))
    police_item_set = police_item_set.withColumn("items", clear_duplicates(police_item_set["items_temp"]))
    # Select the items column and add an id
    print('Adding increasing id column...')
    police_item_set = police_item_set\
        .select("items")\
        .withColumn("id", F.monotonically_increasing_id())
    print('Itemset creation complete')
    return police_item_set


def feature_engineer(df):
    """Invoke the full feature engineering pipeline"""
    print('Starting Feature Engineering pipeline...')
    selected_data = get_modelling_data(df)
    item_sets = make_item_sets(selected_data)
    print('Feature Engineering complete')
    return item_sets
from p04_feature_engineer import *
# Remove the crimes with no Crime ID and no outcome category, then build the item sets
police_item_set = feature_engineer(police_data_clean)
Starting Feature Engineering pipeline...
Filtering data with no Crime ID and no outcome category..
Filtering complete
Making item sets...
Collapsing data to list of transactions
Adding increasing id column...
Itemset creation complete
Feature Engineering complete
The FP-Growth algorithm (like other association rule miners) needs the items for each record to be collected into a list/array of "transactions".
police_item_set.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------+
|items |
+----------------------------------------------------------------------------------------------------------------------------------+
|[Metropolitan Police Service, Aug, Arun, Violent crime, Under investigation] |
|[Other crime, Metropolitan Police Service, Aug, Aylesbury Vale, Under investigation] |
|[Metropolitan Police Service, Aug, Violent crime, Under investigation, Babergh] |
|[Burglary, Barking and Dagenham, Metropolitan Police Service, Aug, Under investigation] |
|[Barking and Dagenham, Metropolitan Police Service, Investigation complete; no suspect identified, Aug, Criminal damage and arson]|
|[Offender given a drugs possession warning, Drugs, Barking and Dagenham, Metropolitan Police Service, Aug] |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Vehicle crime, Under investigation] |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Vehicle crime, Under investigation] |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Vehicle crime, Under investigation] |
|[Barking and Dagenham, Metropolitan Police Service, Investigation complete; no suspect identified, Aug, Violent crime] |
|[Offender sent to prison, Barking and Dagenham, Metropolitan Police Service, Aug, Violent crime] |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Violent crime, Under investigation] |
|[Offender given a caution, Barking and Dagenham, Metropolitan Police Service, Aug, Violent crime] |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Violent crime, Under investigation] |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Violent crime, Under investigation] |
|[Burglary, Barking and Dagenham, Metropolitan Police Service, Aug, Under investigation] |
|[Burglary, Barking and Dagenham, Metropolitan Police Service, Aug, Under investigation] |
|[Burglary, Barking and Dagenham, Metropolitan Police Service, Aug, Under investigation] |
|[Burglary, Barking and Dagenham, Metropolitan Police Service, Aug, Under investigation] |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Vehicle crime, Under investigation] |
+----------------------------------------------------------------------------------------------------------------------------------+
only showing top 20 rows
Now for the association rules.
from p05_model import build_association_rule_model, extract_model_rules
# Use a low support as we have a large dataset
model = build_association_rule_model(police_item_set, min_support=0.01, min_confidence=0.6)
Fitting FPGrowth....
Fit Complete
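The p05_model module isn't listed in this post. Under the assumption that build_association_rule_model is a thin wrapper around Spark ML's FP-Growth implementation, it presumably looks something like the sketch below (the function body is hypothetical):

from pyspark.ml.fpm import FPGrowth

# Hypothetical sketch of build_association_rule_model
def build_association_rule_model_sketch(item_sets, min_support, min_confidence):
    print('Fitting FPGrowth....')
    fp_growth = FPGrowth(itemsCol="items",
                         minSupport=min_support,
                         minConfidence=min_confidence)
    model = fp_growth.fit(item_sets)
    print('Fit Complete')
    return model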
rules_df_pd = extract_model_rules(model)
Extracting Rules...
Rule extraction complete
Collecting Rules to Pandas...
Collection Complete...
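Likewise, extract_model_rules presumably pulls the mined rules out of the fitted FPGrowthModel's associationRules DataFrame, flattens the item arrays to comma-separated strings (matching the antecedent values shown below) and collects the result to pandas. A hedged sketch:

# Hypothetical sketch of extract_model_rules
def extract_model_rules_sketch(model):
    rules = (model.associationRules
             .withColumn("antecedent", F.concat_ws(",", F.col("antecedent")))
             .withColumn("consequent", F.concat_ws(",", F.col("consequent")))
             .select("antecedent", "consequent", "confidence", "lift")
             .orderBy("confidence"))
    return rules.toPandas()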
rules_df_pd
| | antecedent | consequent | confidence | lift |
|---|---|---|---|---|
| 0 | Theft from the person | Investigation complete; no suspect identified | 0.601021 | 1.274382 |
| 1 | Greater Manchester Police | Investigation complete; no suspect identified | 0.603717 | 1.280098 |
| 2 | West Midlands Police | Investigation complete; no suspect identified | 0.606495 | 1.285988 |
| 3 | Birmingham | Investigation complete; no suspect identified | 0.607874 | 1.288914 |
| 4 | Birmingham,West Midlands Police | Investigation complete; no suspect identified | 0.608291 | 1.289796 |
| 5 | Unable to prosecute suspect | Violence and sexual offences | 0.621860 | 2.453091 |
| 6 | Criminal damage and arson | Investigation complete; no suspect identified | 0.641662 | 1.360556 |
| 7 | Manchester | Investigation complete; no suspect identified | 0.650860 | 1.380060 |
| 8 | Manchester,Greater Manchester Police | Investigation complete; no suspect identified | 0.651027 | 1.380413 |
| 9 | Other theft | Investigation complete; no suspect identified | 0.666711 | 1.413669 |
| 10 | Vehicle crime | Investigation complete; no suspect identified | 0.692018 | 1.467329 |
| 11 | Burglary | Investigation complete; no suspect identified | 0.711254 | 1.508117 |
| 12 | Bicycle theft | Investigation complete; no suspect identified | 0.719270 | 1.525113 |
| 13 | Birmingham | West Midlands Police | 0.998434 | 20.240328 |
| 14 | Birmingham,Investigation complete; no suspect ... | West Midlands Police | 0.999117 | 20.254189 |
| 15 | Sheffield | South Yorkshire Police | 0.999125 | 35.791218 |
| 16 | Leeds | West Yorkshire Police | 0.999156 | 18.917528 |
| 17 | Westminster | Metropolitan Police Service | 0.999549 | 5.338211 |
| 18 | Bradford | West Yorkshire Police | 0.999584 | 18.925635 |
| 19 | Liverpool | Merseyside Police | 0.999654 | 38.381717 |
| 20 | Manchester | Greater Manchester Police | 0.999672 | 16.812479 |
| 21 | Bristol | Avon and Somerset Constabulary | 0.999688 | 34.073433 |
| 22 | Manchester,Investigation complete; no suspect ... | Greater Manchester Police | 0.999928 | 16.816784 |
rules_df_pd.to_csv('crime_associations.csv')
# Stop the Spark Session
sc.stop()
As you can see, the rules in the 98%+ confidence region are rules that don't really tell us anything new, e.g. (Birmingham -> West Midlands Police): a city trivially implies the constabulary that covers it. Let's remove those from the analysis.
useful_rules_df = rules_df_pd[rules_df_pd['confidence'] < 0.98]\
    .sort_values(by="confidence", ascending=False)
useful_rules_df
| | antecedent | consequent | confidence | lift |
|---|---|---|---|---|
| 12 | Bicycle theft | Investigation complete; no suspect identified | 0.719270 | 1.525113 |
| 11 | Burglary | Investigation complete; no suspect identified | 0.711254 | 1.508117 |
| 10 | Vehicle crime | Investigation complete; no suspect identified | 0.692018 | 1.467329 |
| 9 | Other theft | Investigation complete; no suspect identified | 0.666711 | 1.413669 |
| 8 | Manchester,Greater Manchester Police | Investigation complete; no suspect identified | 0.651027 | 1.380413 |
| 7 | Manchester | Investigation complete; no suspect identified | 0.650860 | 1.380060 |
| 6 | Criminal damage and arson | Investigation complete; no suspect identified | 0.641662 | 1.360556 |
| 5 | Unable to prosecute suspect | Violence and sexual offences | 0.621860 | 2.453091 |
| 4 | Birmingham,West Midlands Police | Investigation complete; no suspect identified | 0.608291 | 1.289796 |
| 3 | Birmingham | Investigation complete; no suspect identified | 0.607874 | 1.288914 |
| 2 | West Midlands Police | Investigation complete; no suspect identified | 0.606495 | 1.285988 |
| 1 | Greater Manchester Police | Investigation complete; no suspect identified | 0.603717 | 1.280098 |
| 0 | Theft from the person | Investigation complete; no suspect identified | 0.601021 | 1.274382 |
Now the rule with the highest confidence is (Bicycle theft -> Investigation complete; no suspect identified). What does this mean? Given that a crime is a bicycle theft, the probability that the investigation will be closed with no suspect identified is around 72%.
The next three rules in the 65%+ confidence (conditional probability) region follow a similar pattern:
- (Other theft -> Investigation complete; no suspect identified)
- (Burglary -> Investigation complete; no suspect identified)
- (Vehicle Crime -> Investigation complete; no suspect identified)
So the implied probability of no suspect being identified after an incident of other theft, burglary or vehicle crime is roughly 67-71%, with criminal damage and arson close behind at around 64%.
Another interesting rule is (Manchester -> Investigation complete; no suspect identified): the model estimates that the probability that a reported crime ends with a completed investigation and no suspect identified, given that the crime occurred in Manchester, is around 65%.
Another notable rule is (Unable to prosecute suspect -> Violence and sexual offences), which sounds worrying but doesn't really say much on its own. The conditional probability of a crime being a violence and sexual offence, given that the police were unable to prosecute the suspect, is around 62%.
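As a sanity check (hypothetical, and it would need to run before the Spark session is stopped above), a rule's confidence can be recomputed as a conditional probability straight from the filtered data:

# Hypothetical check: P(no suspect identified | bicycle theft) from the raw data,
# using the same filter as get_modelling_data; expected to be close to 0.719
bike = (police_data_clean
        .filter(F.col("Crime ID").isNotNull() & F.col("Last outcome category").isNotNull())
        .filter(F.col("Crime type") == "Bicycle theft"))
p_no_suspect = (bike.filter(F.col("Last outcome category") == "Investigation complete; no suspect identified").count()
                / bike.count())
print("P(no suspect identified | Bicycle theft) = {:.3f}".format(p_no_suspect))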