A semester team project for the Advanced Database Topics course of the School of Electrical & Computer Engineering at the National Technical University of Athens. In this project we explore big data querying techniques using the LA Crimes dataset (2010 to present). Using Apache Spark and HDFS, we set up a cluster environment to run our data analysis jobs on our own personal cluster.
To replicate our project on your own cluster, you need the following already set up:
- ~okeanos-knossos VM resources or other personal VMs
- Apache Spark on your nodes
- HDFS as the distributed file system of your cluster
- Python 3.10 or later
- Java (OpenJDK 11 or later)
Under the query folders you can find the code for each implemented query. More information about each query can be found under the Docs folder.
Before using the scripts to run the queries, you first have to store the datasets in HDFS. To do that, you can run the following commands, which download the crime data and put it directly into your HDFS:
Important! Replace "okeanos-master:54310" with the address and port of your own HDFS namenode.
$ curl "https://data.lacity.org/api/views/63jg-8b9z/rows.csv?accessType=DOWNLOAD" | hadoop fs -put - hdfs://okeanos-master:54310/your/path/crime-data-from-2010-to-2019.csv
$ curl "https://data.lacity.org/api/views/2nrs-mtv8/rows.csv?accessType=DOWNLOAD" | hadoop fs -put - hdfs://okeanos-master:54310/your/path/crime-data-from-2020-to-present.csv
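To confirm the upload worked, you can do a quick sanity check from a Spark shell or script. This is only a minimal sketch; the namenode address and paths below are the example placeholders and should match your own setup:

# Sanity check: make sure Spark can read the uploaded CSVs (example paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify-upload").getOrCreate()

for path in [
    "hdfs://okeanos-master:54310/your/path/crime-data-from-2010-to-2019.csv",
    "hdfs://okeanos-master:54310/your/path/crime-data-from-2020-to-present.csv",
]:
    df = spark.read.csv(path, header=True)
    print(path, "->", df.count(), "rows")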
You can store the data wherever you want inside your HDFS, but remember to also update the corresponding input paths in /utils/import_data:
# Contents of /utils/import_data
37.
38. crime_df1 = spark.read.csv("hdfs://okeanos-master:54310/your/path/crime-data-from-2010-to-2019.csv", header=True, schema=crime_schema)
39. crime_df2 = spark.read.csv("hdfs://okeanos-master:54310/your/path/crime-data-from-2020-to-present.csv", header=True, schema=crime_schema)
40.
...
55. crime_rdd1 = spark.sparkContext.textFile("hdfs://okeanos-master:54310/your/path/crime-data-from-2010-to-2019.csv").map(lambda x: parse_csv(x))
...
60. crime_rdd2 = spark.sparkContext.textFile("hdfs://okeanos-master:54310/your/path/crime-data-from-2020-to-present.csv").map(lambda x: parse_csv(x))
...
78. income_df = spark.read.csv("hdfs://okeanos-master:54310/your/path/LA_income_2015.csv", header=True, schema=income_schema)
...
91. revgeocoding_df = spark.read.csv("hdfs://okeanos-master:54310/your/path/revgecoding.csv", header=True, schema=revgeocoding_schema)
...
107. police_stations_df = spark.read.csv("hdfs://okeanos-master:54310/your/path/LAPD_Police_Stations.csv", header=True, schema=police_stations_schema)
108.
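The RDD lines above rely on a parse_csv helper defined elsewhere in /utils/import_data. If you are adapting the code, a minimal sketch of what such a helper could look like is shown below (hypothetical; the actual helper in the repo may differ, e.g. in how it treats the header row):

import csv
from io import StringIO

def parse_csv(line):
    # Split one raw CSV line into a list of fields, respecting quoted
    # values that may themselves contain commas (as in the crime dataset).
    return next(csv.reader(StringIO(line)))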
For query 4 you also have to install the geopy package:
$ pip install geopy
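As a short illustration of the geodesic-distance API that query 4 presumably relies on (the coordinates below are hypothetical examples, not values from the queries):

from geopy.distance import geodesic

# Hypothetical example: distance in km between a crime location and a
# police station, both given as (latitude, longitude) pairs.
crime_location = (34.0522, -118.2437)
station_location = (34.0440, -118.2673)
print(round(geodesic(crime_location, station_location).km, 2), "km")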
To run one of the queries, execute its script like this:
$ ./scripts/run_query1.sh API_TYPE MODE
Usage: ./scripts/run_query1.sh <API_TYPE> <MODE>
API_TYPE: The API implementation to run (DF or SQL), as used in the filename of the query implementation.
MODE: Specify the mode in which the job will be run (client or cluster).
So, to run query 1 with the DataFrame implementation in cluster mode, just run:
$ ./scripts/run_query1.sh DF cluster
Some queries also offer an RDD implementation (this is noted in that script's Usage message), and some allow configuring the executors used. Adjust the script invocation accordingly to use those features.
In client mode the query output is printed directly to the console. In cluster mode, on the other hand, the output is saved to a file in HDFS. To make a local copy, you can run the command below:
$ hadoop fs -getmerge hdfs://okeanos-master:54310/query-output-name ./desired_filename.txt
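Alternatively, you can inspect a cluster-mode result without copying it locally by reading it back through Spark. This is only a sketch; the output path below is the same example placeholder used in the command above:

# Print the first lines of a cluster-mode query output stored in HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-output").getOrCreate()
for line in spark.sparkContext.textFile("hdfs://okeanos-master:54310/query-output-name").take(20):
    print(line)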