Skip to content

Project-Assignment for Advanced DB Topics course at ECE NTUA

License

Notifications You must be signed in to change notification settings

emsquared2/Advanced-DB-NTUA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LA Crimes - Advanced DB Course Project NTUA

Overview

A semester team project for the Advanced Database Topics course of the Electrical & Computer Engineering class of the National Technical University of Athens. In this project we explore big data querying techniques using the LA Crimes Dataset from 2010 to present. Using Apache Spark and HDFS we set up a cluster environment to run our data analysis jobs in our own personal cluster.

Prerequisites

To replicate our project in your own cluster you need to already have established the following tools:

  • ~okeanos-knossos VM resources or other personal VMs
  • Apache Spark on your nodes
  • HDFS for a distributed file system in your cluster
  • Python - 3.10 or later
  • Java - openjdk 11 or later

Usage

Under the query folders you can find the code for each query implemented. For more information about each query you can find under the Docs folder.

Before using the scripts to run each query with the specified characteristics firstly you have to save the datasets in the HDFS. To do that you can run the following command which downloads the crime data directly to your HDFS:

Important! Change the "okeanos-master:54310" to your HDFS ip address

$ curl https://data.lacity.org/api/views/63jg-8b9z/rows.csv?accessType=DOWNLOAD | hadoop fs -put - hdfs://okeanos-master:54310/your/path/crime-data-from-2010-to-2019.csv
$ curl https://data.lacity.org/api/views/2nrs-mtv8/rows.csv?accessType=DOWNLOAD | hadoop fs -put - hdfs://okeanos-master:54310/your/path/crime-data-from-2020-to-present.csv

You can save the data wherever you want inside your HDFS but remember to also change the include path as follows:

# Contents of /utils/import_data

37.
38. crime_df1 = spark.read.csv("hdfs://okeanos-master:54310/your/path/crime-data-from-2010-to-2019.csv", header=True, schema=crime_schema)  
39. crime_df2 = spark.read.csv("hdfs://okeanos-master:54310/your/path/crime-data-from-2020-to-present.csv", header=True, schema=crime_schema)
40.
...
55. crime_rdd1 = spark.textFile("hdfs://okeanos-master:54310/your/path/crime-data-from-2010-to-2019.csv").map(lambda x: parse_csv(x))
...
60. crime_rdd2 = spark.textFile("hdfs://okeanos-master:54310/your/path/crime-data-from-2020-to-present.csv").map(lambda x: parse_csv(x))
...
78. income_df = spark.read.csv("hdfs://okeanos-master:54310/your/path/LA_income_2015.csv", header=True, schema=income_schema)
...
91. revgeocoding_df = spark.read.csv("hdfs://okeanos-master:54310/your/path/revgecoding.csv", header=True, schema=revgeocoding_schema)
...
107. police_stations_df = spark.read.csv("hdfs://okeanos-master:54310/your/path/LAPD_Police_Stations.csv", header=True, schema=police_stations_schema)
108.

For query 4 you also have to install the geopy package:

$ pip install geopy 

To run one of the queries you run each script like this:

$ ./scripts/run_query1.sh API_TYPE MODE
Usage: ./scripts/run_query1.sh <API_TYPE> <MODE>

API_TYPE: Specify the type of API (DF, SQL) used in the filename.
MODE: Specify the mode in which the job will be run (client or cluster).

So to run query1 with a Dataframe implementation on cluster mode just run:

$ ./scripts/run_query1.sh DF cluster

Some queries offer an implementation for RDD too (it is specified on the Usage: of that script) and some offer configuration to the executors used. Change the script command accordingly to use those features too.

For client mode the query output can be seen directly in the console.

For cluster mode on the other hand the output is saved in an output file in the HDFS. To make a local copy you can run the below command:

$ hadoop fs -getmerge hdfs://okeanos-master:54310/query-output-name ./desired_filename.txt

Authors

About

Project-Assignment for Advanced DB Topics course at ECE NTUA

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published