A simple example of JDBC and Apache Hive integration in Apache Spark.
Save relevant information for each delayed flight. A flight is considered delayed if the delay is greater than 15 minutes.
In particular, the following data must be saved:
- tail number (i.e. the civil registration or military serial number)
- aircraft type
- construction year of the aircraft
- flight time (i.e. how long the flight lasted)
- delay
- the ratio of delay to flight time
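
The delay rule and the ratio listed above can be sketched in plain Scala, independently of Spark. This is only an illustration: the `FlightRecord` and `DelayedFlight` case classes, their field names, and the `DelayRule` object are hypothetical, and delay and flight time are assumed to be expressed in minutes.

```scala
// Hypothetical record types; field names are illustrative, not the project's schema.
final case class FlightRecord(
  tailNumber: String,
  aircraftType: String,
  constructionYear: Int,
  flightTimeMinutes: Int,
  delayMinutes: Int
)

final case class DelayedFlight(
  tailNumber: String,
  aircraftType: String,
  constructionYear: Int,
  flightTimeMinutes: Int,
  delayMinutes: Int,
  delayRatio: Double
)

object DelayRule {
  // Threshold from the text: a flight is delayed if the delay exceeds 15 minutes.
  val DelayThresholdMinutes = 15

  def isDelayed(f: FlightRecord): Boolean =
    f.delayMinutes > DelayThresholdMinutes

  // Keep only delayed flights and attach the ratio of delay to flight time.
  // Assumption: flights with a non-positive flight time are skipped,
  // so the division is always well defined.
  def delayedFlights(flights: Seq[FlightRecord]): Seq[DelayedFlight] =
    flights.collect {
      case f if isDelayed(f) && f.flightTimeMinutes > 0 =>
        DelayedFlight(
          f.tailNumber,
          f.aircraftType,
          f.constructionYear,
          f.flightTimeMinutes,
          f.delayMinutes,
          f.delayMinutes.toDouble / f.flightTimeMinutes
        )
    }
}
```

Note that a delay of exactly 15 minutes does not count as delayed under this reading of the rule.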
In order to get all the required data, two datasets should be used:
- the Flight dataset
- the Plane dataset
Yet, these two datasets reside on two different systems:
- the Flight dataset is contained in a structured file loaded into a Hive table
- the Plane dataset is contained in a Relational Database
We need Apache Spark to load both datasets from their respective systems so that the ensuing query can access the data as if it all resided in a single system. Once we have the result, we save it in the Relational Database.
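
A minimal sketch of this flow is given below. It is not the project's actual code: the table names (`flight`, `plane`, `delayed_flight`), all column names, and the `DelayedFlightsSketch` object are assumptions made for illustration, and the JDBC URL and connection properties are left to the caller. It assumes that `spark-sql` and `spark-hive` are on the classpath and that the `SparkSession` was built with `enableHiveSupport()`.

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

object DelayedFlightsSketch {

  // spark:   a session created with .enableHiveSupport()
  // jdbcUrl: JDBC URL of the relational database (supplied by the caller)
  // props:   JDBC connection properties (driver, user, password, ...)
  def run(spark: SparkSession, jdbcUrl: String, props: Properties): Unit = {
    // Expose the Plane table from the relational database as a temporary view.
    // "plane" is a hypothetical table name.
    spark.read
      .jdbc(jdbcUrl, "plane", props)
      .createOrReplaceTempView("plane_view")

    // "flight" is a hypothetical Hive table; Spark resolves it through the
    // Hive catalog, so one query can join it with the JDBC-backed view.
    // All column names below are hypothetical.
    val delayed = spark.sql(
      """SELECT f.tail_num,
        |       p.aircraft_type,
        |       p.construction_year,
        |       f.air_time,
        |       f.arr_delay,
        |       f.arr_delay / f.air_time AS delay_ratio
        |FROM flight f
        |JOIN plane_view p ON f.tail_num = p.tail_num
        |WHERE f.arr_delay > 15""".stripMargin)

    // Save the result back to the relational database.
    // "delayed_flight" is a hypothetical target table.
    delayed.write.mode(SaveMode.Append).jdbc(jdbcUrl, "delayed_flight", props)
  }
}
```

Registering the JDBC table as a temporary view is one way to let a single SQL statement span both systems; the same join could also be written with the DataFrame API.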
This project does not require a running Apache Spark cluster, Apache Hive instance, or Relational Database: everything is executed in memory.
This project assumes that both Java and SBT are installed.
Moreover, some further assumptions apply, depending on the system you use. On Windows:
- You need to have the `winutils.exe` binary on your machine, and you have to make sure that it is compatible with your system architecture (32- or 64-bit).
- You need to set the `HADOOP_HOME` environment variable to reflect the directory with `winutils.exe`.
- You need to set the `PATH` environment variable to include `%HADOOP_HOME%\bin`.
- You need to have Administrator rights on your machine. The `run.bat` file must be executed in a command-line window (`cmd`) run as Administrator, i.e. by using the `Run as administrator` option when starting `cmd`.
You can find detailed information on how to set up a Windows system here
You need to execute (preferably via CLI) one of the two run scripts included:
- `run.sh` (for *nix systems)
- `run.bat` (for Windows systems)
The data consists of flight arrival and departure details for all commercial flights within the USA in 2008.
The Flight dataset is a modified version of the dataset provided by Dr. Leonore Findsen.
The Plane dataset is a modified version of the dataset provided by Project Mosaic.
Unless stated elsewhere, all files herein are licensed under the MIT license. For more information, please see the LICENSE file.