Vincibean/SparkJdbcHiveIntegration

Apache Spark - JDBC Hive Integration

A simple example of JDBC and Apache Hive integration in Apache Spark.

Use Case

Save relevant information for each delayed flight. A flight is considered delayed if the delay is greater than 15 minutes.

In particular, the following data must be saved:

  • tail number (i.e. the civil registration or military serial number)
  • aircraft type
  • construction year of the aircraft
  • flight time (i.e. how long the flight lasted)
  • delay
  • the ratio of delay to flight time
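The delay rule and the ratio above can be sketched in a few lines of plain Python. This is only an illustration, not the project's code: the `Flight` class, its field names, the sample rows, and the assumption that both delay and flight time are expressed in minutes (with a positive flight time) are all hypothetical.

```python
from dataclasses import dataclass

# From the use case: a flight is delayed if the delay is greater than 15 minutes.
DELAY_THRESHOLD_MINUTES = 15


@dataclass
class Flight:
    # Hypothetical field names; the real dataset's columns may differ.
    tail_number: str
    flight_time_minutes: float  # assumed to be positive
    delay_minutes: float


def is_delayed(flight: Flight) -> bool:
    """A flight counts as delayed only when the delay strictly exceeds the threshold."""
    return flight.delay_minutes > DELAY_THRESHOLD_MINUTES


def delay_ratio(flight: Flight) -> float:
    """Ratio of delay to flight time."""
    return flight.delay_minutes / flight.flight_time_minutes


# Invented sample rows, for illustration only.
flights = [
    Flight("N100AA", 120.0, 30.0),
    Flight("N200BB", 90.0, 15.0),  # exactly 15 minutes: not delayed
    Flight("N300CC", 60.0, 45.0),
]

delayed = [(f.tail_number, delay_ratio(f)) for f in flights if is_delayed(f)]
print(delayed)  # [('N100AA', 0.25), ('N300CC', 0.75)]
```

Note that a delay of exactly 15 minutes is excluded, because the rule says "greater than".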

Project Description

In order to get all the required data, two datasets should be used:

  • the Flight dataset
  • the Plane dataset

Yet, these two datasets reside on two different systems:

  • the Flight dataset is contained in a structured file loaded into a Hive table
  • the Plane dataset is contained in a Relational Database

We use Apache Spark to load both datasets from their respective systems so that a single query can access the data as if it were all stored in one system. Once we have the result, we save it in the Relational Database.

This project doesn't need a running Apache Spark cluster, Apache Hive instance or Relational Database server: everything is executed in memory.
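The data flow described above (load two sources, query them together, write the result back) can be imitated with the Python standard library alone. The sketch below is an analogy, not the project's Spark code: an in-memory CSV string stands in for the structured file behind the Hive table, an in-memory `sqlite3` database stands in for the Relational Database, and every table name, column name and row is invented for illustration.

```python
import csv
import io
import sqlite3

# Stand-in for the Flight dataset (a structured file loaded into a Hive table
# in the real project). Columns and rows are invented.
FLIGHT_CSV = """tail_number,flight_time,delay
N100AA,120,30
N200BB,90,10
N300CC,60,45
"""

# Stand-in for the relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Stand-in for the Plane dataset that already lives in the database.
conn.execute(
    "CREATE TABLE plane (tail_number TEXT PRIMARY KEY, aircraft_type TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO plane VALUES (?, ?, ?)",
    [
        ("N100AA", "Fixed Wing Multi-Engine", 1998),
        ("N300CC", "Fixed Wing Multi-Engine", 2004),
    ],
)

# Load the flight file next to the plane table so that one query can see both.
conn.execute("CREATE TABLE flight (tail_number TEXT, flight_time REAL, delay REAL)")
flight_rows = [
    (row["tail_number"], float(row["flight_time"]), float(row["delay"]))
    for row in csv.DictReader(io.StringIO(FLIGHT_CSV))
]
conn.executemany("INSERT INTO flight VALUES (?, ?, ?)", flight_rows)

# Join both datasets, keep only delayed flights, and save the result as a new table.
conn.execute(
    """
    CREATE TABLE delayed_flight AS
    SELECT f.tail_number,
           p.aircraft_type,
           p.year AS construction_year,
           f.flight_time,
           f.delay,
           f.delay / f.flight_time AS delay_ratio
    FROM flight AS f
    JOIN plane AS p ON p.tail_number = f.tail_number
    WHERE f.delay > 15
    """
)

result = conn.execute(
    "SELECT tail_number, construction_year, delay_ratio "
    "FROM delayed_flight ORDER BY tail_number"
).fetchall()
print(result)  # [('N100AA', 1998, 0.25), ('N300CC', 2004, 0.75)]
```

In the project itself, Spark plays the role that the single SQLite connection plays here: it makes both sources visible to the same query.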

Prerequisites

This project assumes that both Java and SBT are installed.

Moreover, some further assumptions are made depending on the system you use.

*nix Systems

  • You need to have Administrator rights on your machine

Windows Systems

  • You need to have the winutils.exe binary on your machine, and you have to make sure that it is compatible with your system architecture (32-bit or 64-bit)
  • You need to set HADOOP_HOME to reflect the directory with winutils.exe
  • You need to set the PATH environment variable to include %HADOOP_HOME%\bin
  • You need to have Administrator rights on your machine. The run.bat file must be executed in a command-line window (cmd) run as Administrator, i.e. using the Run as administrator option when starting cmd.
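The environment checks in the list above can be verified before launching the project. The following is a minimal sketch, assuming the conventional layout in which winutils.exe sits in the bin folder under HADOOP_HOME; the function name `check_winutils` is hypothetical and not part of the project. It does not check the binary's architecture or Administrator rights.

```python
import os
from pathlib import Path


def check_winutils(env):
    """Return a list of problems found with the Windows prerequisites.

    `env` is a mapping such as os.environ. Assumes winutils.exe is expected
    in the bin folder under HADOOP_HOME.
    """
    problems = []
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
        return problems

    bin_dir = Path(hadoop_home) / "bin"
    if not (bin_dir / "winutils.exe").is_file():
        problems.append(f"winutils.exe not found in {bin_dir}")

    # PATH entries are separated by os.pathsep (";" on Windows, ":" elsewhere).
    path_entries = [Path(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if bin_dir not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_winutils(os.environ):
        print(problem)
```

An empty list means the HADOOP_HOME and PATH settings look consistent.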

You can find detailed info on how to set up a Windows System here

How To Run It

You need to execute (preferably via CLI) one of the two run scripts included:

  • run.sh (for *nix systems)
  • run.bat (for Windows systems)
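If the choice between the two scripts needs to be automated, it can be sketched as below. The helper name `run_script_for` is hypothetical; only the two script names come from this README.

```python
import platform


def run_script_for(system_name):
    """Pick the run script for an OS name as reported by platform.system()."""
    return "run.bat" if system_name == "Windows" else "run.sh"


# platform.system() returns e.g. "Windows", "Linux" or "Darwin".
print(run_script_for(platform.system()))
```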

Dataset Description

The data consists of flight arrival and departure details for all commercial flights within the USA in 2008.

The Flight dataset is a modified version of the dataset provided by Dr. Leonore Findsen.

The Plane dataset is a modified version of the dataset provided by Project Mosaic.

License

Unless stated elsewhere, all files herein are licensed under the MIT license. For more information, please see the LICENSE file.
