Customer360Pipeline

Description

A data pipeline that filters closed-order data and pushes it to Hive and HBase on a schedule, sending a Slack notification when a run succeeds or fails.


Technology Stack:

Python

Docker

HBase

Airflow

Hadoop

Hive

Slack

AWS EC2


Steps involved in the Pipeline

  1. Fetching the orders data from an S3 bucket
  2. Creating a customers info table in MySQL by loading the data from the customers info file [link]
  3. Loading the customers information from the MySQL database into Hive using Sqoop
  4. Filtering the orders data for closed orders by processing it with Spark
  5. Creating a table for the closed orders in Hive
  6. Joining the closed orders table with the customers table in Hive (the joined data is stored in HBase, which is possible because of the Hive-HBase integration)
  7. Sending a success or failure notification to a Slack channel
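
The filter in step 4 runs on Spark in the pipeline. As a rough illustration of the same logic, here is a minimal sketch in plain Python (not Spark) using the standard `csv` module. The column layout in `ORDER_COLUMNS`, the `CLOSED` status value, and the sample rows are assumptions made up for this example; the README does not specify the orders schema.

```python
import csv
import io

# Assumed column layout (hypothetical; adjust to the real orders file).
ORDER_COLUMNS = ["order_id", "order_date", "customer_id", "order_status"]


def filter_closed_orders(lines):
    """Yield order rows whose status is CLOSED (case-insensitive)."""
    reader = csv.DictReader(lines, fieldnames=ORDER_COLUMNS)
    for row in reader:
        # Short rows leave missing fields as None, so guard before stripping.
        status = (row["order_status"] or "").strip().upper()
        if status == "CLOSED":
            yield row


# Made-up sample rows, only to exercise the filter.
sample = io.StringIO(
    "1,2020-01-01,101,CLOSED\n"
    "2,2020-01-01,102,PENDING_PAYMENT\n"
    "3,2020-01-02,103,COMPLETE\n"
    "4,2020-01-02,104,CLOSED\n"
)
closed_orders = list(filter_closed_orders(sample))
print([row["order_id"] for row in closed_orders])
```

In the actual pipeline the equivalent filter would be expressed as a Spark job so that it scales with the size of the orders data.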

Installation

  1. Install Hadoop, Hive, MySQL, and HBase on an EC2 instance, and create a connection ID in Airflow so the pipeline can be executed on that instance

    i. Refer to this article for Hadoop installation on Ubuntu

    ii. Refer to this article for Hive installation on Ubuntu

    iii. Refer to this article for MySQL installation on Ubuntu

    iv. Refer to this article for HBase installation on Ubuntu

  2. Install Docker to run the Airflow container

    i. Refer to this documentation for installing Docker

    ii. To run the Airflow container, use the docker-compose.yaml in this repo

        
         # Clone the repo, open a terminal, cd into the repo folder, and run the following command
    
         docker-compose up
  3. Create an SSH connection ID for connecting to the EC2 instance; refer to this article (Note: go to airflow-ui to access the Airflow UI)

  4. Create a Slack webhook integration and configure the Slack webhook connection in Airflow (to create a connection, go to the Admin section of the Airflow UI and click Connections)

    i. Refer to this article for creating a Slack webhook

    ii. Refer to this article for configuring the Slack webhook in Airflow
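
Before wiring the webhook into Airflow, it can help to confirm that the webhook itself works. The following is a minimal standalone sketch using only the Python standard library. It is not how the pipeline sends notifications (that goes through the Airflow connection configured above). The function names `build_slack_payload` and `post_to_slack` and the message wording are hypothetical; the only Slack behavior relied on is that an incoming webhook accepts a JSON body with a `text` field.

```python
import json
import urllib.request


def build_slack_payload(pipeline_name, succeeded):
    """Build the JSON body for a Slack incoming webhook message."""
    status = "succeeded" if succeeded else "failed"
    return {"text": f"Pipeline {pipeline_name} {status}."}


def post_to_slack(webhook_url, payload):
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


payload = build_slack_payload("Customer360Pipeline", succeeded=True)
print(json.dumps(payload))

# To actually send the message, pass your own webhook URL:
# post_to_slack("<your-webhook-url>", payload)
```

If the test message shows up in the channel, the same webhook can then be entered in the Airflow connection.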