
Wine Quality Prediction

1. Parallel training implementation

  • Create a cluster

We create a cluster of 5 nodes (1 master, 4 slaves) to train an ML model that predicts the quality of wine. We use the AWS EMR cluster management service to achieve this.
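
The cluster itself was created through the EMR console; for reference, a roughly equivalent cluster could be provisioned from the AWS CLI (the release label and instance type below are assumptions, not the exact settings used):

aws emr create-cluster \
  --name "wine-quality-training" \
  --release-label emr-5.33.0 \
  --applications Name=Spark Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 5 \
  --use-default-roles \
  --ec2-attributes KeyName=<your-key-pair>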


  • Upload files to the S3 bucket

Once the cluster is created, we have access to the S3 bucket attached to it.


Upload the target folder containing the JAR file Trainer-1.0-SNAPSHOT.jar, along with the dataset, to the bucket.

The target folder is generated when you run mvn compile package.
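
The upload can be done from the S3 console or, equivalently, with the AWS CLI, mirroring the bucket paths used in the copy commands below:

aws s3 cp TrainingDataset.csv s3://aws-logs-612956122687-us-east-1/elasticmapreduce/j-33M8LH0D9HLMD/dataset/
aws s3 cp ValidationDataset.csv s3://aws-logs-612956122687-us-east-1/elasticmapreduce/j-33M8LH0D9HLMD/dataset/
aws s3 cp target s3://aws-logs-612956122687-us-east-1/elasticmapreduce/j-33M8LH0D9HLMD/target --recursive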


Log in to the master instance and pull the files from the S3 bucket:

aws s3 cp s3://aws-logs-612956122687-us-east-1/elasticmapreduce/j-33M8LH0D9HLMD/dataset/ ./ --recursive
aws s3 cp s3://aws-logs-612956122687-us-east-1/elasticmapreduce/j-33M8LH0D9HLMD/target ./target --recursive


First, we need to make these files available to the slave nodes by putting them into HDFS:

hadoop fs -put TrainingDataset.csv
hadoop fs -put ValidationDataset.csv

We can confirm the files are accessible using:

hdfs dfs -ls -t -R


Now we can start training. We use the spark-submit command to run the JAR file:

spark-submit target/Trainer-1.0-SNAPSHOT.jar
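
On EMR, spark-submit runs on YARN by default and EMR chooses the executor settings; to make the parallelism across the 4 slave nodes explicit, you can pass them yourself (the values here are illustrative, not the exact settings used above):

spark-submit --master yarn --num-executors 4 target/Trainer-1.0-SNAPSHOT.jar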


This will create a model/ModelTrained folder in HDFS and store the trained model in it. Verify that the model was created by executing the following:

hdfs dfs -ls -t -R


Copy this folder back to our master node's local filesystem using the following:

hdfs dfs -copyToLocal model/ModelTrained /home/hadoop/wine_quality

Create a compressed tarball:

tar czf model.tar.gz wine_quality

Download the file to your local machine (e.g., via sftp or FileZilla).

2. Single machine prediction

SSH into your EC2 instance. Make sure Scala (v2) and Spark (v2) are installed on it.
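
You can verify the installed versions with:

scala -version
spark-submit --version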

Go to the pom.xml file and change the main class to Testor:

<mainClass>Testor</mainClass>

Then recompile your project.
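
The same Maven command from step 1 regenerates the target folder with the new main class:

mvn compile package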

Upload the target folder, the dataset, and model.tar.gz to your EC2 instance, then extract model.tar.gz:

tar -xzvf model.tar.gz

Now we can run the prediction. We use the spark-submit command to run the JAR file:

spark-submit target/Trainer-1.0-SNAPSHOT.jar


3. Docker container for prediction

Link to Docker Container

Now we're going to build our Docker image and create a container out of it.

Make sure you are logged in to docker.io on your local machine.

First, copy the target, dataset, and model files to a new folder (say, docker).
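
docker build also expects a Dockerfile in that folder. The repository's actual Dockerfile is not shown here, so the following is only a minimal sketch of what it might look like (the base image, paths, and entrypoint are all assumptions), written as a shell heredoc:

cat > Dockerfile <<'EOF'
# Assumption: a public base image with Spark preinstalled; swap in whatever base you use
FROM bitnami/spark:2.4.6
# Assumed layout: the jar from target/ and the extracted model folder
COPY target/Trainer-1.0-SNAPSHOT.jar /Trainer-1.0-SNAPSHOT.jar
COPY wine_quality /model/ModelTrained
# The run command below bind-mounts TestingDataset.csv to /TestingDataset.csv,
# so Testor is assumed to read the dataset and the model from these paths
ENTRYPOINT ["spark-submit", "/Trainer-1.0-SNAPSHOT.jar"]
EOF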

Then run the command:

docker build -t srd22/wine_quality .
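
You can confirm the image exists locally before pushing:

docker images srd22/wine_quality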


Next, publish the image:

docker push srd22/wine_quality


Then go to your EC2 instance and use the following command to run the container:

docker run --mount type=bind,source=/home/ec2-user/TestingDataset.csv,target=/TestingDataset.csv srd22/wine_quality


4. Terraform - Automation of the infrastructure

Set the following as environment variables: AWS_ACCESS_KEY_ID_ROOT, AWS_SECRET_ACCESS_KEY_ROOT, AWS_REGION.
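
For example (the region matches the bucket used above; the keys are placeholders):

export AWS_ACCESS_KEY_ID_ROOT=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY_ROOT=<your-secret-access-key>
export AWS_REGION=us-east-1

Then run the setup scripts in order: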

  1. Run the scripts/root_setup.sh script to perform root-level environment setup.

  2. Run the scripts/setup_train.sh script to create the EMR cluster and train the model.

  3. Run the scripts/setup_test.sh script to test the model on an EC2 instance.
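
The setup is plain shell scripts, so an end-to-end run from the repository root looks like:

sh scripts/root_setup.sh
sh scripts/setup_train.sh
sh scripts/setup_test.sh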

5. CI/CD GitHub Action

We achieve CI/CD using a GitHub Actions workflow that retrains the model whenever changes are made to the source code:

on:
# Conditions for an event to be triggered.
  push:
    branches:
      - main
    paths: 
      - src/main/java/*

scripts/setup_train.sh will be executed to retrain the model.

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  build:
    # The type of runner that the job will run on
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - name: Sync & Train
        run: sh ./scripts/setup_train.sh
        env:
          AWS_REGION: ${{ secrets.AWS_REGION_USER }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_USER }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_USER }}
