
Scaling_cnn_emr

Yassine Assaadi: Deployment and scaling of a MobileNetV2 model

Table of Contents:

Introduction

1.1 Problem Statement

1.2 Objectives in this Project

1.3 Project Workflow

Selected General Technical Choices

2.1 Distributed Computing

2.2 Transfer Learning

Deployment of the Solution Locally

3.1 Working Environment

3.2 Spark Installation

3.3 Package Installation

3.4 Library Import

3.5 Definition of Paths for Loading Images and Saving Results

3.6 Creating the SparkSession

3.7 Data Processing

3.7.1 Loading Data

3.7.2 Model Preparation

3.7.3 Definition of the Image Loading Process and Application of Feature Extraction Using pandas UDF

3.7.4 Execution of Feature Extraction Actions

3.8 Loading Saved Data and Result Validation

Deployment of the Solution in the Cloud

4.1 Cloud Provider Choice: AWS

4.2 Technical Solution Choice: EMR

4.3 Data Storage Solution Choice: Amazon S3

4.4 Environment Configuration

4.5 Uploading Data to S3

4.6 Configuring the EMR Server

4.6.1 Step 1: Software and Steps

4.6.1.1 Software Configuration

4.6.1.2 Modifying Software Parameters

4.6.2 Step 2: Hardware

4.6.3 Step 3: General Cluster Settings

4.6.3.1 General Options

4.6.3.2 Bootstrap Actions

4.6.4 Step 4: Security

4.6.4.1 Security Options

4.7 Instantiating the Server

4.8 Creating an SSH Tunnel to the EC2 Instance (Master)

4.8.1 Creating Permissions for Incoming Connections

4.8.2 Creating SSH Tunnel to the Driver

4.8.3 Configuring FoxyProxy

4.8.4 Accessing EMR Server Applications via SSH Tunnel

4.9 Connecting to JupyterHub Notebook

4.10 Code Execution

4.10.1 Starting the Spark Session

4.10.2 Package Installation

4.10.3 Library Import

4.10.4 Definition of Paths for Loading Images and Saving Results

4.10.5 Data Processing

4.10.5.1 Loading Data

4.10.5.2 Model Preparation

4.10.5.3 Definition of the Image Loading Process and Application of Feature Extraction Using pandas UDF

4.10.5.4 Execution of Feature Extraction Actions

4.10.6 Loading Saved Data and Result Validation

4.11 Tracking Task Progress with Spark History Server

4.12 Termination of EMR Instance

4.13 Cloning the EMR Server (if needed)

4.14 S3 Server Directory Structure at the End of the Project

Conclusion

Introduction

1.1 Problem Statement

The young AgriTech startup "Fruits!" seeks to offer innovative solutions for fruit harvesting. The company's goal is to preserve fruit biodiversity by enabling treatments specific to each fruit species, through the development of intelligent harvesting robots. To gain recognition, the startup first plans to offer the general public a mobile application that lets users photograph a fruit and obtain information about it. The startup believes this application will raise public awareness of fruit biodiversity and will serve as a first version of its fruit image classification engine. Furthermore, developing the mobile application will help build an initial version of the required Big Data architecture.

1.2 Objectives in this Project

The objective of this project is to develop an initial data-processing pipeline that includes preprocessing and dimensionality-reduction steps. It is important to consider that the volume of data will increase rapidly after the completion of this project. The objectives therefore include:

Deploying the data processing in a Big Data environment

Developing scripts in PySpark to perform distributed computing

Selected General Technical Choices

2.1 Distributed Computing

The project statement requires us to develop scripts in PySpark to account for the rapid increase in data volume after project delivery.

To quickly and simply understand what PySpark is and how it works, we recommend reading this article: PySpark: All You Need to Know about the Python Library.

The beginning of the article states the following: "When it comes to database processing in Python, the pandas library immediately comes to mind. However, when dealing with excessively large databases, calculations become too slow. Fortunately, there is another Python library, similar to pandas, that allows processing of very large amounts of data: PySpark. Apache Spark is an open-source framework developed by UC Berkeley's AMPLab, enabling the processing of massive databases using distributed computing, a technique that leverages multiple computing units distributed across clusters to reduce query execution time. Spark was developed in Scala and performs optimally in its native language. However, the PySpark library offers the ability to use Spark with the Python language while maintaining similar performance to Scala implementations. Therefore, PySpark is a good alternative to the pandas library when dealing with excessively large datasets that result in time-consuming calculations."

As we can see, PySpark is a way to communicate with Spark from the Python language. Spark (or Apache Spark) itself is an open-source, in-memory distributed computing framework that manages and coordinates the execution of tasks on data across a group of machines, enabling the processing and analysis of massive amounts of data.

Another highly informative and comprehensive article explains how Spark works, as well as the role of the Spark session that we will use in this project.

Here is an excerpt from the article: "Spark applications consist of a driver (the 'driver process') and multiple executors ('executor processes'). The driver can be configured to act as an executor itself (local mode) or to use as many executors as required to process the application. Spark supports automatic scaling by configuring a minimum and maximum number of executors.

Spark Architecture Diagram

The driver (sometimes referred to as the 'Spark Session') distributes and schedules tasks among the different executors, which execute the tasks and enable distributed processing. It is responsible for executing code on the different machines.

Each executor is a separate Java Virtual Machine (JVM) process that can be configured with the number of CPUs and allocated memory. Only one task can process a data partition at a time."

In both the local and cloud environments, we will use Spark and leverage it through Python scripts using PySpark.

In the local version of our script, we will simulate distributed computing to validate that our solution works. In the cloud version, we will perform operations on a cluster of machines.
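As a minimal sketch of what this looks like in code (the application name is hypothetical), a local session can be created as follows; in the cloud version, the .master() call is simply omitted so that YARN, configured by EMR, schedules the executors across the cluster:

```python
from pyspark.sql import SparkSession

# "local[*]" runs Spark in a single local JVM, treating every available
# CPU core as an independent worker -- enough to simulate distributed
# computing and validate the pipeline before moving to a real cluster.
spark = (
    SparkSession.builder
    .appName("fruits-feature-extraction")  # hypothetical application name
    .master("local[*]")
    .getOrCreate()
)
print(spark.sparkContext.defaultParallelism)  # number of cores used as workers
```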

2.2 Transfer Learning

The project statement also requires us to create an initial data processing pipeline that includes preprocessing and dimensionality reduction.

It is also mentioned that it is not necessary to train a model at this stage.

We have decided to use a transfer learning solution.

Simply put, transfer learning involves utilizing the knowledge already acquired by a pre-trained model (in this case, MobileNetV2) and adapting it to our problem.

We will feed our images to the model and retrieve the output of its second-to-last layer. The last layer of the model is a softmax layer used for image classification, which is not required for this project.

The second-to-last layer outputs a reduced-dimensional feature vector (1,1,1280).
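As a minimal sketch using the Keras API bundled with TensorFlow (the exact index of the penultimate pooling layer is an assumption that can vary between TensorFlow releases):

```python
from tensorflow.keras import Model
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2

# Load MobileNetV2 pre-trained on ImageNet, then drop the final softmax
# layer by taking the output of the second-to-last layer: a pooled
# 1280-dimensional feature vector per image.
base = MobileNetV2(weights="imagenet", input_shape=(224, 224, 3))
feature_extractor = Model(inputs=base.input, outputs=base.layers[-2].output)
feature_extractor.summary()  # final output: the reduced feature vector
```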

This will allow us to create an initial version of the engine for fruit image classification.

MobileNetV2 has been chosen for its fast execution, which is particularly suitable for processing large volumes of data, as well as the low dimensionality of the output feature vector (1,1,1280).
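Sections 3.7.3 and 4.10.5.3 apply this feature extraction across the workers with a pandas UDF. Here is a minimal sketch of that pattern, assuming a Spark 3.x session named spark, the feature_extractor defined in the previous sketch, and a hypothetical local image folder:

```python
import io

import numpy as np
import pandas as pd
from PIL import Image
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from tensorflow.keras import Model
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input

# Broadcast the trained weights once so every executor can rebuild the
# truncated model locally instead of receiving it with each task.
bc_weights = spark.sparkContext.broadcast(feature_extractor.get_weights())

def model_fn():
    base = MobileNetV2(weights=None, input_shape=(224, 224, 3))
    model = Model(inputs=base.input, outputs=base.layers[-2].output)
    model.set_weights(bc_weights.value)
    return model

@pandas_udf(ArrayType(FloatType()))
def featurize(content: pd.Series) -> pd.Series:
    model = model_fn()
    # Decode the raw bytes of each image, resize to the model's input
    # size, and apply MobileNetV2's own preprocessing.
    batch = np.stack([
        preprocess_input(np.asarray(
            Image.open(io.BytesIO(raw)).convert("RGB").resize((224, 224)),
            dtype="float32",
        ))
        for raw in content
    ])
    feats = model.predict(batch)
    return pd.Series([f.tolist() for f in feats])

# Spark's binaryFile source yields one row per image with its bytes in
# the `content` column; the path below is hypothetical.
images = (spark.read.format("binaryFile")
          .option("pathGlobFilter", "*.jpg")
          .load("data/Test"))
features = images.select("path", featurize("content").alias("features"))
```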

[Slides 4 through 19: illustrated walkthrough of the local and cloud deployment steps listed in the table of contents.]

Conclusion

We completed this project in two phases, taking into account the constraints imposed on us. In the first phase, we developed our solution locally on a virtual machine in a Linux Ubuntu environment.

This phase began with setting up the Spark working environment. Spark has a local-mode parameter that simulates distributed computing by treating each processor core as an independent worker. We worked on a smaller dataset simply to validate that the solution functioned correctly.

We chose to implement transfer learning using the MobileNetV2 model. This model was selected for its lightweight nature, fast execution, and the low dimensionality of its output vector.

The results were saved to disk in multiple partitions using the Parquet format.
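A one-line sketch of this step, assuming the features DataFrame from the extraction sketch above; the output path and partition count are illustrative:

```python
# Spark writes one Parquet file per DataFrame partition, so the
# repartition() call controls how many files land on disk.
features.repartition(24).write.mode("overwrite").parquet("results/features")
```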

The solution worked perfectly in local mode.

The second phase involved creating a real computing cluster. The goal was to anticipate a future increase in workload.

The best choice was the Amazon Web Services provider, which lets us rent computing power on demand at an affordable cost. This service, called EC2, falls under the Infrastructure-as-a-Service (IaaS) offerings.

We went further by using a higher-level Platform-as-a-Service (PaaS) offering: the EMR service. This allowed us to quickly instantiate multiple servers (a cluster) that come installed and configured with the programs and libraries required for our project, such as Spark, Hadoop, JupyterHub, and the TensorFlow library.

In addition to being faster and simpler to set up, this approach gives us assurance that the stack functions properly, as it has been validated by Amazon's engineers.

We were also able to easily install the necessary packages on all machines in the cluster.

Finally, with very few modifications, we were able to run our notebook even more straightforwardly than we did locally, this time executing the processing on all the images in our "Test" folder.

We opted for the Amazon S3 service to store the project's data. S3 provides all the conditions necessary to store and efficiently exploit our data at low cost. The available space is potentially unlimited, and the costs depend on the space actually used.
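Because Spark on EMR reads s3:// URIs natively, switching the pipeline from local disk to S3 is mostly a matter of changing the paths. A short sketch, reusing the session and features DataFrame from the earlier sketches (the bucket name is hypothetical):

```python
# Hypothetical bucket and prefixes; only these paths change between the
# local and cloud versions of the notebook.
IMAGES_PATH = "s3://fruits-project-bucket/Test"
RESULTS_PATH = "s3://fruits-project-bucket/Results"

images = (spark.read.format("binaryFile")
          .option("pathGlobFilter", "*.jpg")
          .load(IMAGES_PATH))
features.write.mode("overwrite").parquet(RESULTS_PATH)
```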

It will be easy for us to handle an increase in workload by simply resizing our cluster of machines (horizontally and/or vertically, as needed). Costs will increase accordingly but will remain significantly lower than the expenses of purchasing hardware or renting dedicated servers.
