This project is part of the MLOps Zoomcamp. The aim is to build an end-to-end MLOps pipeline.
You can find the demo of the entire project here!
- Problem Statement
- Pre-requisites
- Project Directory Structure
- Infrastructure
- Setup EC2
- Exploratory Data Analysis and Modeling
- Deployment
- Model Monitoring
- Retraining
- CI/CD Pipeline
- Best Practices
- Tests
- User Interface
Social media texts have been used extensively to understand various events and their impact on society. In this project we build a text analyzer that classifies social media text into two categories: `Disastrous` and `Non-Disastrous`. The model is trained on the Kaggle dataset from the "Natural Language Processing with Disaster Tweets" competition.
Classifying tweets as related to natural disasters or not can be valuable for several reasons:
- Early Detection and Response: Social media platforms like Twitter are often used to share real-time information during natural disasters. By classifying tweets, emergency response teams and authorities can quickly identify emerging situations and allocate resources more effectively.
- Situational Awareness: Monitoring tweets can provide insights into the scope, intensity, and impact of a natural disaster. This information can aid in understanding the situation on the ground and making informed decisions.
- Public Safety Alerts: During natural disasters, authorities can use Twitter to send alerts and warnings to affected populations. Accurate classification ensures that relevant alerts reach the right people.
- Resource Allocation: By analyzing tweets, organizations can understand the needs of affected communities and allocate resources such as food, water, medical supplies, and shelter accordingly.
- Disaster Recovery: After a disaster, tweets can provide insights into the ongoing recovery efforts, the needs of survivors, and areas that require additional support.
The following tools are required to run the project:
You will also need an AWS account, Terraform Cloud account, and Prefect Cloud account.
To set up the AWS CLI you'll need to create an IAM user with the following permissions: `AmazonAPIGatewayAdministrator`, `AmazonEC2FullAccess`, `AmazonRDSFullAccess`, `AmazonS3FullAccess`, `AWSLambda_FullAccess`, `CloudWatchFullAccess`, `IAMFullAccess`, and `NetworkAdministrator`. Also create and attach a policy that gives full access to ECR.

This user acts as the admin for the project and is used to create the infrastructure. However, once the infrastructure is created, each service gets its own IAM role with the least required permissions.

Next, you'll need to create an access key for the user. This gives you the `AWS_ACCESS_KEY` and `AWS_SECRET_ACCESS_KEY`, which you'll need to configure the AWS CLI. You can configure the AWS CLI using the command `aws configure`, providing the `AWS_ACCESS_KEY` and `AWS_SECRET_ACCESS_KEY` along with the `AWS_REGION` and `AWS_OUTPUT_FORMAT`.
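To sanity-check the configured credentials, a quick call to STS works. This is only an optional verification step (not part of the project scripts), and it assumes `boto3` is installed:

```python
import boto3

# Prints the account ID and IAM user ARN that the configured credentials resolve to.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
```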
Create an AWS EC2 key pair and download the `.pem` file; this will be used to SSH into the EC2 instance. Follow the instructions here to create the key pair. Save the `.pem` file in the `~/.ssh` directory.

In the file `terraform/modules/ec2_rds/variables.tf`, update the `key_name` variable (at line 44) with the name of the key pair you created.
In the `~/.ssh/config` file, add the following lines:
Host ec2
HostName <ec2_public_ip>
User ec2-user
IdentityFile ~/.ssh/<key_pair_name>.pem
Some files are missing from the GitHub directory structure because they are created for development purposes and are not pushed to GitHub: `workflow.secrets` and `workflow.vars` in the `.github` directory, and `secrets.tfvars` in the `terraform/modules/vars` directory.
.
├── .github
│   ├── workflows
│   ├── workflow.secrets
│   └── workflow.vars
├── Makefile
├── Pipfile
├── Pipfile.lock
├── README.md
├── data
│   ├── raw
│   └── submission.csv
├── deployment
│   ├── Dockerfile
│   ├── Pipfile
│   ├── Pipfile.lock
│   └── app
├── gradio-app
│   └── app.py
├── monitoring
│   ├── config
│   ├── dashboards
│   ├── data
│   ├── docker-compose.yaml
│   ├── evidently_grafana_metrics.py
│   ├── models
│   └── notebooks
├── notebooks
│   ├── exploratory-data-analysis.ipynb
│   └── modeling.ipynb
├── prefect.yaml
├── pyproject.toml
├── terraform
│   ├── main.tf
│   ├── modules
│   ├── outputs.tf
│   └── variables.tf
├── tests
│   ├── integration_tests
│   └── unit_tests
└── training
    ├── prefect.yaml
    ├── re-train.py
    └── utils
The project is deployed on AWS using the following services:
- AWS S3 for storing the data and model artifacts.
- AWS RDS as the MLflow tracking server.
- AWS Lambda for running the inference code.
- AWS API Gateway for creating the API endpoint.
- AWS ECR for storing the docker image.
- AWS EC2 for building the project.
- AWS IAM for managing the permissions.
The infrastructure is managed using Terraform. The Terraform code is located in the `terraform` directory.

Below are some useful Terraform Cloud commands:
- Login to Terraform Cloud: `terraform login`
- Create a new workspace: `terraform workspace new <workspace_name>`
- Select a workspace: `terraform workspace select <workspace_name>`
- List workspaces: `terraform workspace list`
- Delete a workspace: `terraform workspace delete <workspace_name>`
- Show the current workspace: `terraform workspace show`
- Open a terminal or command prompt and move to the `terraform` directory.
- Login to Terraform Cloud using the command `terraform login`.
- Create three workspaces: `dev`, `staging`, and `prod`, e.g. `terraform workspace new dev`. We'll work in the `dev` workspace.
- For each workspace in Terraform Cloud, change the `Execution Mode` to `Local`:
  - Open the browser and go to Terraform Cloud.
  - Select the workspace and click on `Settings`.
  - In the `General` tab, change the `Execution Mode` to `Local`, and click on `Save Settings`.
- In the `terraform/modules/vars` directory, create a file `secrets.tfvars` that contains the following values:
  - `aws_access_key = <AWS_ACCESS_KEY>`
  - `aws_secret_key = <AWS_SECRET_ACCESS_KEY>`
  - `db_username = <DB_USERNAME>` # RDS Postgres username
  - `db_password = <DB_PASSWORD>` # RDS Postgres password
- In the terminal or command prompt, run the command `terraform init`.
- Terraform plan: `terraform plan -var-file="./modules/vars/dev.tfvars" -var-file="./modules/vars/secrets.tfvars"`
- Terraform apply: `terraform apply -var-file="./modules/vars/dev.tfvars" -var-file="./modules/vars/secrets.tfvars"`
- Terraform destroy: `terraform destroy -var-file="./modules/vars/dev.tfvars" -var-file="./modules/vars/secrets.tfvars"`
We'll move to the EC2 instance to work on the rest of the project. You can connect to the EC2 instance using the VS Code Remote SSH extension or the following command:

`ssh ec2-user@ec2`
- Git clone the project on the EC2 instance using the `git clone` command.
- Make sure `make` is installed on the EC2 instance. If not, install it using the following commands:
  - `sudo yum install make`
  - `sudo yum install git` # Also install git if not installed
### Install all the tools and dependencies

```bash
make install-software
make setup
```
The exploratory data analysis and modeling are done in the `notebooks` directory: the exploratory data analysis in the `exploratory-data-analysis.ipynb` notebook and the modeling in the `modeling.ipynb` notebook.
The project root folder contains the `Pipfile` and `Pipfile.lock` files. The `Pipfile` lists all the dependencies, and `Pipfile.lock` pins their exact versions. These files are used by the `pipenv` tool to create a virtual environment and install all the dependencies.
To create a virtual environment and install all the dependencies, run `pipenv install`. To activate the virtual environment, run `pipenv shell`.
The MLflow tracking server uses AWS RDS as its backend store and is used to track the model training runs. MLflow artifacts are stored on AWS S3.
To start the MLflow tracking server, run the following command:
mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://DB_USER:DB_PASSWORD@DB_ENDPOINT:5432/DB_NAME --default-artifact-root s3://S3_BUCKET_NAME
You can find the `DB_USER` and `DB_PASSWORD` in the `secrets.tfvars` file, and the `DB_ENDPOINT` and `DB_NAME` on the AWS RDS console. The endpoint format is `<DB_NAME>.<RANDOM_STRING>.<REGION>.rds.amazonaws.com`.
Note: Add port 5000 for port forwarding in VS Code.
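Once the server is running, training code can log runs against it. Below is a minimal sketch, assuming the tracking server is reachable on `localhost:5000` via port forwarding; the experiment name, parameter, metric value, and artifact path are placeholders.

```python
import mlflow

# Point MLflow at the tracking server started above (port-forwarded to localhost).
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("text-analyzer")  # placeholder experiment name

with mlflow.start_run():
    mlflow.log_param("model", "logistic-regression")  # placeholder parameter
    mlflow.log_metric("f1_score", 0.79)               # placeholder metric
    # Artifacts logged here end up in the S3 bucket set via --default-artifact-root.
    mlflow.log_artifact("models/vectorizer.pkl")      # placeholder artifact path
```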
Below you can find the MLflow UI screenshot of the model training runs:
MLflow Logged Model:
You can see in the image below that the model is logged to the S3 bucket.
MLflow Model Registry:
The deployment is done using AWS Lambda and AWS API Gateway. We deploy the docker image to AWS ECR which is used by the AWS Lambda function. The AWS Lambda function is invoked by the AWS API Gateway endpoint.
To deploy our model, we'll convert our notebooks to Python scripts and create the environment files `Pipfile` and `Pipfile.lock` with all the dependencies. Then we'll build a Docker image and push it to AWS ECR.

All the deployment-related code is located in the `deployment` directory.

Following best practices for deploying machine learning models, we don't manually build the image and push it to AWS ECR; the CI/CD pipeline automates the deployment process. More on this later.
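For reference, a minimal sketch of what the Lambda entry point can look like is shown below. The event shape assumes an API Gateway proxy integration, and the `predict` stub stands in for the real model loading and inference code under `deployment/app`.

```python
import json


def predict(text: str) -> str:
    """Placeholder: the real code loads the trained model and classifies the tweet."""
    return "Non-Disastrous"  # stub prediction for illustration


def lambda_handler(event, context):
    # API Gateway (proxy integration) passes the request body as a JSON string.
    body = json.loads(event.get("body") or "{}")
    label = predict(body.get("text", ""))
    return {"statusCode": 200, "body": json.dumps({"prediction": label})}
```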
Everything we have done so far was performed on the cloud infrastructure. The current and the next section, however, are run locally on the EC2 instance.
Due to time constraints, I was not able to automate the model monitoring process; however, the model monitoring code is written and located in the `monitoring` directory.

We'll be using the `Evidently` library for model monitoring. Evidently is an open-source Python library for monitoring machine learning models. It provides interactive reports to monitor the behavior of your model over time, as well as a set of metrics to detect model performance drift and data drift.
To keep track of the model performance and data drift, we'll build a dashboard using `Grafana`.
The idea behind model monitoring is that we have already trained a model on the training data. However, over time new data will be generated. We'll use the new data to monitor the model performance and data drift. If the model performance and data drift are not within the threshold, we'll retrain the model.
In Evidently, the `reference data` is the data that was used to train the model, and the `current data` is the new data generated over time. The current data is compared against the reference data to monitor the model performance and data drift.
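A minimal sketch of such a comparison is shown below; the file paths and column names are placeholders, the exact API may vary with the Evidently version installed, and the real metric extraction lives in `monitoring/evidently_grafana_metrics.py`.

```python
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Reference data = what the model was trained on; current data = newly collected tweets.
reference = pd.read_csv("data/raw/train.csv")     # placeholder path
current = pd.read_csv("data/raw/new_tweets.csv")  # placeholder path

column_mapping = ColumnMapping(text_features=["text"], target="target")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current, column_mapping=column_mapping)
report.save_html("drift_report.html")  # or extract the metrics and push them to Grafana
```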
To view the model monitoring dashboard, run the following command:
docker-compose up
You can view the dashboard on the following URL:
http://localhost:3000
Note: Allow port 3000 for port forwarding in VS Code.
If you come across the error `bind: address already in use`, run the following commands to find and kill the process holding the port:
sudo ss -lptn 'sport = :5432'
sudo kill <PID>
The dashboard will look like the following:
I have created three categories of metrics to monitor: `Text Summary Metrics`, `Data Drifts`, and `Metrics`.

Text Summary Metrics:
- Provides summary metrics on the textual data such as `Number of Missing values`, `Mean text length`, `Out of vocabulary words %`, and `Non-letter characters %`.

Data Drifts:
- Similar to the above, checks for data drift in `Text Length`, `Out of vocabulary words`, and `Non-letter characters`.

Metrics:
- Provides model quality metrics such as `Accuracy`, `Precision`, `Recall`, and `F1 Score`.
The idea behind retraining is that when the model performance degrades or the data drift exceeds the threshold, we retrain the model on the new data.

The retraining code is located in the `training` directory.

As the project is not live and we don't have new data, I have written the code to retrain the model on the training data. However, there are placeholders where you can change the path to the new data.
To retrain the model, run the following command:
python re-train.py
Ideally, the re-training process should be automated. We'll use `Prefect` for this. Prefect is a workflow management system that makes it easy to build, run, and monitor data workflows.

We have wrapped our re-training code in a Prefect flow and deployed it to Prefect Cloud. The flow can be scheduled to run at a specific time interval or triggered when the monitoring metrics overshoot their thresholds.
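Conceptually, the flow looks something like the sketch below. The flow name and entrypoint match the deployment prompts further down; the task bodies are placeholders for the real logic in `training/re-train.py`.

```python
import pandas as pd
from prefect import flow, task


@task
def load_new_data(path: str) -> pd.DataFrame:
    # Placeholder: in the real flow this reads the newly collected tweets.
    return pd.read_csv(path)


@task
def train_model(data: pd.DataFrame) -> None:
    # Placeholder: the actual training and MLflow logging happen in re-train.py.
    print(f"Training on {len(data)} rows")


@flow(name="Train Model")
def start_training(data_path: str = "data/raw/train.csv") -> None:
    train_model(load_new_data(data_path))


if __name__ == "__main__":
    start_training()
```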
To create a Prefect deployment, you need to update the `repository` value in the `prefect.yaml` file to point to your repository.
In your Prefect Cloud account, create a work pool with the following configuration:
- Sign in to your Prefect Cloud account.
- Click on the `Workflows` tab.
- Click on the `+` icon to create a new work pool.
- Select `Local Subprocess` as the Infrastructure Type.
- Click on `Next`, enter the name of the work pool, and click on `Create`.
Create a work queue in the same work pool with the following configuration:
- Click on the `Work Queues` tab.
- Click on the `+` icon to create a new work queue.
- Enter the name of the work queue and click on `Create`.
Run the following command:
# Login to prefect cloud
prefect cloud login
# Deploy the flow to prefect cloud
prefect deploy --name re-train-job
Answer the prompts as follows:
- Select flow_name: start_training
- Select work_pool_name: text-analyzer
- Is <github-url/username/text-analyzer> the correct URL to pull your flow code from? [y/n] (y): y
- Is feature the correct branch to pull your flow code from? [y/n] (y): n
- Please enter the branch to pull your flow code from (main): main
- Is this a private repository? [y/n]: n
- Would you like to save configuration for this deployment for faster deployments in the future? [y/n]: y
# Start the worker
prefect worker start --pool text-analyzer --work-queue re-train
# Run the deployment
prefect deployment run 'Train Model/re-train-job'
Below is the screenshot of the Prefect Deployment:
Working on a project usually involves multiple environments such as development, staging, and production. To replicate this practice in our project, we'll create three environments in GitHub: `dev`, `stg`, and `prod`.

You can do this by going to the `Settings` tab in your repository and then clicking on `Environments`.
Since we are using CI/CD pipelines to deploy our application to different environments, we'll create GitHub Secrets and Variables to store the credentials and other configuration.
Create repository secrets with the following names and appropriate values based on your configuration:

- `AWS_ACCESS_KEY_ID`: AWS Access Key ID
- `AWS_DEFAULT_REGION`: AWS Default Region
- `AWS_SECRET_ACCESS_KEY`: AWS Secret Access Key
- `EXPERIMENT_ID`: MLflow Experiment ID
- `RUN_ID`: MLflow Run ID
- `GH_TOKEN`: GitHub Token
- `TF_API_TOKEN`: Terraform Cloud API Token
Create environment-level secrets and variables with the following names and values based on your configuration. Each environment will have the following:

Secrets:
- `DB_PASSWORD`: Database Password
- `DB_USERNAME`: Database Username

Variable:
- `TF_WORKSPACE`: `dev`, `stg`, or `prod` for the respective environment
Additionally, the staging and production environments will have the following secrets:

Staging:
- `DEV_MODEL_REGISTRY`: 'mlops-zc-ta-dev-model-registry'

Production:
- `STG_MODEL_REGISTRY`: 'mlops-zc-ta-stg-model-registry'

These are used to sync the model from the development/staging environment and deploy it to the staging/production environment.
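Conceptually, the promotion step copies the model artifacts for a given run from the lower environment's registry bucket to the next one. Below is a rough sketch, assuming an MLflow-style `<experiment_id>/<run_id>/artifacts/` key layout; the actual sync runs inside the CI/CD workflow.

```python
import os
import boto3

s3 = boto3.client("s3")

source_bucket = os.environ["DEV_MODEL_REGISTRY"]   # e.g. mlops-zc-ta-dev-model-registry
target_bucket = "mlops-zc-ta-stg-model-registry"   # the next environment's registry
prefix = f"{os.environ['EXPERIMENT_ID']}/{os.environ['RUN_ID']}/artifacts/"  # assumed layout

# Copy every artifact object under the run's prefix to the target bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=source_bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=target_bucket,
            Key=obj["Key"],
            CopySource={"Bucket": source_bucket, "Key": obj["Key"]},
        )
```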
Always use a separate branch for development and push the changes to the `feature` branch. The `feature` branch is used for development and testing.

If the functionality works as expected, the pipeline automatically pushes the changes to the `staging` branch, where they are tested in the staging environment.

If the functionality works as expected in the staging environment, the pipeline automatically creates a pull request to the `main` branch, where the changes have to be merged manually.

Once the changes are merged into the `main` branch, the pipeline automatically deploys the application to the production environment.
While working on the project, I found a great tool called `act`, which allows you to run your GitHub Actions locally. This is a great way to test your GitHub Actions before pushing the changes to the repository. Check out the repository here.

To make `act` work locally, you need to create two files in the `.github` folder: `workflow.secrets` and `workflow.vars`, which store the secrets and variables used in the GitHub Actions workflow.
- `workflow.secrets` contains: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`, `TF_API_TOKEN`, `DB_USERNAME`, `DB_PASSWORD`, `EXPERIMENT_ID`, `RUN_ID`, `GH_TOKEN`
- `workflow.vars` contains: `TF_WORKSPACE`

Example of running a job locally: `act -j build-infrastructure --secret-file=.github/workflow.secrets --var-file=.github/workflow.vars`
Following the best practices of software development, I have created a `Makefile` to automate the quality checks and other tasks.

The quality checks include the following:
- `black`: Code Formatter
- `pylint`: Code Linter
- `isort`: Import Sorter
- `trailingspaces`: Trailing Whitespace Remover
- `end-of-file-fixer`: End of File Fixer
- `check-yaml`: YAML Linter
To run the quality checks, run the following command:
make quality-checks
We have also incorporated `pre-commit` hooks to run the quality checks before every commit. To install the `pre-commit` hooks, run the following command:
make setup
We have written unit tests, located in the `tests` directory. To run them, run the following command:
make test
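As an illustration of the style of tests under `tests/unit_tests`, here is a self-contained example; the `clean_text` helper defined inline is only a stand-in for the project's real preprocessing utilities.

```python
# tests/unit_tests/test_clean_text.py (illustrative)
import re


def clean_text(text: str) -> str:
    """Inline stand-in for the project's preprocessing helper: drop URLs and lowercase."""
    text = re.sub(r"https?://\S+", "", text)
    return text.lower().strip()


def test_clean_text_removes_urls_and_lowercases():
    cleaned = clean_text("Forest FIRE near La Ronge https://t.co/abc123")
    assert "http" not in cleaned
    assert cleaned == cleaned.lower()
```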
We have created a simple user interface using `Gradio` (see `gradio-app/app.py`) and deployed it on Hugging Face Spaces.
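A minimal sketch of such a Gradio app is shown below; the API Gateway URL is a placeholder, and the actual app lives in `gradio-app/app.py`.

```python
import gradio as gr
import requests

API_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/<stage>/predict"  # placeholder endpoint


def classify(text: str) -> str:
    # Call the API Gateway endpoint that fronts the Lambda function.
    response = requests.post(API_URL, json={"text": text}, timeout=30)
    return response.json().get("prediction", "unknown")


demo = gr.Interface(
    fn=classify,
    inputs=gr.Textbox(label="Tweet"),
    outputs=gr.Textbox(label="Prediction"),
    title="Disaster Tweet Analyzer",
)

if __name__ == "__main__":
    demo.launch()
```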
Screenshot of the application in the production environment:
Non-Disastrous Tweet:
Disastrous Tweet: