This repository contains a Machine Learning (ML) pipeline which predicts message categories in disaster situations. It is precisely during disaster situations that response organizations have the least capacity to evaluate and react properly to each message that reaches them (via direct contact, social media, etc.). In this project, NLP is applied and a classification model is trained so that the category of each message can be predicted automatically; then, the messages can be directed to the appropriate relief agencies. In total, 36 message categories are predicted, which are related to the type of possible emergency, e.g., `earthquake`, `fire`, `missing_people`, etc.
All in all, the following methods/techniques are implemented and documented:
- An ETL pipeline (Extract, Transform, Load).
- A Machine Learning pipeline which applies NLP to messages and predicts message categories.
- Testing (Pytest) and linting.
- Error checks, data validation with Pydantic and exception handling.
- Logging.
- Continuous Integration with GitHub Actions.
- Python packaging.
- Containerization (Docker).
- Flask web app, deployed locally.
The Flask web app enables interaction, i.e., the user inputs a text message and the trained classifier predicts candidate categories:
I took the starter code for this project from the Udacity Data Scientist Nanodegree and modified it to the present form, which deviates significantly from the original version.
The directory of the project consists of the following files:
```
.
├── Instructions.md                     # Original challenge/project instructions
├── README.md                           # This file
├── app                                 # Web app
│   ├── run.py                          # Implementation of the Flask app
│   └── templates                       # Web app HTML/CSS templates
│       ├── go.html
│       └── master.html
├── assets/                             # Images, etc.
├── data                                # Datasets
│   ├── DisasterResponse.db             # Generated database
│   ├── categories.csv                  # Categories dataset
│   └── messages.csv                    # Messages dataset
├── disaster_response                   # Package
│   ├── __init__.py
│   ├── file_manager.py                 # General structures, loading/persistence manager
│   ├── process_data.py                 # ETL pipeline
│   └── train_classifier.py             # ML pipeline and training
├── main.py                             # Script which runs both pipelines: ETL and ML (training)
├── models                              # Inference and evaluation artifacts
│   ├── classifier.pkl                  # Trained pipeline (not committed)
│   └── evaluation_report.txt           # Evaluation metrics: F1, etc.
├── disaster_response_pipeline.log      # Logs
├── notebooks                           # Research notebooks
│   ├── ETL_Pipeline_Preparation.ipynb
│   └── ML_Pipeline_Preparation.ipynb
├── config.yaml                         # Configuration file
├── conda.yaml                          # Conda environment
├── requirements.txt                    # Dependencies for pip
├── Dockerfile                          # Docker image definition
├── docker-compose.yaml                 # Docker compose YAML
├── run.sh                              # Execution script for Docker
├── setup.py                            # Package setup
├── starter/                            # Original starter material
└── tests                               # Tests
    ├── __init__.py
    ├── conftest.py                     # Pytest configuration, fixtures, etc.
    └── test_library.py                 # disaster_response package tests
```
To run the pipelines and the web app, first the dependencies need to be installed, as explained in the next section. Then, we can execute the following commands:
```bash
# This runs the ETL pipeline, which creates the DisasterResponse.db database.
# It also runs the ML pipeline, which trains the models and outputs classifier.pkl.
# WARNING: The training might take some hours, because hyperparameter search
# with cross-validation is performed.
python main.py

# Spin up the web app.
# Wait 10 seconds and open http://localhost:3000
# We see some visualizations there; if we enter a message,
# we should get the predicted categories.
python app/run.py
```
Notes:

- `main.py` uses `config.yaml`; that configuration file defines all necessary parameters for both pipelines (ETL and ML training). However, some parameters can be overridden via CLI arguments; try `python main.py --help` for more information.
- ⚠️ The training might take some hours, because hyperparameter search with cross-validation is performed.
- The outputs from executing both pipelines are the following:
  - `DisasterResponse.db`: cleaned and merged SQLite database, product of the ETL process.
  - `classifier.pkl`: trained classifier, used by the web app.
  - `evaluation_report.txt`: evaluation metrics of the trained classifier.
You can create an environment with conda and install the dependencies with the following recipe:
```bash
# Create the environment from the YAML, incl. packages
conda env create -f conda.yaml
conda activate dis-res
pip install . # install the disaster_response package

# Alternatively, if you prefer, create your own environment
# and install the dependencies with pip
conda create --name dis-res pip
conda activate dis-res
pip install -r requirements.txt
pip install . # install the disaster_response package
```
Note that both `conda.yaml` and `requirements.txt` contain the same packages; however, `requirements.txt` has the specific package versions I have used with Python `3.9.16`.
The dataset is contained in the folder `data`, and it consists of the following files:

- `messages.csv`: a CSV of shape `(26248, 4)`, which contains the help messages (in original and translated form) as well as information on the source.
- `categories.csv`: a CSV of shape `(26248, 2)`, which matches each message id from `messages.csv` with 36 categories, related to the type of disaster message. All categories are in text form in one column. All those target categories are listed in `config.yaml`.
The `notebooks` provide a good first exposure to the contents of the datasets. After running the ETL pipeline, the SQLite database `DisasterResponse.db` is created, which contains a clean merge of the aforementioned files.
In the following subsections, information on different aspects of the implementation is provided.
The Machine Learning (ML) functionalities are implemented in this package, which can be used as shown in `main.py`. The package consists of the following files:

- `file_manager.py`: loading, validation and persistence manager.
- `process_data.py`: ETL pipeline.
- `train_classifier.py`: ML/training pipeline.
Having a file loading/validation/persistence manager makes the other modules clearer, abstracts the access to third-party modules, and improves maintainability.
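For illustration, a validation model of the kind used there could look roughly as follows; this is a hypothetical sketch, not the actual contents of `file_manager.py`, and the model/field names are made up:

```python
# Hypothetical pydantic sketch, in the spirit of file_manager.py
# (model and field names are illustrative, not the real ones)
from pydantic import BaseModel, FilePath, validator

class DataPaths(BaseModel):
    messages_csv: FilePath       # validated: file must exist on disk
    categories_csv: FilePath     # validated: file must exist on disk
    database_filename: str

    @validator("database_filename")
    def must_be_sqlite(cls, v):
        # Fail early with a clear error if the target is not a .db file
        if not v.endswith(".db"):
            raise ValueError("database_filename must end with '.db'")
        return v
```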
The ETL (Extract, Transform, Load) pipeline implemented in `process_data.py` performs the following tasks:

- Load the source CSV datasets from `data`.
- Clean and merge the datasets:
  - Transform categories into booleans.
  - Check that category values are correct.
  - Drop duplicates and NaNs.
- Save the processed dataset to the SQLite database `DisasterResponse.db` (or the filename defined in `config.yaml`).
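A minimal sketch of those steps is shown below, assuming a pandas/SQLAlchemy implementation; the function name, column names and the `;`-separated `name-value` category format are illustrative assumptions, not the exact code in `process_data.py`:

```python
# Minimal ETL sketch (illustrative; not the exact code in process_data.py)
import pandas as pd
from sqlalchemy import create_engine

def run_etl(messages_path="data/messages.csv",
            categories_path="data/categories.csv",
            db_path="data/DisasterResponse.db"):
    # Extract: load the two source CSVs and merge them on the message id
    messages = pd.read_csv(messages_path)
    categories = pd.read_csv(categories_path)
    df = messages.merge(categories, on="id")

    # Transform: split the single text column (assumed "related-1;request-0;...")
    # into 36 boolean columns
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = [c.split("-")[0] for c in cats.iloc[0]]
    for col in cats.columns:
        cats[col] = cats[col].str.split("-").str[1].astype(int)
    cats = (cats > 0).astype(int)  # clamp any value > 1 to a valid boolean

    # Replace the raw categories column, drop duplicates and NaNs
    df = pd.concat([df.drop(columns=["categories"]), cats], axis=1)
    df = df.drop_duplicates().dropna(subset=["message"])

    # Load: persist the clean table into the SQLite database
    engine = create_engine(f"sqlite:///{db_path}")
    df.to_sql("Message", engine, index=False, if_exists="replace")

if __name__ == "__main__":
    run_etl()
```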
We can interact via SQL with the SQLite database `DisasterResponse.db` produced by the ETL pipeline using the CLI, if we install `sqlite3`:
```bash
cd data
# Enter the SQLite terminal
sqlite3
# Open a DB
.open DisasterResponse.db
# Show tables
.tables # Message
# Get table info/columns & types
PRAGMA table_info(Message);
# Get the first 5 entries
SELECT * FROM Message LIMIT 5;
# ...
# Exit the SQLite CLI terminal
.quit
```
For more information on how to interact with relational/SQL databases using Python, visit my sql_guide.
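For a quick look from Python instead of the SQLite CLI, something along these lines should work (a small sketch using pandas and SQLAlchemy; the `Message` table name comes from the `.tables` output above):

```python
# Read the Message table produced by the ETL pipeline into a DataFrame
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///data/DisasterResponse.db")
df = pd.read_sql("SELECT * FROM Message LIMIT 5;", engine)
print(df.head())
```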
The ML training pipeline implemented in `train_classifier.py` loads the dataset from `DisasterResponse.db` and fits a `RandomForestClassifier` using `GridSearchCV`. Since we have multiple targets, the random forest is wrapped with a `MultiOutputClassifier`; as stated in the Scikit-Learn documentation,

> The strategy [of a `MultiOutputClassifier`] consists in fitting one classifier per target.

Thus, the training might take some hours, especially because we perform cross-validation and hyperparameter tuning. The final output consists of two files placed in the folder `models`:

- `classifier.pkl`: trained classifier, serialized as a pickle.
- `evaluation_report.txt`: evaluation metrics of the trained classifier; for each target, a classification report is provided (with F1 metrics).
The `classifier.pkl` file contains not only the model, but also the data processing pipeline that transforms the features and the targets to train the model. More details on that can be found in the associated function `build_model()` from `train_classifier.py`.
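As an orientation, such a pipeline could look roughly as follows; this is a hedged sketch with generic scikit-learn text-processing components and an illustrative parameter grid, not the exact contents of `build_model()`:

```python
# Rough sketch of a text-classification pipeline similar to build_model()
# (component choices and the parameter grid are illustrative)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV

def build_model():
    pipeline = Pipeline([
        ("vect", CountVectorizer()),             # tokenize + count words
        ("tfidf", TfidfTransformer()),           # re-weight counts with TF-IDF
        ("clf", MultiOutputClassifier(           # one random forest per target
            RandomForestClassifier(random_state=42))),
    ])
    param_grid = {
        "clf__estimator__n_estimators": [100, 200],
        "clf__estimator__min_samples_split": [2, 4],
    }
    # Cross-validated hyperparameter search; this is what makes training slow
    return GridSearchCV(pipeline, param_grid=param_grid, cv=3, n_jobs=-1)
```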
🚧 Notes:
- At this stage, the focus of the project is not on optimizing the model, but on creating an MVP of the app; future revisions should improve the model performance.
- The message category distribution (i.e., the target counts) is very imbalanced, as shown in the next figure (and future work should address that, too):
The Flask web app is implemented in `app/run.py`. It consists of two routes that render one page each:
- The index/default page visualizes 3 plots of the data with Plotly; it also offers an input box for the user to insert a message to be classified.
- The classification page appears when the user hits the "classify" button and the model predicts the categories.
More information on how to create Flask dashboards with Plotly: data_science_udacity.
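The skeleton of such an app could look roughly like this; it is a minimal sketch (route names, artifact paths and the column slicing are assumptions, not the exact code in `app/run.py`):

```python
# Minimal sketch of a two-route Flask app of this kind (names/paths are illustrative)
import pickle
import pandas as pd
from flask import Flask, render_template, request
from sqlalchemy import create_engine

app = Flask(__name__)

# Load the artifacts produced by the pipelines
engine = create_engine("sqlite:///data/DisasterResponse.db")
df = pd.read_sql_table("Message", engine)
with open("models/classifier.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/")
@app.route("/index")
def index():
    # Landing page: Plotly visualizations and the message input box
    return render_template("master.html")

@app.route("/go")
def go():
    # Classify the message typed by the user and show the predicted categories
    query = request.args.get("query", "")
    labels = model.predict([query])[0]
    # Assumption: the target columns start after the id/message/original/genre columns
    results = dict(zip(df.columns[4:], labels))
    return render_template("go.html", query=query, classification_result=results)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=3000, debug=True)
```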
Once we have Pytest installed, we can run the tests as follows:
```bash
pytest tests
```
The `tests` folder contains these two files:

- `tests/conftest.py`: configuration and fixtures definition.
- `tests/test_library.py`: tests of functions defined in the `disaster_response` package.
🚧 Currently, very few and shallow tests are implemented; even though the loading/persistence module `file_manager.py` validates many objects with pydantic and error-detection checks, the tests should be extended.
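A new test could follow the existing pattern along these lines; this is a hypothetical example, and the imported function name and asserted columns are assumptions:

```python
# Hypothetical test in the style of tests/test_library.py
# (load_data is an assumed function name from the disaster_response package)
import pandas as pd
from disaster_response.process_data import load_data

def test_load_data_returns_expected_dataframe():
    # Load the raw CSVs and check basic expectations on the merged DataFrame
    df = load_data("data/messages.csv", "data/categories.csv")
    assert isinstance(df, pd.DataFrame)
    assert len(df) > 0
    assert "message" in df.columns
```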
I have implemented Continuous Integration (CI) using GitHub Actions. The workflow file `python-app.yml` performs the following tasks every time we push changes to the `main` branch:

- Requirements are installed.
- `flake8` is run to lint the code; note that `.flake8` contains the files/folders to be ignored.
- Tests are run as explained above: `pytest tests`.
Containerization is a common step before deploying/shipping an application. Thanks to the simple `Dockerfile` in the repository, we can create an image of the web app and run it as a container as follows:
```bash
# Build the Dockerfile to create the image
# docker build -t <image_name[:version]> <path/to/Dockerfile>
docker build -t disaster_response_app:latest .

# Check the image is there: watch the size (e.g., ~1GB)
docker image ls

# Run the container locally from a built image
# Remember to forward ports (-p) and pass the PORT env variable (-e), because run.sh expects it!
# Optional:
# -d to detach/get the shell back,
# --name if we want to choose the container name (else, one is chosen randomly)
# --rm: automatically remove the container after finishing (irrelevant in our case, but...)
docker run -d --rm -p 3000:3000 -e PORT=3000 --name disaster_response_app disaster_response_app:latest

# Check the API locally: open the browser
# WAIT 30 seconds...
# http://localhost:3000
# Use the web app

# Check the running containers: check the name/id of our container,
# e.g., disaster_response_app
docker container ls
docker ps

# Get a terminal into the container: in general, BAD practice
# docker exec -it <id|name> sh
docker exec -it disaster_response_app sh
# (we get inside)
cd /opt/disaster_response_pipeline
ls
cat disaster_response_pipeline.log
exit

# Stop the container and remove it (erase all files in it, etc.)
# docker stop <id/name>
# docker rm <id/name>
docker stop disaster_response_app
docker rm disaster_response_app
```
Alternatively, I have written a `docker-compose.yaml` file which spins up the one-container service with the required parameters:
```bash
# Run the container(s), detached; the local docker-compose.yaml is used
docker-compose up -d

# Check containers, logs
docker-compose ps
docker-compose logs

# Stop containers
docker-compose down
```
Note: in order to keep the image size contained, `.dockerignore` lists all files that can be left out of the image, similarly to `.gitignore`.
- Add logging.
- Lint with `flake8` and `pylint`.
- Deploy it, e.g., to Heroku or AWS; another example project in which I have deployed the app that way: census_model_deployment_fastapi.
- Extend the tests; currently, the test package contains very few tests that serve as a blueprint for further implementations.
- Add type hints to `process_data.py` and `train_classifier.py`; currently, type hints and `pydantic` are used only in `file_manager.py` to clearly define loading and persistence functionalities and to validate the objects they handle.
- Properly optimize the machine learning model, improving its performance:
  - Try alternative models.
  - Perform a thorough hyperparameter tuning (e.g., with Optuna).
  - Address the imbalanced nature of the dataset.
- Add more visualizations to the web app.
- Based on the detected categories, suggest organizations to connect to.
- Improve the front-end design.
- My personal notes on the Udacity MLOps nanodegree: mlops_udacity.
- My personal notes on the Udacity Data Science nanodegree: data_science_udacity.
- Notes on how to transform research code into production-level packages: customer_churn_production.
- My summary of data processing and modeling techniques: eda_fe_summary.
Mikel Sagardia, 2022.
No guarantees.
If you find this repository useful, you're free to use it, but please link back to the original source.