This is a machine learning project that aims to predict the weather.
I am very interested in climate change, and I thought that a model that predicts rain could be interesting.
For this project I downloaded a Kaggle dataset which contains 10 years of daily weather observations in Australia.
I hope that this kind of project can help me develop a more sophisticated model in the future to predict other kinds of events related to weather and climate change.
The idea is to predict whether it will rain tomorrow based on the observations we make today. The dataset contains information from Australia, so this model specifically predicts whether it will rain tomorrow in a given Australian city.
The dataset contains many variables, such as humidity, temperature and wind speed, plus the target variable to predict, RainTomorrow. RainTomorrow is associated with each day and states, for each observation in the dataset, whether or not it rained the following day.
Training our model with all the features can help predict which type of day is most likely to lead to a rainy tomorrow.
URL to access the service: midterm-project-env.eba-myupfmwp.eu-north-1.elasticbeanstalk.com
Dataset link: https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package
The model is trained with a gradient boosting algorithm on a dataset of 145,460 observations. The dataset was last updated 3 years ago.
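For a quick look at the data, here is a minimal sketch of loading it with pandas (pandas is assumed; the column names are the ones used in the Kaggle dataset, and the full preparation lives in notebook.ipynb):

```python
import pandas as pd

# Load the Kaggle weather dataset (weatherAUS.csv from this repository)
df = pd.read_csv('weatherAUS.csv')

# The target is RainTomorrow: 'Yes' if it rained the following day, 'No' otherwise
print(df.shape)
print(df['RainTomorrow'].value_counts(dropna=False))

# A few of the feature columns alongside the target
print(df[['Humidity3pm', 'MaxTemp', 'WindGustSpeed', 'RainTomorrow']].head())
```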
- Readme.md: description of the problem and instructions on how to run the project
- weatherAUS.csv: data used in the project.
- notebook.ipynb: Python notebook with:
- Data preparation and data cleaning
- EDA and feature importance analysis
- Model selection process and parameter tuning
- train.py: script that trains the model and saves it to a file with pickle (a minimal sketch follows this list)
- predict.py: script that loads the model and serves it via a web service with Flask
- test.py: script that contains a possible day, used to test the model and predict whether it will rain the next day.
- Pipfile and Pipfile.lock: files with the library dependencies
- Dockerfile: instructions to build the Docker image
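As a rough idea of what the training script does, here is a simplified sketch (the XGBoost hyperparameters, the model.bin file name and the minimal handling of missing values are assumptions for illustration; train.py in the repository is the reference):

```python
# Simplified sketch of the training step; see train.py for the real script
import pickle
import pandas as pd
import xgboost as xgb
from sklearn.feature_extraction import DictVectorizer

df = pd.read_csv('weatherAUS.csv')
df = df.dropna(subset=['RainTomorrow'])                 # keep rows where the target is known
y = (df['RainTomorrow'] == 'Yes').astype(int)
dicts = df.drop(columns=['Date', 'RainTomorrow']).to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(dicts)

# Gradient boosting model (xgboost is one of the installed dependencies)
model = xgb.XGBClassifier(n_estimators=100, max_depth=6)
model.fit(X, y)

# Save the vectorizer and the model together so predict.py can load both
with open('model.bin', 'wb') as f_out:
    pickle.dump((dv, model), f_out)
```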
I recorded a video on how to run the project.
- I first created a notebook called notebook.ipynb where I downloaded the data, explored, prepared and cleaned it, ran different models with different parameters, evaluated them, and concluded which model performed the best.
- Then I generated the train.py script that trains the model and saves it to a file with pickle, the predict.py script that loads the model and serves it, and the test.py script that is used to predict a specific day.
- Then I created an environment and installed the libraries I would be using:
pipenv install numpy scikit-learn flask gunicorn xgboost
- Then I activated the environment with:
pipenv shell
- I ran the server locally:
gunicorn --bind 0.0.0.0:9696 predict:app
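The gunicorn command serves the app object from predict.py. A minimal sketch of what such a Flask service could look like (the /predict path, the model.bin file name and the response fields are assumptions; the repository's predict.py is the reference):

```python
# Minimal sketch of the Flask service that gunicorn serves as predict:app
import pickle
from flask import Flask, request, jsonify

with open('model.bin', 'rb') as f_in:           # model file name is an assumption
    dv, model = pickle.load(f_in)

app = Flask('rain_prediction')

@app.route('/predict', methods=['POST'])
def predict():
    day = request.get_json()                    # one day of weather observations as JSON
    X = dv.transform([day])
    rain_probability = float(model.predict_proba(X)[0, 1])
    return jsonify({
        'rain_probability': rain_probability,
        'rain_tomorrow': rain_probability >= 0.5,
    })

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=9696)
```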
- I tested that the model was working with:
python3 test.py
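test.py does roughly the following (the example values, the url variable and the /predict path are assumptions; the real script sends a full day of observations):

```python
# Rough sketch of the test client; see test.py for the actual day of observations
import requests

# Local service; for the cloud test this is switched to the Elastic Beanstalk URL
url = 'http://localhost:9696/predict'

# A possible day of observations (only a handful of the dataset's features shown)
day = {
    'Location': 'Sydney',
    'MinTemp': 13.4,
    'MaxTemp': 22.9,
    'Humidity3pm': 22.0,
    'WindGustSpeed': 44.0,
    'RainToday': 'No',
}

response = requests.post(url, json=day)
print(response.json())
```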
- After checking that it works, I built a Docker image:
sudo docker build -t midterm_project .
We can test it by running the Docker image:
docker run -it --rm midterm_project
and executing:
python3 test.py
- Finally I deployed it to AWS with Elastic Beanstalk. For that I first installed the library:
pipenv install awsebcli --dev
then initialized EB:
eb init -p docker -r eu-north-1 midterm_project
and created the service:
eb create midterm-project-env
- Now we just need to test it. For that, I modified the line pointing to the URL in test.py and ran:
python3 test.py
(There is no need to change it now, since the server is still running on AWS as of 5 November 2023.)
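For the cloud test, the URL line in test.py points at the Elastic Beanstalk host listed at the top of this README instead of localhost (the variable name and the /predict path are assumptions):

```python
# Point the test request at the Elastic Beanstalk environment instead of localhost
url = 'http://midterm-project-env.eba-myupfmwp.eu-north-1.elasticbeanstalk.com/predict'
```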
I had some problems getting the feature names from the DictVectorizer.
In the course we used dv.get_feature_names(), but I got errors on my end and had to change it to list(dv.get_feature_names_out()).
After some reading, it looks like this is due to different scikit-learn versions.
I also read that get_feature_names() is being replaced by get_feature_names_out() in the library, so I kept it like that.
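A small version-tolerant sketch for getting the feature names, assuming only that newer scikit-learn releases expose get_feature_names_out() while older ones only have get_feature_names():

```python
# Use whichever feature-name method the installed scikit-learn version provides
def feature_names(dv):
    if hasattr(dv, 'get_feature_names_out'):
        return list(dv.get_feature_names_out())   # scikit-learn >= 1.0
    return list(dv.get_feature_names())           # older scikit-learn releases
```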