This is a capstone project associated with MLOps Zoomcamp, and it will be peer reviewed and scored.
The end goal of the project is to build an end-to-end machine learning project containing feature engineering, trainig, vallidation,tracking, modeel deployment,hosting and general engineering best practices aimed at making house price prediction.
This data set has 414 rows and 7 columns. It provides the market historical data set of real estate valuations which are collected from Sindian Dist., New Taipei City, Taiwan. This data set is recommended for learning and practicing your skills in exploratory data analysis, data visualization, and regression modelling techniques. Feel free to explore the data set with multiple supervised and unsupervised learning techniques. The Following data dictionary gives more details on this data set:
Column Position | Atrribute Name | Definition | Data Type | Example | % Null Ratios |
---|---|---|---|---|---|
1 | X1 transaction date | The transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.) | Qualitative | 2013.500, 2013.500, 2013.333 | 0 |
2 | X2 house age | The house age (unit: year) | Quantitative | 19.5, 13.3, 5.0 | 0 |
3 | X3 distance to the nearest MRT station | The distance to the nearest MRT station (unit: meter) | Quantitative | 390.5684, 405.21340, 23.38284 | 0 |
4 | X4 number of convenience stores | The number of convenience stores in the living circle on foot | Quantitative | 6, 8, 1 | 0 |
5 | X5 latitude | The geographic coordinate, latitude (unit: degree) | Quantitative | 24.97937, 24.97544, 24.94925 | 0 |
6 | X6 longtitude | The geographic coordinate, longitude (unit: degree) | Quantitative | 121.54243, 121.49587, 121.51151 | 0 |
7 | Y house price of unit area | The house price of unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 meter squared) for example, 29.3 = 293,000 New Taiwan Dollar/Ping | Quantitative | 29.3, 33.6, 47.7 |
The architecture below depicts the system design:
Language, frameworks, libraries, Services and Tools used to bootstrap this project.
- : Container
- : Prediction service (web app)
- : s3 for storage,RDS as database, EC2 as virtual machine
- : Experiment tracking and model registry
- : Workflow orchestration
- : open source app framework in Python language
- : Monitoring
- : Monitoring Dashboard
- : Monitoring Database
- Pylint + Black + isort : Linter and code formaters
- Training , orchestration, Tracking, Model Registry & Deployment
make train
- Prediction service setup , Monitoring service setup, Integratin Test, Streamlit provisioning
make build
- Batch Prediction
python stream_send.py
- Prediction
[http:](http://localhost:8501)
The following is the resulting repo structure:
|-- Makefile
|-- README.md
|-- Test
| `-- integration_test
| `-- run.sh
|-- Tracking_Orchestration
| |-- Pipfile
| |-- Pipfile.lock
| |-- test.py
| |-- track.sh
| `-- train.py
|-- data
| |-- batch_test.csv
| |-- data.xlsx
| `-- train.csv
|-- images
| |-- MLFLOW_EXPER.PNG
| |-- deploy.PNG
| |-- docker.PNG
| |-- drift.PNG
| |-- mlflow_model.PNG
| |-- train.PNG
| `-- web_page_STREAMLIT.PNG
|-- pyproject.toml
|-- streamlit
| |-- Dockerfile
| |-- Pipfile
| |-- Pipfile.lock
| |-- frontend.py
| `-- images
| `-- house.jpg
`-- web_service_monitoring
|-- Pipfile
|-- Pipfile.lock
|-- docker-compose.yml
|-- evidently_service
| |-- Dockerfile
| |-- app.py
| |-- config
| | |-- grafana_dashboards.yaml
| | |-- grafana_datasources.yaml
| | `-- prometheus.yml
| |-- config.yaml
| |-- dashboards
| | |-- cat_target_drift.json
| | |-- classification_performance.json
| | |-- data_drift.json
| | |-- num_target_drift.json
| | `-- regression_performance.json
| |-- datasets
| | `-- train.csv
| `-- requirements.txt
|-- prediction_service
| |-- Dockerfile
| |-- app.py
| `-- requirements.txt
|-- requirements.txt
|-- stream_send.py
`-- test.py
13 directories, 46 files
I am extremely grateful for the time this set of wonderful people put in place to ensure we understood the various aspect of data and analytical engineering