We are computer science & statistics students at LMU Munich, and this project was created as part of a Data Science Practical. Our plan was to participate in the associated Kaggle challenge and subsequently build a product around the trained model. In the end we placed 160th on the public leaderboard and, unfortunately, only 509th on the private leaderboard.
Have fun checking out our stuff!
Cheers
- Q: How much does it cost to cool a skyscraper in the summer?
- A: A lot! And not just in dollars, but in environmental impact.
Thankfully, significant investments are being made to improve building efficiencies to reduce costs and emissions. So, are the improvements working? That’s where you come in. Current methods of estimation are fragmented and do not scale well. Some assume a specific meter type or don’t work with different building types.
Developing energy savings has two key elements: forecasting future energy usage without improvements, and forecasting energy use after a specific set of improvements has been implemented, like the installation and purchase of investment-grade meters, whose prices continue to fall. One issue preventing more aggressive growth of the energy markets is the lack of cost-effective, accurate, and scalable procedures for forecasting energy use.
In this competition, you’ll develop accurate predictions of metered building energy usage in the following areas: chilled water, electric, natural gas, hot water, and steam meters. The data comes from over 1,000 buildings over a three-year timeframe.
With better estimates of these energy-saving investments, large scale investors and financial institutions will be more inclined to invest in this area to enable progress in building efficiencies.
The provided data consists of ~20 million rows for training (one-year timespan) and ~40 million rows for testing (two-year timespan). The target variable is the hourly reading from one of four meter types {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. For building the model, the data provides the following features out of the box:
- building_id --> Foreign key for the building metadata
- meter --> The meter id code, read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}
- timestamp --> Hour of the measurement
- site_id --> Identifier of the site the building is on
- primary_use --> Primary category of activities for the building
- square_feet --> Floor area of the building
- year_built --> Year the building was built
- floor_count --> Number of floors of the building
In addition, weather data has been provided, which includes air_temperature, cloud_coverage, dew_temperature, precip_depth_1_hr, sea_level_pressure, wind_direction and wind_speed.
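To illustrate how these tables fit together, here is a minimal pandas sketch of the joins, using the competition's raw file names. It is illustrative only; the project's actual loading code lives in `src/data/make_dataset.py`.

```python
import pandas as pd

# Raw competition files: meter readings, building metadata, site-level weather
train = pd.read_csv("data/raw/train.csv", parse_dates=["timestamp"])
buildings = pd.read_csv("data/raw/building_metadata.csv")
weather = pd.read_csv("data/raw/weather_train.csv", parse_dates=["timestamp"])

# building_id links readings to their building; site_id + timestamp link to weather
df = (
    train
    .merge(buildings, on="building_id", how="left")
    .merge(weather, on=["site_id", "timestamp"], how="left")
)
print(df.shape)
```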
- The Data: The raw data has to be placed in `data/raw`. A good practice is to download the data via the Kaggle CLI:
  kaggle competitions download -c ashrae-energy-prediction
  mkdir -p data/raw
  unzip ashrae-energy-prediction.zip -d data/raw
- Use the configuration file: The configuration settings are located in `src/config.yml`. This is important to customize feature engineering and model training.
- Prepare data for training: Using `make data`, the data is loaded and the several `.csv` files are joined into a consistent data frame. The result is saved in `data/interim`. Next, use `make features` to conduct the feature engineering process. The result is saved in `data/processed`.
- Train a model: The frameworks used are LightGBM, CatBoost and XGBoost. We personally had our best experiences with LightGBM, but feel free to try different frameworks or settings. The default settings are the parameters that were determined through a hyperparameter search. To train a model, use `make train MODEL=<framework> MODE=<mode>`. For the `MODEL` parameter you can use lgbm (LightGBM), ctb (CatBoost) or xgb (XGBoost). All models work with `MODE=cv` (cross-validation), which is our preferred way and gave us the best results (a conceptual sketch of this mode follows the list). For LightGBM the following options are also available: full (training on the whole dataset without a validation set), by_meter (training one model per meter type), by_building (training one model per building id). The models will be saved in the directory of the same name.
- Make a prediction: The easiest way is to use `make predict MODEL_PATH=<modelpath> MODEL=<model>`, where `MODEL_PATH` should point to the directory of the saved models or to the model itself. The `MODEL` parameter describes the framework of the model, analogous to the step above. The result is a `.csv` file, which is dumped in the `submission` directory and is ready for uploading to Kaggle. An important flag is whether to use leaks or not, as it heavily influences the resulting submission file.
- Submit to Kaggle: The submission file can be uploaded to the respective Kaggle challenge simply via the CLI tool:
  kaggle competitions submit -c ashrae-energy-prediction -f submission.csv -m "<submit message>"
  Keep in mind that for authentication reasons your Kaggle API token has to be placed in `~/.kaggle`.
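For intuition, the following is a minimal sketch of what the cross-validated LightGBM training roughly looks like. It is illustrative only; the file path, feature selection and parameters here are assumptions, and the actual logic lives in `src/models/train_lgbm_model.py`.

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Assumed location/format of the processed training data (see `make features`)
df = pd.read_parquet("data/processed/train.parquet")
features = [c for c in df.columns if c not in ("meter_reading", "timestamp")]
X, y = df[features], np.log1p(df["meter_reading"])  # RMSLE metric -> log1p target

params = {"objective": "regression", "metric": "rmse", "learning_rate": 0.05}
models = []
for fold, (tr_idx, val_idx) in enumerate(
        KFold(n_splits=3, shuffle=True, random_state=42).split(X)):
    train_set = lgb.Dataset(X.iloc[tr_idx], label=y.iloc[tr_idx])
    valid_set = lgb.Dataset(X.iloc[val_idx], label=y.iloc[val_idx])
    model = lgb.train(params, train_set, num_boost_round=1000,
                      valid_sets=[valid_set],
                      callbacks=[lgb.early_stopping(stopping_rounds=50)])
    models.append(model)  # predictions are later averaged over the folds
```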
Unfortunately a portion of the test labels was leaked, which stirred up the whole competition. If you want to use the leaks for your own experiments, you have to set the respective flags in the config file. Additionally, the leaks have to be downloaded from here and placed in `./data/leaks`.
Part two of this project focuses on the development of a web application that incorporates the model trained in phase 1. Obviously this is more or less a proof of concept, but the main idea is that users (e.g. housing associations) can enter their real estate and predict its energy consumption for a defined period of time, so that time points with high energy needs can be identified. So far the forecast only covers a range of 5 days, since this is the limit for free usage of our chosen weather API.
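As an illustration of that 5-day limit, OpenWeatherMap's free 5-day / 3-hour forecast endpoint could be queried as below. This is a sketch, not the app's actual client code; which endpoint the app calls is an assumption here.

```python
import requests

API_KEY = "<your-openweathermap-api-key>"  # the real app reads the key from config.py

resp = requests.get(
    "https://api.openweathermap.org/data/2.5/forecast",
    params={"q": "Munich,DE", "units": "metric", "appid": API_KEY},
    timeout=10,
)
resp.raise_for_status()
entries = resp.json()["list"]  # 3-hourly forecast entries covering ~5 days
print(len(entries), entries[0]["main"]["temp"])
```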
The app consists of a classic nginx/uwsgi/flask stack and is deployed as a composition of three microservices: the app itself (front end and the necessary back end), a Postgres database for user/building data, and the LightGBM model served via a REST API. Our app comes with a built-in API to receive energy predictions. The endpoint is available on port 5001 under the path `/predict`. To get a prediction you need to attach a JSON payload with the following attributes. An example can be found here. The easiest way to obtain a fitting format is to convert a pandas dataframe to JSON using the built-in methods (see the sketch after the table below).
Feature | Value |
---|---|
primary_use | Integer from 0 to 16 describing the primary use of the building |
square_feet | Float describing the logarithm of the building's floor area in square feet |
year_built | Integer describing the year the building was built |
cloud_coverage | Float describing cloud coverage |
meter | Integer between 0 and 3 describing the desired meter |
floor_count | Integer describing the number of floors |
air_temperature | Float describing air temperature |
relative_humidity | Float describing relative humidity |
dew_temperature | Float describing dew temperature |
precip_depth_1_hr | Float describing hourly precipitation |
air_temp_f | Float describing air temperature in Fahrenheit |
feels_like_temp | Float describing the feels-like temperature |
hour | Integer between 0 and 23 |
weekday | Integer between 0 and 6 |
area_per_floor | Float with square_feet / floor_count |
outlier_square_feet | Boolean, whether square_feet is an outlier (usually no) |
outlier_area_per_floor | Boolean, whether area_per_floor is an outlier (usually no) |
air_temperature_6_lag | 6-hour rolling average of air_temperature |
air_temperature_24_lag | 24-hour rolling average of air_temperature |
dew_temperature_6_lag | 6-hour rolling average of dew_temperature |
dew_temperature_24_lag | 24-hour rolling average of dew_temperature |
cloud_coverage_6_lag | 6-hour rolling average of cloud_coverage |
cloud_coverage_24_lag | 24-hour rolling average of cloud_coverage |
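As a minimal sketch (the exact JSON orientation and response format are assumptions, so check the linked example), a payload could be built from a one-row pandas dataframe and sent like this:

```python
import pandas as pd
import requests

# Illustrative values only; see the table above for the expected fields
row = pd.DataFrame([{
    "primary_use": 0,
    "square_feet": 9.2,          # log of the floor area
    "year_built": 1995,
    "cloud_coverage": 4.0,
    "meter": 0,                  # electricity
    "floor_count": 3,
    "air_temperature": 21.5,
    "relative_humidity": 55.0,
    "dew_temperature": 12.0,
    "precip_depth_1_hr": 0.0,
    "air_temp_f": 70.7,
    "feels_like_temp": 21.0,
    "hour": 14,
    "weekday": 2,
    "area_per_floor": 3.1,
    "outlier_square_feet": False,
    "outlier_area_per_floor": False,
    "air_temperature_6_lag": 20.8,
    "air_temperature_24_lag": 19.9,
    "dew_temperature_6_lag": 11.7,
    "dew_temperature_24_lag": 11.2,
    "cloud_coverage_6_lag": 3.5,
    "cloud_coverage_24_lag": 3.8,
}])

# pandas' built-in JSON conversion produces the payload
payload = row.to_json(orient="records")

resp = requests.post(
    "http://localhost:5001/predict",
    data=payload,
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```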
Requirements:
- Docker 19.03 or above (for multi-stage builds)
Go into `./app` and run the command:
$ docker-compose up --build
Open http://localhost:80 and enjoy!
Don't forget to enter your OpenWeatherMap API key in `config.py`.
To use the app, you must download the model from here and save it as `./app/model/model.txt`.
If you have changed the pre-processing code, go to the project root and run:
$ python setup.py bdist_wheel
$ cp dist/energeek-0.1.0-py3-none-any.whl app/app/packages/
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── app <- The top-level directory for all files regarding the web app.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ ├── train_lgbm_model.py
│ │ ├── train_ctb_model.py
│ │ ├── train_xgb_model.py
│ │ └── find_hyperparameter_lgbm.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│
└── tox.ini <- tox file with settings for running tox; see tox.testrun.org