ClassiFIRE

Samuel Alter's BrainStation 2023 Data Science Capstone Project, Spring 2023

Motivation

Wildfires, or the uncontrolled burning of vegetation, cause enormous damage and loss of life worldwide every year.¹
In California alone, 33 people died and more than 4.3 million acres burned in 2020, an area roughly equal to Puerto Rico and Rhode Island combined.
Although they are a natural component of some forest ecosystems, wildfire incidents are projected to increase in our warming world.²
While society transitions to a greener future, there is a clear need to predict where wildfires are likely to occur so that communities can protect themselves and evacuate when necessary.

Introduction

For this project, I collected a combination of satellite imagery, topographic data (aspect, elevation, and slope), and historical fire boundaries to train models that would predict wildfire risk (Figure 1).
I first used QGIS, an open-source mapping application, to set up the relationships between the datasets.
I then used Python to process and model the data.
I fed the processed data into machine learning and deep learning models to make my predictions.
I chose to focus on the Santa Monica Mountains in Ventura and Los Angeles Counties for their relatively manageable size, proximity to populated areas, and east-west trend, which makes the topographic analysis easier.

Datasets Used

Satellite imagery and topographic data from USGS’ EarthExplorer³
Historical records of wildfire burn boundaries from the National Interagency Fire Center⁴
File formats used: .tif and .jpg (imagery); .geojson and .csv (historical wildfire data).

Mapping, Data Cleaning, and Preprocessing

To train a model that integrates spatial information with historical fire boundaries, I had to find a way to quantify the continuous spatial information. I settled on building a layer of evenly spaced points, each with the topographic and fire boundary data from beneath it appended to the file (Figure 2).
Since the satellite photos were from 2018, I removed all fire polygons from after 2018 and merged the rest into one shape.
Using the point layer, an overlay operation determined whether each part of the landscape had experienced a fire (“fire” area) or not (“nofire” area). The resulting file served as the “topographic” data in the project.
For the geographic data, I focused on aspect (i.e., the direction the land is facing), elevation (meters above sea level), and slope (degrees). Extracting these values from an elevation raster and appending them to the point layer is a straightforward operation.
I cut the satellite imagery into 128x128-pixel tiles to feed into a TensorFlow image model, and saved them into two folders (i.e., fire and nofire).
At the end of the data collection step, I had a table of geographic data and two image folders (fire areas and non-fire areas).
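The point-labeling step described above can be sketched in plain Python. This is only an illustration: the project did this work in QGIS, and the function names (`make_grid`, `point_in_polygon`, `label_points`), the square "burned" polygon, and the grid spacing are all hypothetical stand-ins.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def make_grid(xmin: float, ymin: float, xmax: float, ymax: float,
              spacing: float) -> List[Point]:
    """Return evenly spaced (x, y) points covering a bounding box."""
    n_cols = int(math.floor((xmax - xmin) / spacing)) + 1
    n_rows = int(math.floor((ymax - ymin) / spacing)) + 1
    return [(xmin + col * spacing, ymin + row * spacing)
            for row in range(n_rows) for col in range(n_cols)]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: a point is inside if a ray to its right crosses
    the polygon boundary an odd number of times."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def label_points(points: List[Point], burned: List[Point]) -> List[Tuple[float, float, str]]:
    """Attach a 'fire' or 'nofire' label to every grid point."""
    return [(x, y, "fire" if point_in_polygon((x, y), burned) else "nofire")
            for x, y in points]


# Hypothetical example: a 5 x 5 grid and one made-up merged fire polygon.
grid = make_grid(0.0, 0.0, 4.0, 4.0, spacing=1.0)
burned_area = [(0.5, 0.5), (2.5, 0.5), (2.5, 2.5), (0.5, 2.5)]
labeled = label_points(grid, burned_area)
n_fire = sum(1 for _, _, label in labeled if label == "fire")
```

In a real workflow the polygon would come from the merged fire boundaries, and the elevation, slope, and aspect values would be appended to each labeled point in the same table.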

Figure 2: Mapping flow from QGIS to modeling
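The tiling step can be illustrated with a short NumPy sketch. It assumes the image is already loaded as a height x width x channels array and that partial tiles at the right and bottom edges are dropped; the text does not say how edges were handled, and `tile_image` is a hypothetical helper name.

```python
import numpy as np


def tile_image(image: np.ndarray, tile_size: int = 128) -> list:
    """Split an (H, W, C) array into non-overlapping square tiles.

    Partial tiles at the right and bottom edges are discarded
    (an assumption made for this sketch).
    """
    height, width = image.shape[:2]
    tiles = []
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tiles.append(image[top:top + tile_size, left:left + tile_size])
    return tiles


# Stand-in for a satellite image: 300 x 400 pixels, 3 color bands.
image = np.zeros((300, 400, 3), dtype=np.uint8)
tiles = tile_image(image)
```

Each tile would then be written to either the fire or the nofire folder, depending on where it falls relative to the merged fire boundary.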

Modeling, Results, and Insights

Since the data come in two forms (point-based topographic information and satellite image tiles), I trained two models. After finding the best model for each, I created a metamodel, which used the predictions from the topographic and image analysis models as features for a new logistic regression (Figure 3).

Figure 3: Modeling pipelines using statsmodels, sklearn, TensorFlow (VGG19), and finally a metamodel with sklearn's logistic regression.

Topographic Data
- The areas affected by fire were predominantly located in mountainous regions with higher mean elevation compared to fire-free areas (Table 1).
- Correlations were observed between related topographic factors: categorical aspect with continuous aspect, and categorical slope and elevation with their continuous counterparts.
- LogisticRegression was chosen as the modeling approach for classifying fire incidence based on topographic data, revealing the relative importance of different features.
- Various models were tested, including naive Bayes (Bernoulli and Gaussian), XGBClassifier, RandomForestClassifier, AdaBoost, and GradientBoostingClassifier, with GradientBoostingClassifier achieving the highest accuracy.
- Grid search was used to tune the GradientBoostingClassifier, but the best parameter values it found did not improve the accuracy.
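A minimal scikit-learn sketch of the compare-then-tune workflow in the list above follows. The data are synthetic, the rule that generates the labels is made up for illustration, and only a subset of the models named above is included; the parameter grid is an assumption, not the one used in the project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the point table (elevation, slope, aspect).
rng = np.random.default_rng(0)
n_points = 600
elevation = rng.uniform(0, 900, n_points)   # meters
slope = rng.uniform(0, 45, n_points)        # degrees
aspect = rng.uniform(0, 360, n_points)      # degrees
# Made-up labeling rule: higher points are more often "fire".
burned = (elevation + rng.normal(0, 150, n_points) > 450).astype(int)

X = np.column_stack([elevation, slope, aspect])
X_train, X_test, y_train, y_test = train_test_split(
    X, burned, test_size=0.25, random_state=0, stratify=burned
)

# Fit each candidate and record its held-out accuracy.
candidates = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gaussian_nb": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in candidates.items()
}

# Tune the gradient-boosting model over a small, illustrative grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)
tuned_accuracy = search.score(X_test, y_test)
```

Comparing `scores["gradient_boosting"]` with `tuned_accuracy` shows whether tuning helped; a tuned score that matches the default one is a plausible outcome when the defaults are already near the best settings in the grid.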

Table 1: Summary statistics by fire/nofire areas

Satellite Imagery
- BigEarthNet⁵ suggested that the pre-trained VGG19 model would be sufficient, so I used it with TensorFlow-Keras and added a final dense layer with two output nodes to represent the fire/nofire categories required by this project.
- With 20,025,410 total parameters, I deemed the model more than sufficient for the project’s needs.
- After training, the model achieved an accuracy of 95.6% when classifying all the images.
- Figure 4 shows typical images in the set as well as the locations of the two areas (fire and non-fire) that I used to feed the models.
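The reported total of 20,025,410 parameters can be reproduced by hand. The sketch below assumes the VGG19 convolutional base (without its original classifier head) feeds a 512-value pooled feature vector into the two-node dense layer. That arrangement is inferred from the parameter count; the text does not state it. The helper names are hypothetical.

```python
# VGG19 convolutional blocks: (number of 3x3 conv layers, output channels).
VGG19_BLOCKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]


def conv_params(in_channels: int, out_channels: int, kernel: int = 3) -> int:
    """Weights plus one bias per output channel for a square conv layer."""
    return (kernel * kernel * in_channels + 1) * out_channels


def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Weights plus one bias per output node for a dense layer."""
    return (n_inputs + 1) * n_outputs


def vgg19_base_params(input_channels: int = 3) -> int:
    """Total parameters in the VGG19 convolutional base."""
    total = 0
    in_channels = input_channels
    for n_layers, out_channels in VGG19_BLOCKS:
        for _ in range(n_layers):
            total += conv_params(in_channels, out_channels)
            in_channels = out_channels
    return total


base = vgg19_base_params()    # convolutional base only
head = dense_params(512, 2)   # assumed: 512 pooled features -> 2 classes
total = base + head
```

Under these assumptions `total` equals 20,025,410, which matches the figure quoted above. Flattening the final feature map instead of pooling it would give a larger head and a different total.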

Figure 4: Examples of fire and nofire satellite images

Figure 3: Map of the sections within the study area used to feed the models

Metamodel
- To construct the metamodel, I extracted the predictions from the topographic and imagery models, used them as the two features of a scikit-learn LogisticRegression, and achieved over 99% accuracy. This concluded the modeling portion of the project.
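The stacking idea can be sketched as follows. The two prediction columns are synthetic stand-ins generated from made-up noise levels, not outputs of the project's models, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_samples = 500
y = rng.integers(0, 2, n_samples)  # 1 = fire, 0 = nofire

# Synthetic base-model outputs: a weaker "topographic" probability and a
# stronger "image" probability (noise levels are assumptions).
topo_prob = np.clip(0.5 + (y - 0.5) * 0.4 + rng.normal(0, 0.2, n_samples), 0, 1)
image_prob = np.clip(0.5 + (y - 0.5) * 0.8 + rng.normal(0, 0.1, n_samples), 0, 1)

# The metamodel sees only the two base-model predictions as features.
X_meta = np.column_stack([topo_prob, image_prob])
X_train, X_test, y_train, y_test = train_test_split(
    X_meta, y, test_size=0.25, random_state=0, stratify=y
)

meta_model = LogisticRegression()
meta_model.fit(X_train, y_train)
meta_accuracy = meta_model.score(X_test, y_test)
```

The fitted coefficients in `meta_model.coef_` indicate how much weight the metamodel gives each base model. In general, the base-model predictions used to train a metamodel should come from samples those base models did not train on, otherwise the stacked accuracy can be inflated.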

Findings and Conclusions

The statsmodels Logit model reveals that higher elevations and west- and north-facing hillslopes correlate with wildfires, potentially due to dominant wind directions in the area.
The project serves as a proof of concept, demonstrating that a machine-learning model can use remotely sensed data to predict wildfire risk. Future iterations could build on it by incorporating weather data and time series analysis and by testing the model on different landscapes.
Despite limitations, such as the lack of high-resolution weather data, the project shows promise for developing a robust wildfire risk prediction model with broader data and automated GIS workflows.
Publishing the model in a more accessible manner could provide communities with valuable insights into their wildfire risk.

Postscript: Commentary on why the accuracy is so high

Accuracy is almost 100%, which is suspiciously high.
Since the model was trained on areas that were either 100% burned or 100% unburned, it only knows the clearly delineated cases.
Furthermore, when tested, the model was only given 100% burned or 100% unburned areas.
The unburned area was a city center. There would never be a wildfire in a concrete jungle.
The opposite is similarly true: in the mountains, away from all civilization, fires are extremely likely and are much harder to combat.

Extra goodies

Correlation matrix of the geographic data:


Summary of models and results for TopoData:


Summary of satellite imagery model:

Summary of metamodel:

Satellite image and historic wildfire boundaries (up to 2018):


Satellite image and hillshade of study area:

