Samuel Alter's BrainStation 2023 Data Science Capstone Project, Spring 2023
- Wildfires, or the uncontrolled burning of vegetation, cause enormous damage and loss of life worldwide every year.1
- 33 people died and >4.3 million acres burned in California in 2020 alone. This is equivalent to the total combined area of Puerto Rico and Rhode Island.
- Although they are a natural component of some forest ecosystems, wildfire incidents are projected to increase in our warming world.2
- While society transitions to a greener future, there is a clear need to predict where wildfires are likely to occur so that communities can protect themselves and evacuate when necessary.
- For this project, I collected a combination of satellite imagery, topographic data (aspect, elevation, and slope), and historical fire boundaries to train models that would predict wildfire risk (Figure 1).
- I first used QGIS, an open-source mapping application, to set up the relationships between the datasets.
- I then used Python to process and model the data.
- I fed the processed data into machine learning and deep learning models to make my predictions.
- I chose to focus on the Santa Monica Mountains in Ventura and Los Angeles Counties for their relatively manageable size, proximity to populated areas, and east-west trend, which makes the topographic analysis easier.

- Satellite imagery and topographic data from USGS’ EarthExplorer3
- Historical records of wildfire burn boundaries from the National Interagency Fire Center4
- File formats used: .tif and .jpg (imagery); .geojson and .csv (historical wildfire data).
- To train a model that integrates spatial information with historical fire boundaries, I had to find a way to quantify the continuous spatial information. I settled on building a layer of evenly spaced points, with the topographic and fire boundary values found under each point appended to that point's record (Figure 2).
- Since the satellite photos were from 2018, I removed all fire polygons from after 2018 and merged the rest into one shape.
- Using the point layer, a spatial overlay could determine whether each part of the landscape had experienced a fire (“fire” area) or not (“nofire” area). The resultant file would serve as the “topographic” data in the project.
- For the geographic data, I focused on aspect (i.e. what direction the land is facing), elevation (meters above sea level), and slope (degrees). It is a straightforward operation to extract these data from an elevation raster and append them to the point layer.
- I tiled the satellite imagery into 128x128-pixel tiles to be fed into a TensorFlow image model, and saved them into two folders (fire and nofire).
- At the end of the data collection step, I had a table of geographic data and two image folders (fire areas and non-fire areas).
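The point-labeling step described above was done in QGIS; as a rough illustration of the same idea in plain Python, the sketch below builds an evenly spaced grid and flags each point as fire or nofire with a ray-casting point-in-polygon test. The function names, the rectangle standing in for the merged fire boundary, and all coordinates are made-up assumptions, not values from the project.

```python
# Sketch: label evenly spaced sample points as "fire" or "nofire".
# The polygon and coordinates below are hypothetical illustration values.

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices; the last vertex connects
    back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def make_grid(x_min, y_min, x_max, y_max, spacing):
    """Return evenly spaced (x, y) points covering a bounding box."""
    nx = int((x_max - x_min) // spacing) + 1
    ny = int((y_max - y_min) // spacing) + 1
    return [(x_min + i * spacing, y_min + j * spacing)
            for j in range(ny) for i in range(nx)]


# Hypothetical merged fire boundary (a simple rectangle).
merged_fire = [(2, 2), (8, 2), (8, 6), (2, 6)]

# 10 x 10 grid of points, offset by half a cell so none sit on the boundary.
grid = make_grid(0.5, 0.5, 9.5, 9.5, 1.0)

rows = [
    {"x": x, "y": y, "fire": int(point_in_polygon(x, y, merged_fire))}
    for x, y in grid
]

n_fire = sum(r["fire"] for r in rows)
print(len(rows), "points,", n_fire, "labeled fire")
```

In a real workflow the aspect, elevation, and slope sampled at each point would be added as extra keys on each row, and a GIS library would handle projections and multi-part polygons.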

- Since the data come in two forms (point-based topographic information and satellite image tiles), I trained two models. After finding an optimal model for each, I created a metamodel, which used the predictions from the topographic and image analysis models as features for a new logistic regression (Figure 3).
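The metamodel idea can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the project's code: the random features, the simulated image-model probabilities, and the use of out-of-fold predictions via `cross_val_predict` are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)  # 1 = fire, 0 = nofire (synthetic labels)

# Synthetic stand-in for the topographic table (three numeric columns).
X_topo = rng.normal(size=(n, 3)) + y[:, None]

# Synthetic stand-in for the image model's predicted fire probability.
p_image = np.clip(0.5 + (y - 0.5) * 0.6 + rng.normal(scale=0.2, size=n), 0, 1)

# Out-of-fold probabilities from the topographic model, so the metamodel
# is not trained on predictions the base model made for rows it had seen.
topo_model = GradientBoostingClassifier(random_state=0)
p_topo = cross_val_predict(
    topo_model, X_topo, y, cv=5, method="predict_proba"
)[:, 1]

# The two base-model predictions become the metamodel's only features.
X_meta = np.column_stack([p_topo, p_image])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_meta, y, test_size=0.25, random_state=0, stratify=y
)

meta = LogisticRegression().fit(X_tr, y_tr)
meta_accuracy = meta.score(X_te, y_te)
print("metamodel coefficients:", meta.coef_[0])
print("held-out accuracy:", round(meta_accuracy, 3))
```

The fitted coefficients give a rough sense of how much weight the metamodel places on each base model's prediction.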

- Topographic Data
- The areas affected by fire were predominantly located in mountainous regions with higher mean elevation compared to fire-free areas (Table 1).
- Correlations were observed between different topographic factors, such as categorical aspect with continuous aspect, and categorical slope and elevation with continuous elevation and slope.
- `LogisticRegression` was chosen as the modeling approach for classifying fire incidence based on topographic data, revealing the relative importance of different features.
- Various models were tested, including `naive_bayes`, `Bernoulli` and `Gauss`, `XGBClassifier`, `RandomForestClassifier`, `AdaBoost`, and `GradientBoostingClassifier`, with `GradientBoostingClassifier` achieving the highest accuracy.
- Grid search was employed to optimize the `GradientBoostingClassifier` model, but the best accuracy remained the same with specific parameter values.
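A grid search over a `GradientBoostingClassifier` might look like the sketch below. The synthetic aspect, elevation, and slope columns, the rule used to generate labels, and the parameter grid are assumptions chosen only to keep the example small; they are not the values used in the project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)
n = 300

# Synthetic topographic columns (illustrative ranges only).
aspect = rng.uniform(0, 360, n)      # degrees
elevation = rng.uniform(0, 900, n)   # meters above sea level
slope = rng.uniform(0, 45, n)        # degrees

# Made-up labeling rule: higher points are more likely to be "fire".
fire = (elevation + rng.normal(scale=150, size=n) > 450).astype(int)

X = np.column_stack([aspect, elevation, slope])
X_train, X_test, y_train, y_test = train_test_split(
    X, fire, test_size=0.25, random_state=42, stratify=fire
)

# Hypothetical parameter grid; a real search would likely be wider.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Compare the tuned model with the default settings on held-out data.
baseline = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
tuned_acc = search.score(X_test, y_test)

print("best parameters:", search.best_params_)
print("default accuracy:", round(baseline_acc, 3))
print("tuned accuracy:", round(tuned_acc, 3))
```

Comparing the tuned score against the default model on the same held-out split is one way to check whether tuning actually changed the accuracy.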

- Satellite Imagery
- BigEarthNet5 suggested that the pre-trained VGG19 model would be sufficient, so I used it with TensorFlow-Keras and added a final dense layer with two output nodes to represent the fire/nofire categories required by this project.
- With 20,025,410 total parameters, I deemed the model more than sufficient for the project’s needs.
- After training, the model achieved an accuracy of 95.6% on classifying all the images.
- Figure 4 shows typical images in the set as well as the locations of the two areas (fire and non-fire) that I used to feed the models.
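The parameter count quoted above can be checked by hand. The sketch below adds up the weights of VGG19's sixteen 3x3 convolution layers and then a two-node dense layer. One assumption is made that the text does not state: the dense layer is taken to sit on a 512-value feature vector (for example, after global pooling of the last convolutional block), which happens to reproduce the stated total.

```python
# VGG19 convolutional blocks: (number of 3x3 conv layers, output channels).
VGG19_BLOCKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]


def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one square convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(n_inputs, n_outputs):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def vgg19_base_params(input_channels=3):
    """Total parameters in VGG19's convolutional base (no dense top)."""
    total = 0
    channels = input_channels
    for n_layers, out_channels in VGG19_BLOCKS:
        for _ in range(n_layers):
            total += conv_params(channels, out_channels)
            channels = out_channels
    return total


base = vgg19_base_params()
# Assumption: the fire/nofire layer receives a pooled 512-value vector.
head = dense_params(512, 2)
total = base + head

print(base, head, total)  # 20024384 1026 20025410
```

In Keras terms, this layout would roughly correspond to `tf.keras.applications.VGG19(include_top=False, pooling="avg")` followed by `Dense(2)`, though the project's exact head may differ.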


- Metamodel
- To construct the metamodel, I extracted the predictions from the topographic and imagery datasets, used them as features, and ran the two through a scikit-learn `LogisticRegression`, which achieved over 99% accuracy. This concluded the modeling portion of the project.
- The statsmodels `Logit` model reveals that higher elevations and west- and north-facing hillslopes correlate with wildfires, potentially due to dominant wind directions in the area.
- The project serves as a proof-of-concept, demonstrating the capabilities of a machine-learning model using remotely-sensed data for predicting wildfire risk. Future iterations can expand by incorporating weather and time series analysis and testing the model on different landscapes.
- Despite limitations, such as lack of high-resolution weather data, the project shows promise for developing a robust wildfire risk prediction model with broader data and automated GIS workflows.
- Publishing the model in a more accessible manner could provide communities with valuable insights into their wildfire risk.
- Accuracy is almost 100%, which is suspiciously high.
- Since the model was trained on areas that were either 100% burned or 100% unburned, it only knows the clearly delineated cases.
- Furthermore, when tested, the model was only given 100% burned or 100% unburned areas.
- The unburned area was a city center, and a wildfire is very unlikely in a concrete jungle.
- The opposite is similarly true: in the mountains, away from all civilization, fires are extremely likely and are much harder to combat.
- Correlation matrix of the geographic data:

- Summary of models and results for TopoData:

- Summary of satellite imagery model:

- Summary of metamodel:

- Satellite image and historic wildfire boundaries (up to 2018):

- Satellite image and hillshade of study area:
