This article discusses risk-level estimation for **natural hazards** and how new tools can be used to measure the forest fire risk level in an area for insurance purposes. We propose an **integration of a geospatial risk model** with the insurance business model, **using aerial satellite imagery** and **location statistics**.

Each year insurance companies and their clients lose hundreds of **millions of dollars** due to errors in forest fire detection. Thus, improving the risk-identification model for forest fire prediction is a **necessity** for insurance businesses. We demonstrate how advanced tools like **machine learning**, big data, and remote sensing could be the keys to predicting forest fires in an area.

**Data Preparation**

The first step is acquiring fire data from __NASA's FIRMS application__. The Fire Information for Resource Management System (FIRMS) dataset includes daily counts spanning more than 5 years. With the FIRMS API, we can get data for fire occurrences in a region and then use that data to build an API that performs live fire forecast prediction for us.
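As a minimal sketch of fetching FIRMS detections, the snippet below builds an area-API request URL and reads the CSV response with pandas. The endpoint path, the `VIIRS_SNPP_NRT` source name, and `YOUR_MAP_KEY` are assumptions based on the public FIRMS API documentation; check the FIRMS site for your own map key and the current source list.

```python
import pandas as pd

FIRMS_BASE = "https://firms.modaps.eosdis.nasa.gov/api/area/csv"

def firms_url(map_key, source, bbox, day_range, date):
    """Build a FIRMS area-API URL.

    bbox is (west, south, east, north) in decimal degrees.
    """
    area = ",".join(str(c) for c in bbox)
    return f"{FIRMS_BASE}/{map_key}/{source}/{area}/{day_range}/{date}"

# Hypothetical example: VIIRS detections over northern India for 10 days
url = firms_url("YOUR_MAP_KEY", "VIIRS_SNPP_NRT",
                (73.0, 27.0, 81.0, 33.0), 10, "2018-04-01")
# fires = pd.read_csv(url)  # columns include latitude, longitude, acq_date, ...
```

The CSV returned by the API can then be saved locally, so the preprocessing steps below don't re-hit the API on every run.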

**Dataframe Format**

The **WKT** column contains longitude and latitude as **shapely** POINT geometries, so one can plot them on a map using geopandas.
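If geopandas is installed, `geopandas.GeoSeries.from_wkt` converts the column directly; as a dependency-free sketch, the snippet below parses `POINT (lon lat)` strings into plain longitude/latitude columns with the standard library. The sample coordinates are illustrative, not from the real dataset.

```python
import re
import pandas as pd

def parse_wkt_point(wkt):
    """Extract (longitude, latitude) from a 'POINT (lon lat)' WKT string."""
    m = re.match(r"POINT\s*\(\s*([-\d.]+)\s+([-\d.]+)\s*\)", wkt)
    if m is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return float(m.group(1)), float(m.group(2))

# Hypothetical rows in the FIRMS dataframe
df = pd.DataFrame({"WKT": ["POINT (75.8577 30.9010)",
                           "POINT (77.1025 28.7041)"]})
df[["longitude", "latitude"]] = df["WKT"].apply(parse_wkt_point).tolist()
```

With geopandas available, `gpd.GeoDataFrame(df, geometry=gpd.GeoSeries.from_wkt(df["WKT"]))` gives a frame ready for `.plot()` on a basemap.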

The **acq_date** column contains the acquisition date of the fire point.

**Note**: there is a latency of around 4 hours between the satellite capturing the data and it becoming available through the API.

**Historical Data**

**Data For Region Analysis**

For demonstration purposes, we are focusing on the **northern part of India**, which includes Haryana, Punjab, Uttarakhand, and some parts of Uttar Pradesh. In our model, we are focusing on the number of fire points in a region per day.

To get the number of fire points in a region per day, we first filtered the data for the region using a **bounding box** to get the max-min limits for longitude and latitude. After that, data extraction and **transformation** got us to the day-wise fire counts for the specific region.
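The filter-and-count step above can be sketched with pandas as follows; the bounding-box values and sample points are illustrative assumptions, not the real region limits.

```python
import pandas as pd

# Hypothetical fire-detection points (longitude, latitude, acq_date)
fires = pd.DataFrame({
    "longitude": [75.1, 76.5, 82.0, 75.9],
    "latitude":  [30.2, 29.8, 25.0, 30.5],
    "acq_date":  ["2018-04-01", "2018-04-01", "2018-04-01", "2018-04-02"],
})

# Bounding box for the region of interest (min/max longitude and latitude)
lon_min, lon_max, lat_min, lat_max = 73.0, 81.0, 27.0, 33.0
in_region = fires[
    fires["longitude"].between(lon_min, lon_max)
    & fires["latitude"].between(lat_min, lat_max)
]

# Day-wise fire counts for the region
daily_counts = in_region.groupby("acq_date").size().rename("fire_count")
```

`daily_counts` is the per-day series that the time-series analysis below works on.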

Once we got the day-wise fire counts for the region, time-series analysis showed that the dataset doesn't have counts for all dates, as seen in the image below.

Thus, to create and train our ML model, we will need to further process our dataset.

**Missing Values Representation for the year 2017-2018**

**Data Preprocessing**

A machine learning model cannot be trained on data that has null values. One way to tackle these null values is to delete those data points from the dataset. But in our case, we are dealing with time-series data in which **continuity is an important factor**. Instead, we used **imputation**, the process of replacing missing data with substituted values.

There are several ways to impute data:

**Mean value imputation:** In this method of imputation, we replace the missing values with the mean of that column.

**Median value imputation:** The missing values are replaced with the median of the column.

**Linear interpolation:** Linear interpolation is an imputation technique that assumes a linear relationship between data points and utilizes non-missing values from adjacent data points to compute a value for a missing data point.
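All three imputation options above are one-liners in pandas; the sketch below shows them side by side on a small made-up daily-count series with two gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical daily fire counts with two missing days
counts = pd.Series(
    [10.0, np.nan, np.nan, 40.0, 20.0],
    index=pd.date_range("2018-04-01", periods=5),
)

mean_filled   = counts.fillna(counts.mean())          # mean imputation
median_filled = counts.fillna(counts.median())        # median imputation
linear_filled = counts.interpolate(method="linear")   # linear interpolation
print(linear_filled.tolist())  # [10.0, 20.0, 30.0, 40.0, 20.0]
```

For a continuous time series like daily fire counts, linear interpolation preserves the local trend between known days, which is why it suits this dataset better than a global mean or median fill.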

**Replacement of Missing Values Using Imputation**

**Dealing with outliers**

The next step in **preprocessing** is dealing with **outliers**. In the dataset, there are some days for which the fire point count is around 5000. To remove these large bumps, we applied an upper bound: if the fire count on some day exceeds the bound, it is reset to the bound. The upper bound is calculated by trial and error so that it removes the maximum number of outliers while the overall dataset loses as little pattern information as possible.
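The clipping step above amounts to one pandas call; the bound of 500 below is a placeholder for whatever value the trial-and-error search settles on.

```python
import pandas as pd

# Hypothetical day-wise counts, including two ~5000-count spikes
daily_counts = pd.Series([120, 95, 5000, 80, 4700, 110])

UPPER_BOUND = 500  # found by trial and error on the real data
clipped = daily_counts.clip(upper=UPPER_BOUND)
print(clipped.tolist())  # [120, 95, 500, 80, 500, 110]
```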

**Data After Removing Outlier**

**Scaling the dataset**

The last step of preprocessing is scaling the dataset. It is good practice to scale the dataset to the range 0 to 1 using a min-max scaler. The difference between the minimum and maximum is stored in a variable called the multiplier factor so that we can recover the actual counts whenever we want.
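A minimal sketch of this min-max scaling, keeping the multiplier factor (and the minimum) so the transform is invertible:

```python
import pandas as pd

counts = pd.Series([80.0, 120.0, 500.0, 95.0])  # clipped daily counts

c_min, c_max = counts.min(), counts.max()
multiplier_factor = c_max - c_min               # stored to undo the scaling

scaled = (counts - c_min) / multiplier_factor   # values now lie in [0, 1]
restored = scaled * multiplier_factor + c_min   # recover the actual counts
```

`sklearn.preprocessing.MinMaxScaler` does the same thing and stores the parameters on the fitted scaler; the manual version just makes the multiplier factor explicit.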

**Split the data**

We'll use a (70%, 20%, 10%) split for the training, test, and validation sets. Note that the data is not randomly shuffled before splitting, for two reasons: it ensures that chopping the data into **windows of consecutive samples** is still possible, and it makes the validation/test results more realistic, since the model is evaluated on data collected after the period it was trained on.
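The chronological split above can be sketched as plain array slicing; the stand-in data is illustrative.

```python
import numpy as np

data = np.arange(100)  # stand-in for the scaled daily fire counts
n = len(data)

# Chronological 70/20/10 split -- no shuffling, so windows of
# consecutive samples stay intact and evaluation uses later data
train = data[: int(0.7 * n)]
test  = data[int(0.7 * n): int(0.9 * n)]
val   = data[int(0.9 * n):]
print(len(train), len(test), len(val))  # 70 20 10
```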

**Data Windowing**

Given a sequence of numbers from a time-series dataset, we can restructure the data to look like a **supervised learning** problem. We can do this by using previous time steps as input variables and the next time step as the output variable.

**Window Generation for Dataset**

Here the window contains data for 8 timestamps, of which the first seven are used as input to the model and the 8th is used as the target. During inference, we need only the 7 previous timestamps to predict the next one.
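The window generation described above can be sketched as a simple sliding-window function over the scaled series:

```python
import numpy as np

def make_windows(series, window_size=8):
    """Slide a window over the series: the first 7 values become the
    model input and the 8th value becomes the prediction target."""
    X, y = [], []
    for i in range(len(series) - window_size + 1):
        window = series[i : i + window_size]
        X.append(window[:-1])   # 7 input timestamps
        y.append(window[-1])    # 8th timestamp = target
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)  # stand-in for the scaled counts
X, y = make_windows(series)
print(X.shape, y.shape)  # (3, 7) (3,)
print(X[0], y[0])        # [0. 1. 2. 3. 4. 5. 6.] 7.0
```

Libraries like Keras offer equivalents (e.g. `tf.keras.utils.timeseries_dataset_from_array`), but the explicit loop makes the input/target split unambiguous.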

**Model**

After exploring the data and preprocessing it for the model, it's time to think about the model architecture. To design the architecture, we should know the minimum capabilities of our model, or in other words, the key patterns in the data that our model should be able to learn. In our case, the patterns in a data window are:

Mean of the window

The fire counts are related to the month of the year (seasonality)

**To learn these patterns, we implemented a model that understands the mean, monthly seasonality, and normal sequencing properties of the window.**

**Model With Three Concatenation Layers**

**Deep Learning Neural Network Model Layers**
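Since the exact layer stack is shown only in the figure, the following is a hedged Keras sketch of the idea: three feature branches (the window mean, the sequence order via an LSTM, and a one-hot month input for seasonality) merged with a Concatenate layer. All layer sizes, the LSTM choice, and the month encoding are illustrative assumptions, not the authors' exact architecture.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

WINDOW = 7  # seven past timestamps predict the next one

seq_in   = layers.Input(shape=(WINDOW, 1), name="window")   # raw sequence
month_in = layers.Input(shape=(12,), name="month_onehot")   # seasonality

# Branch 1: the mean of the window
mean_feat = layers.Lambda(lambda t: tf.reduce_mean(t, axis=1))(seq_in)
# Branch 2: normal sequencing properties of the window
seq_feat = layers.LSTM(16)(seq_in)

# Merge all three signals and regress the next day's count
merged = layers.Concatenate()([mean_feat, seq_feat, month_in])
hidden = layers.Dense(16, activation="relu")(merged)
output = layers.Dense(1, name="next_count")(hidden)

model = Model([seq_in, month_in], output)
model.compile(optimizer="adam", loss="mse")  # Adam + MSE, as in Training/Loss below
```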

**Training**

The model is trained with the **Adam optimizer** because Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. This algorithm leverages adaptive learning-rate methods to find an individual learning rate for each parameter.

**Loss**

This is a regression problem, so we use the MSE loss function.

**Plot of Training and Validation Loss**

The model was able to learn the data pattern for this region. More accuracy can be achieved by factoring in ground data from the __Forest Survey of India__ and using __ISRO__ remote-sensing technologies. We will look into this in the next article.

Thank you for reading, we hope you find our article interesting!

**To integrate predicted and past forest fire data into your business, contact us here:** __contact@godatainsights.com__

**Author**-
