**Predicting Crime in Boston with Weather Data**

The Boston Police Department is leading the effort to combat crime with data science, maintaining a database of every recorded crime since 2014.

This project explores avenues for predicting daily crime rates using temporal and weather data provided by the National Centers for Environmental Information (NCEI).

We use linear regression and decision tree models to predict the number of crimes on a given day based solely on temporal and weather variables.

We discovered that month, temperature, precipitation, and day of the week are the most powerful variables for predicting daily crime rates in Boston.

Because our models rely only on such trackable and predictable variables, we believe our results may give Boston administrators some insight into how to manage their resources more efficiently simply by looking at the date or a weather app. We also hope they inspire others to explore the relationship between crime and weather further.

We have collected and compiled incident reports from the Boston Police Department and weather data from the NCEI between August 2015 and February 2020 into a table containing the number of crimes and the weather of each day.
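As a rough illustration of that compilation step, the sketch below counts incidents per day and joins the counts to daily weather records with pandas. The column names (`OCCURRED_ON_DATE`, `DATE`, `TAVG`, `PRCP`), the helper `build_daily_table`, and the tiny inline data are assumptions made for the example, not the project's actual schema or code.

```python
import pandas as pd

# Hypothetical incident reports: one row per recorded crime.
incidents = pd.DataFrame({
    "OCCURRED_ON_DATE": pd.to_datetime([
        "2015-08-01 02:15", "2015-08-01 14:40", "2015-08-02 09:05",
        "2015-08-03 23:50", "2015-08-03 11:20", "2015-08-03 18:30",
    ]),
})

# Hypothetical daily weather records: one row per calendar day.
weather = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-08-01", "2015-08-02", "2015-08-03"]),
    "TAVG": [75.0, 78.0, 81.0],  # average temperature (assumed column)
    "PRCP": [0.00, 0.12, 0.00],  # precipitation (assumed column)
})


def build_daily_table(incidents, weather):
    """Count incidents per day and attach that day's weather and calendar fields."""
    counts = (
        incidents.assign(DATE=incidents["OCCURRED_ON_DATE"].dt.normalize())
        .groupby("DATE")
        .size()
        .rename("n_crimes")
        .reset_index()
    )
    # Left join keeps every weather day, even one with no recorded incidents.
    daily = weather.merge(counts, on="DATE", how="left")
    daily["n_crimes"] = daily["n_crimes"].fillna(0).astype(int)
    daily["month"] = daily["DATE"].dt.month
    daily["day_of_week"] = daily["DATE"].dt.day_name()
    return daily


daily = build_daily_table(incidents, weather)
print(daily)
```

Each row of `daily` then holds one day's crime count next to its weather and calendar variables, which is the shape the models below expect.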

We explored a number of supervised learning techniques to achieve our goal of predicting the number of crimes on a given day from the weather data.

For each approach, we analyzed the model's accuracy (MSE, MAE, and adjusted R-squared, among other measures) and its interpretability.

We started with linear regression models and then improved upon them with more complex decision tree models, including Random Forest and Boosting.

Finally, we applied best subset selection to our cleaned data set. Overall, the best subset selection model appeared to be the best of all the models, given its high test accuracy and interpretability.
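For reference, the three named error measures can be computed as in the minimal sketch below. The function names and the five-point example are illustrative assumptions. Adjusted R-squared uses the standard correction `1 - (1 - R²)(n - 1)/(n - p - 1)` for `n` observations and `p` predictors.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def adjusted_r2(y_true, y_pred, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Made-up daily counts and predictions, only to exercise the functions.
observed = [3, 5, 7, 9, 11]
predicted = [2, 5, 8, 9, 12]

print(mse(observed, predicted))
print(mae(observed, predicted))
print(adjusted_r2(observed, predicted, n_predictors=1))
```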

We started at a preliminary level by training a basic linear regression model on the variables that were significant at the 90% confidence level (p-value below 0.1), using a 4:1 training/test split to measure model accuracy. The adjusted R-squared and MSE of this linear model were 0.3792 and 929, respectively.
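The sketch below shows what a 4:1 split and an ordinary least squares fit might look like with NumPy. The data are synthetic stand-ins, the helpers `fit_ols` and `predict_ols` are hypothetical names, and the p-value screening step is left out, so this is an illustration of the workflow rather than a reproduction of the reported numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the daily table: 500 days, 3 predictors (assumed).
n_days = 500
X = rng.normal(size=(n_days, 3))
y = 250 + 12 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=20, size=n_days)

# 4:1 training/test split on shuffled row indices.
idx = rng.permutation(n_days)
n_train = int(0.8 * n_days)
train, test = idx[:n_train], idx[n_train:]


def fit_ols(X, y):
    """Least squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    """Predictions from coefficients returned by fit_ols."""
    return coef[0] + X @ coef[1:]


coef = fit_ols(X[train], y[train])
test_mse = float(np.mean((y[test] - predict_ols(coef, X[test])) ** 2))
print(coef, test_mse)
```

The test MSE is computed only on the held-out fifth of the days, which is what makes it comparable across the models discussed below.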

Random Forest is an appropriate method here because it reduces the correlation among individual trees and therefore tends to overfit less.

Overfitting usually occurs when the model is too complex for the data, which results in an extremely low training error but a high test error.

Among decision tree models, Random Forest de-correlates the trees and reduces variance significantly; it is therefore an improvement over Bagging and tends to overfit less.

We trained a Random Forest model that considers a random sample of 4 of the 8 predictors at each split and picks the best split among them; the value 4 was chosen because it gave the lowest MSE. Compared with the linear model, the MSE decreased to 836, and the MAE decreased to 20.94.

In addition to Random Forest, we also tried Boosting.
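A minimal sketch of that tuning step with scikit-learn follows, assuming synthetic data with 8 predictors. The `max_features` argument controls how many predictors are sampled at each split; the candidate values, tree count, and data are assumptions for illustration. Like the text, the sketch selects on the held-out set; out-of-bag error or a separate validation set would be a cleaner choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 400 days, 8 predictors, only two of which matter.
X = rng.normal(size=(400, 8))
y = 250 + 15 * X[:, 0] + 10 * np.sin(X[:, 1]) + rng.normal(scale=10, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Try several numbers of predictors sampled per split; 8 of 8 is plain Bagging.
results = {}
for m in (2, 4, 6, 8):
    forest = RandomForestRegressor(n_estimators=100, max_features=m, random_state=0)
    forest.fit(X_train, y_train)
    pred = forest.predict(X_test)
    results[m] = (mean_squared_error(y_test, pred), mean_absolute_error(y_test, pred))

best_m = min(results, key=lambda m: results[m][0])
print(best_m, results[best_m])
```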

Boosting algorithms seek to produce a powerful predictor by training a sequence of simple models, each trying to correct its predecessor, with a small learning rate scaling each correction. Our Boosting model used 1000 trees and was fitted at a set of learning rates (lambda) ranging from 2^-1 to 2^-8; the lambda with the lowest test error was chosen.

We then dropped the half of the variables that ranked lowest in the influence plot, and refitted the model on the remaining subset of the original predictors.
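The learning-rate search might look like the sketch below, which uses scikit-learn's `GradientBoostingRegressor` as a stand-in for whatever boosting implementation was actually used. The data are synthetic and the 25% hold-out fraction is an assumption for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 days, 8 predictors.
X = rng.normal(size=(300, 8))
y = 250 + 15 * X[:, 0] + 10 * np.sin(X[:, 1]) + rng.normal(scale=10, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learning rates 2^-1, 2^-2, ..., 2^-8, each paired with 1000 trees.
learning_rates = [2.0 ** -k for k in range(1, 9)]

val_mse = {}
for lr in learning_rates:
    model = GradientBoostingRegressor(
        n_estimators=1000, learning_rate=lr, random_state=0
    )
    model.fit(X_train, y_train)
    val_mse[lr] = mean_squared_error(y_val, model.predict(X_val))

best_lr = min(val_mse, key=val_mse.get)
print(best_lr, val_mse[best_lr])
```

Smaller learning rates need more trees to reach the same fit, which is why the number of trees and the learning rate are usually tuned together.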
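The pruning step can be sketched as follows. Here scikit-learn's impurity-based `feature_importances_` stands in for the relative influence shown in the influence plot, which is an assumption; the predictor names `x0` to `x7`, the hyperparameters, and the data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

names = [f"x{i}" for i in range(8)]  # hypothetical predictor names
X = rng.normal(size=(300, 8))
y = 20 * X[:, 0] + 10 * X[:, 3] + rng.normal(scale=5, size=300)

# Fit on all predictors and rank them by importance.
full = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=0)
full.fit(X, y)
order = np.argsort(full.feature_importances_)[::-1]  # most important first

# Keep the top half and refit on that subset only.
keep = np.sort(order[: len(names) // 2])
kept_names = [names[i] for i in keep]

reduced = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=0)
reduced.fit(X[:, keep], y)
print(kept_names)
```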

This model used a validation set that was 25% of the original data, resulting in an MSE of 588, an MAE of 18.78, and an adjusted R-squared of 0.452. The model accuracy was 93.21%.

Although Boosting produced the lowest test error of the models we had analyzed so far, its low interpretability could limit the model's application.

Therefore, we turned to a presumably more interpretable approach, best subset selection, to predict the number of crimes. This approach fits a model for every possible combination of the predictor variables; we then selected the best model according to the Bayesian information criterion.

The Bayesian information criterion chose the model with 16 variables, which produced an MSE of 617 and an adjusted R-squared of 0.428. The accuracy of the best subset model was 92.6%.
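The exhaustive search behind best subset selection can be sketched as below, assuming an ordinary least squares fit for each subset and the Gaussian form of the criterion, `BIC = n ln(RSS/n) + k ln(n)`, with `k` counting the intercept. The helpers `bic_ols` and `best_subset` and the six-predictor synthetic data are assumptions for illustration.

```python
from itertools import combinations

import numpy as np


def bic_ols(X, y):
    """BIC of a least squares fit with intercept, up to an additive constant."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))
    k = A.shape[1]  # number of fitted coefficients, intercept included
    return n * np.log(rss / n) + k * np.log(n)


def best_subset(X, y):
    """Score every non-empty subset of columns and return the lowest-BIC one."""
    p = X.shape[1]
    scored = [
        (bic_ols(X[:, list(cols)], y), cols)
        for size in range(1, p + 1)
        for cols in combinations(range(p), size)
    ]
    best_bic, best_cols = min(scored)
    return best_cols, best_bic


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only columns 0 and 2 carry signal in this synthetic example.
y = 5 + 4 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=1.0, size=300)

best_cols, best_bic = best_subset(X, y)
print(best_cols, best_bic)
```

With `p` predictors the search fits `2^p - 1` models, so it is only practical when the number of candidate predictors is small.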

Best subset selection usually has high accuracy but can take a long time to compute. In this case, because we have a relatively small number of predictors, it appears to be a great choice. Overall, this model had relatively high accuracy and sufficiently high interpretability.

In summary, the best subset selection model using 16 predictors proved to be the superior model because it produced a relatively low test error, showed no sign of overfitting, and is highly interpretable.

Although we did get a lower MSE using Boosting, that model lacks interpretability because of the complex interactions between the independent variables.

Our findings surprised and amazed us; we can use only weather and time to explain just under half of the variance of crime, and predict it with about 93% accuracy.

We hope that more powerful and specific models will be developed in the future and will be paired with police department policies to mitigate crime more effectively.

**Written by Zoe Wang**

**Edited by Alexander Fleiss, Paul Marrinan, Harold Moss, Zifan Wang, Shaw Rhinelander, Kevin Ma & Glen Oh**