Catalina Villouta, Haoran Liao, Vincent Lieng
The COVID-19 pandemic has created the largest public health crisis in decades. Since the outbreak, there has been tremendous interest in forecasting confirmed cases and death tolls, and in predicting the course of the pandemic, so as to better inform public health policy. In this project, we use a publicly available data repository of U.S. COVID-19 statistics from 2020 and 2021 to build a model that forecasts the death toll in each state over the following week. We engineered and selected time-lagged features, including features that capture how the virus spreads geographically between states, and experimented with several models. In particular, we built a Ridge regression model that achieves a cross-validation R squared of 94% and offers an informative interpretation of how the various features contribute to the forecast. We hope that our model can assist in predicting the course of the pandemic.
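As a rough illustration of the approach summarized above, the following minimal sketch builds time-lagged features from a single weekly series, fits a ridge regression in closed form, and scores it with cross-validated R squared. It is not the project's actual pipeline: the helper names (`make_lagged`, `ridge_fit`, `cv_r2`), the number of lags, the penalty `alpha`, and the contiguous-block fold scheme are all assumptions made for illustration, and the cross-state spread features are omitted.

```python
import numpy as np


def make_lagged(series, n_lags):
    """Build a lagged design matrix from a 1-D series.

    Row i of X holds the n_lags values preceding target y[i],
    most recent first (column 0 is lag 1, column 1 is lag 2, ...).
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    columns = [series[n_lags - k - 1 : n - k - 1] for k in range(n_lags)]
    X = np.column_stack(columns)
    y = series[n_lags:]
    return X, y


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering keeps the intercept out of the penalty
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def cv_r2(X, y, alpha, n_folds=5):
    """Mean R squared over contiguous held-out blocks (an assumed fold scheme)."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        coef, intercept = ridge_fit(X[train_mask], y[train_mask], alpha)
        pred = X[test_idx] @ coef + intercept
        ss_res = np.sum((y[test_idx] - pred) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))


if __name__ == "__main__":
    # Synthetic weekly counts for one hypothetical state (not real data).
    rng = np.random.default_rng(0)
    weekly_deaths = np.cumsum(rng.poisson(20, size=60)).astype(float)
    X, y = make_lagged(weekly_deaths, n_lags=3)
    print("cross-validated R^2:", cv_r2(X, y, alpha=1.0))
```

The fitted coefficients can be read as the weight each lag contributes to next week's forecast, which is the kind of interpretability a linear model such as ridge regression allows. For real time series, a forward-chaining split that only trains on the past would avoid leaking future information into training folds.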