An analytical system to understand and visualize the patterns of COVID-19 effect and spread across different counties of the United States
This project examines COVID-19 data for the United States to identify possible trends in the virus's spread. It analyzes the number of cases, the number of deaths, and the population of each county. This data is combined with several enrichment datasets covering hospital beds, presidential election results, employment, economic characteristics, and demographic information.
Combining the COVID-19 dataset with the enrichment datasets helps reveal patterns in how the number of cases and deaths changes and how those changes correlate with different factors. Using this data, linear and non-linear regression models are developed to predict the number of cases and deaths due to COVID-19 in the United States. The data is processed using statistical models and presented as graphs with a trend line, confidence intervals, and a prediction path. Finally, a simple interactive dashboard is built on top of the analysis, where users can visualize the current trend, predictions, moving averages, and more.
This is the first stage of the project, where we get acquainted with the COVID-19 dataset. The data is provided by USAFacts; we use its daily county-level tracker of COVID-19 cases in the US. You can use the links below to download the granular data from USAFacts.
This is the second stage of the project, where we dig deeper into data modeling and hypothesis testing. Building on the preliminary intuitions from stage 1, we develop a formal hypothesis and use statistical modeling to support or reject it. We compare weekly statistics using the mean, median, and mode of our three main variables and plot the daily trends in a meaningful way. We also look for correlations between different features. Additionally, we compare the United States data against other countries using the World dataset.
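The weekly comparison described above can be sketched as follows. This is only an illustration using the standard library: the function name `weekly_stats`, the grouping by ISO week, and the sample counts are assumptions, not the project's actual notebook code or data.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta
from statistics import mean, median


def weekly_stats(daily_counts):
    """Group (date, count) pairs by ISO week and summarize each week."""
    weeks = defaultdict(list)
    for day, count in daily_counts:
        iso = day.isocalendar()
        weeks[(iso[0], iso[1])].append(count)  # key: (ISO year, ISO week)
    return {
        week: {
            "mean": mean(values),
            "median": median(values),
            # Most frequent value; ties resolve to the first one seen.
            "mode": Counter(values).most_common(1)[0][0],
        }
        for week, values in sorted(weeks.items())
    }


# Hypothetical daily new-case counts for one county (two full ISO weeks).
start = date(2020, 6, 1)  # a Monday
counts = [10, 12, 12, 15, 20, 18, 11, 25, 25, 30, 28, 25, 40, 37]
daily = [(start + timedelta(days=i), c) for i, c in enumerate(counts)]

stats = weekly_stats(daily)
for week, summary in stats.items():
    print(week, summary)
```

Using `Counter` for the mode keeps the sketch compatible with Python 3.7, where `statistics.mode` raises an error when several values are equally common.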
In this stage, we develop linear and non-linear regression models for predicting cases and deaths in the United States. Machine learning and statistical models are used to predict the trend of COVID-19 cases and deaths. We also plot the trend line and forecast one week ahead. Confidence intervals are introduced to analyze the error in the predictions. Finally, we test the hypothesis formulated in stage 2 of the project.
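As a rough sketch of this kind of modeling, the example below fits a straight line and a quadratic to a daily series with ordinary least squares, forecasts seven days ahead, and attaches an approximate 95% prediction interval. Several things here are assumptions made for illustration: the function name `fit_and_forecast`, the use of polynomial regression as a stand-in for the project's non-linear model, the normal quantile 1.96 (instead of a t quantile), and the synthetic data.

```python
import numpy as np


def fit_and_forecast(y, degree=1, horizon=7, z=1.96):
    """Fit a polynomial trend to y and forecast `horizon` steps ahead.

    Returns (prediction, lower, upper), where the bounds form an
    approximate prediction interval based on the residual variance.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(n, dtype=float)
    X = np.vander(t, degree + 1)  # columns: t^degree, ..., t, 1
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    resid = y - X @ coef
    dof = n - (degree + 1)  # residual degrees of freedom
    sigma2 = resid @ resid / dof

    t_new = np.arange(n, n + horizon, dtype=float)
    X_new = np.vander(t_new, degree + 1)
    pred = X_new @ coef

    # Standard error of a new observation: sigma^2 * (1 + x0' (X'X)^-1 x0)
    xtx_inv = np.linalg.inv(X.T @ X)
    leverage = np.einsum("ij,jk,ik->i", X_new, xtx_inv, X_new)
    se = np.sqrt(sigma2 * (1.0 + leverage))
    return pred, pred - z * se, pred + z * se


# Synthetic daily case counts with a curved trend (not real data).
rng = np.random.default_rng(0)
days = np.arange(30)
cases = 50 + 4.0 * days + 0.3 * days**2 + rng.normal(0, 5, size=30)

lin_pred, lin_lo, lin_hi = fit_and_forecast(cases, degree=1)
quad_pred, quad_lo, quad_hi = fit_and_forecast(cases, degree=2)
print("linear forecast:   ", np.round(lin_pred, 1))
print("quadratic forecast:", np.round(quad_pred, 1))
```

The interval widens as the forecast moves further from the observed days, because the leverage term grows with distance from the fitted range.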
In the fourth and final stage of the project, we develop an interactive dashboard using Plotly and Dash. Users can filter the data by date range, by state, and by linear or log normalization, and the resulting analysis is presented in a graph. The graph is interactive and easy to interpret.
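The filtering behind such a dashboard might look like the sketch below. It is written as a plain function so it runs without Dash installed; a Dash callback could call something similar. The function name `filter_records`, the record fields, and the choice to implement log normalization as a base-10 logarithm are all assumptions for illustration, not the dashboard's actual code.

```python
import math
from datetime import date


def filter_records(records, states=None, start=None, end=None, scale="linear"):
    """Filter records by state and date range, then apply the chosen scale."""
    selected = []
    for rec in records:
        if states and rec["state"] not in states:
            continue
        if start and rec["date"] < start:
            continue
        if end and rec["date"] > end:
            continue
        value = rec["cases"]
        if scale == "log":
            if value <= 0:
                continue  # log10 is undefined for zero or negative counts
            value = math.log10(value)
        selected.append({**rec, "value": value})
    return selected


# Hypothetical state-level records (not real data).
records = [
    {"state": "NC", "date": date(2020, 4, 1), "cases": 100},
    {"state": "NC", "date": date(2020, 4, 2), "cases": 1000},
    {"state": "SC", "date": date(2020, 4, 1), "cases": 0},
    {"state": "SC", "date": date(2020, 4, 2), "cases": 10},
]

selected = filter_records(
    records, states={"NC"}, start=date(2020, 4, 2), scale="log"
)
print(selected)
```

Keeping the filter separate from the plotting code makes it easy to test the selection logic on its own before wiring it to dropdowns and date pickers.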
This document provides further details about the dashboard, including several snapshots with a short description of what each one represents.
- Python: 3.7
- Jupyter Notebook: 6.2.0
To run this project locally, make sure you have Python, pip, and Jupyter Notebook installed. You will also need some additional Python libraries. To install them all, run the following from the project's parent directory:
pip install -r requirements.txt
To open Jupyter Notebook, run the following from the project's parent directory:
jupyter notebook
Stage I: Complete
Stage II: Complete
Stage III: Complete
Stage IV: Complete