A spatio-temporal modeling of COVID-19 spread in the UK between Feb 2020 and April 2021

andreanasuto/covid19-spatio-temporal-modeling

Spatio-temporal COVID-19 Modeling in the UK

A spatio-temporal modeling of COVID-19 spread in the UK between Feb 2020 and April 2021. The models use three predictors: the percentage of the population working from home, the share of households living in crowded housing conditions and the percentage of residents who belong to a black or minority ethnic (BME) group. A total of six models are produced using linear regression, geographically weighted regression (GWR) and basis functions. A series of visualizations displays how the virus spread spatially and temporally in the UK and how the models perform across space, including whether their estimates are statistically significant.

Introduction

On March 11th 2020, the outbreak caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was declared a global pandemic by the WHO (WHO, 2020). In the UK, the number of recorded cases, whether because infections were undetected or absent, stayed low until mid-March. More than 14 months later, 4,433,000 British people had tested positive for COVID-19 and 127,603 had lost their lives (last update: 9 May 2021, NYT/Public Health Office). This study designs models across different technical and theoretical approaches to analyze, discuss and potentially predict COVID-19 infections in the UK. It focuses on spatial non-stationarity, i.e. the fact that the relationship between a dependent variable and a set of predictors may vary across geographical space. The dataset aggregates COVID-19 data compiled from public sources, containing positive test results, census data and IMD scores across Upper Tier Local Authorities (UTLAs) between 31/01/2020 and 05/02/2021.

Part I

Methods: The cumulative cases and the infection rate per 100,000 people are calculated over the full time range available. Three predictors are selected: the share of the population working remotely from home, the households living in crowded housing and the residents who belong to a black or minority ethnic (BME) group. A new dataframe is created with the selected variables expressed as proportions for each area. The percentage of people working from home might be underestimated, since the total residents' count includes those who are not of working age. A first multi-level model is built with only the intercept varying across regions. In addition to the varying intercept, the second multi-level model adds the three predictors. The variables are chosen based on the existing literature supporting the role of remote working (Robinson et al., 2020), crowded housing conditions (Tinson and Clair, 2020) and the presence of black or minority ethnic groups (Green et al., 2021) as key determinants in the spread of COVID-19. Lastly, a Variance Partition Coefficient (VPC) analysis is conducted to compare the two models.
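The rate and proportion calculations described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names, the record layout and all numbers are hypothetical, and using total households as the denominator for crowded housing is an assumption.

```python
def rate_per_100k(cumulative_cases, population):
    """Cumulative infection rate per 100,000 residents."""
    return cumulative_cases * 100_000 / population


def proportion(count, total):
    """Share of `count` in `total`, as a value between 0 and 1."""
    return count / total


# Hypothetical UTLA record; every value below is made up for illustration.
area = {
    "cases": 12_000,
    "residents": 250_000,
    "households": 100_000,
    "work_from_home": 10_000,
    "crowded_households": 5_000,
    "bme_residents": 30_000,
}

features = {
    "rate": rate_per_100k(area["cases"], area["residents"]),
    # Denominator is all residents, which understates the share among workers.
    "wfh": proportion(area["work_from_home"], area["residents"]),
    # Assumption: crowded housing is measured against total households.
    "crowded": proportion(area["crowded_households"], area["households"]),
    "bme": proportion(area["bme_residents"], area["residents"]),
}
print(features)
```

Dividing the home-working count by all residents, rather than by residents of working age, is what produces the underestimate noted in the text.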

Results and interpretation: In model one, only the intercept varies by region. The model summary shows the estimated standard deviations (SD) of both the regional intercept and the residuals. According to this model, the regional averages of the infection rate deviate from the overall average by 1,164, while the unexplained variability within regions equals 1,598. In model one the fixed effect (FE) for the intercept simply represents the average value of the cases across regions (6436.4). In model two (Table Two) the estimated SD is smaller for both the intercept and the residual compared to model one, i.e. less variation in the number of cases across UK regions is left unexplained. In multi-level modeling, the FEs are the standard linear regression coefficients and are interpreted in the usual way. A 1% increase in the percentage of people working from home and in those living in crowded housing decreases the infection rate by 111,984.65 and 11,431.52 respectively, while the same increase in the BME percentage increases the rate by 5,685.29. As shown in previous research, BME groups tend to be disproportionately represented among the so-called key workers, i.e. workers providing essential services during the lockdowns (The Health Foundation, 2020). As a result, they are less able to shield themselves from risky face-to-face contacts. They also have a higher percentage of positive cases out of their total COVID-19 tested population compared to other ethnic groups, probably because they are tested less (Green et al., 2021). Crowded housing is another key element in the spread of COVID-19 (Tinson and Clair, 2020). This is not confirmed by the model: the coefficient is negative and, while the other variables are statistically significant (p-value < 0.001), the crowded housing feature is not (p-value = 0.24). Working from home is a key strategy to reduce mobility and face-to-face interactions.
Mobility, in general, seems to be one of the most important elements to assess differences within areas and ethnic groups in terms of deaths and COVID-19 infections (Chang et al., 2020). In these two models, the VPC can be read as the correlation between the infection rates (the dependent variable) of two randomly picked areas in the same region. In model one it is 0.35 and in model two it is roughly the same (0.36). This means that in both cases roughly 35% of the variability lies between groups (UK regions) while 65% lies within each region. Model two, with its additional predictors, has a better total pseudo R-squared (0.75) than the intercept-only model one (0.35), and its AIC score is also lower.
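As a quick check of the VPC, the sketch below reads the two model-one dispersion figures quoted above (1,164 for the regional intercept and 1,598 for the residual) as standard deviations and computes the share of total variance that lies between regions. The function name is illustrative only.

```python
def variance_partition_coefficient(sd_between, sd_within):
    """Share of total variance attributable to differences between groups."""
    var_between = sd_between ** 2
    var_within = sd_within ** 2
    return var_between / (var_between + var_within)


# Model one: regional intercept SD and residual SD as reported in the text.
vpc_model_one = variance_partition_coefficient(1164, 1598)
print(round(vpc_model_one, 2))  # 0.35
```

The result matches the 0.35 reported for model one. The same calculation cannot be repeated here for model two, because its dispersion figures are not quoted in the text.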

Part II

Methods: A first model (model three) is built using the same variables as in Part I. The variable whose slope varies by region is the percentage of total residents working from home. This number varies vastly across professions (Bartik et al., 2020) and, as a result, it might also vary by area based on the concentration of professions that allow remote working (Felstead and Reuschke, 2020). A second model (model four) is built allowing both the intercept and the slope of the same variable to vary by region. Lastly, an ANOVA is run to compare the two models.
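One way to write the two specifications down is given below. This is a sketch under assumptions: the notation (area i in region j, the symbols for the random effects and their variances) is introduced here for illustration and is not taken from the repository.

```latex
% Model three: the Work From Home slope varies by region j
y_{ij} = \beta_0 + (\beta_1 + u_{1j})\,\mathrm{WFH}_{ij}
       + \beta_2\,\mathrm{Crowded}_{ij} + \beta_3\,\mathrm{BME}_{ij}
       + \varepsilon_{ij},
\qquad u_{1j} \sim \mathcal{N}(0, \sigma_{u_1}^2),
\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_{\varepsilon}^2)

% Model four: both the intercept and the slope vary by region j
y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\,\mathrm{WFH}_{ij}
       + \beta_2\,\mathrm{Crowded}_{ij} + \beta_3\,\mathrm{BME}_{ij}
       + \varepsilon_{ij},
\qquad
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\sigma_{u_0}^2 & \rho\,\sigma_{u_0}\sigma_{u_1} \\
\rho\,\sigma_{u_0}\sigma_{u_1} & \sigma_{u_1}^2
\end{pmatrix}
\right)
```

In this notation, the correlation between regional intercepts and slopes is the parameter \rho, and the fixed effects are \beta_0 to \beta_3.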

Results and interpretation: Model three allows the slope of the ‘Work From Home’ variable to vary across regions. The standard deviation of the variable’s coefficient across regions is 21,985 (i.e. how much the regional slopes deviate from the national average slope), and the unexplained variability within regions has an estimated standard deviation of 1,037 (i.e. how much areas inside a region deviate, on average, from the regional average). While these results are interesting, the regional estimates are statistically significant in only two regions (Fig. 1). The FEs have similar interpretations and limitations as in model two (Table Two). A 1% increase in people working from home and in residents in crowded housing reduces the number of total cases by 115,571.8 and 7,243.4 respectively, while the same increase in the BME predictor increases the total cases per 100,000 people by 5,743.4. Model four further extends model three by allowing the intercept to vary across regions as well as the slope of the remote-working variable. As in model three (Table Two), the slope varies across regions, here with a standard deviation of 28,727, while the intercept has a standard deviation of 1,336. The unexplained component that varies within regions equals 927. Interestingly, regions with higher intercepts tend to have almost proportionally lower coefficients, since the correlation between the two is -0.94. This might suggest that in areas where cases are consistently higher on average, an increase in the percentage of people working remotely might not be as beneficial as in other regions, since the coefficient will be low in value. The estimated intercept variation is not statistically significant in any region, while the slope variation is statistically significant in four regions out of nine (Fig. 1). The interpretation of the fixed effects is similar to that of the previous models.
Lastly, according to the ANOVA comparison, model four seems a better fit, given its lower AIC score compared to model three.

Part III

Methods: Two geographically weighted regression (GWR) models are built using two different bandwidth approaches. A new standard linear regression model (model five) is fitted using the same variables as in model two. A variance inflation factor (VIF) test is conducted to assess potential multicollinearity. A fixed and an adaptive bandwidth are selected through cross-validation. Based on them, two GWR models are fitted and their R-squared values are calculated across the UTLAs and displayed in two maps, joined by a third map showing the differences in R-squared between the two models (Fig. 2). To assess statistical significance, the t-values are calculated and an arbitrary threshold of 2 is used to classify the predictors’ significance. The results are then displayed in six maps (Fig. 2).
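The core idea of GWR, one weighted regression per location with weights that decay with distance, can be sketched in plain Python. This is an illustration only, under several assumptions: it uses a single predictor and a Gaussian kernel, the function names and data are hypothetical, coordinates are assumed distinct, and it is not the routine used to produce the results in this study.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Kernel weight that decays smoothly with distance."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def adaptive_bandwidth(distances, k):
    """Distance to the k-th nearest neighbour.

    `distances` includes the zero distance from the point to itself,
    so index k skips it.
    """
    return sorted(distances)[k]


def local_fit(x, y, weights):
    """Weighted least squares for y = a + b * x; returns (a, b)."""
    w_sum = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    s_xy = sum(w * (xi - x_bar) * (yi - y_bar)
               for w, xi, yi in zip(weights, x, y))
    s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def gwr(coords, x, y, bandwidth=None, k=None):
    """One (intercept, slope) pair per location.

    Pass `bandwidth` for a fixed kernel or `k` for an adaptive one.
    """
    results = []
    for cx, cy in coords:
        dists = [math.hypot(cx - px, cy - py) for px, py in coords]
        h = bandwidth if bandwidth is not None else adaptive_bandwidth(dists, k)
        weights = [gaussian_weight(d, h) for d in dists]
        results.append(local_fit(x, y, weights))
    return results


# Hypothetical data: five locations, one predictor, one response.
coords = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
wfh = [0.10, 0.20, 0.30, 0.40, 0.50]
rate = [6000, 5800, 5100, 4900, 3000]

print(gwr(coords, wfh, rate, bandwidth=1.0))  # fixed bandwidth
print(gwr(coords, wfh, rate, k=2))            # adaptive bandwidth
```

With a fixed bandwidth every location uses the same kernel width, whereas the adaptive version widens the kernel where neighbours are sparse and narrows it where they are dense.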

Results and interpretation: The standard regression model explains 66% of the total variability of the infection rate (adjusted R-squared) and does not appear to have a major multicollinearity issue among the predictors: all the VIF scores are <10 (Belsley et al., 2015). The fixed bandwidth has a value of approximately 40km while the adaptive is 20m. The quasi-global R-squared is 0.89 for model five (fixed bandwidth) and 0.92 for model six (adaptive bandwidth). The interpretation of the coefficients is similar in the two models. The associations change drastically across areas. In model five (Table One), a 1% increase in the number of remote workers reduces the cases (by between 89,729 and 238,675) but it might also increase them by 5,971 (median = -119,459). A 1% increase in BME residents might increase the cases by 38,868 or decrease them by 4,502 (median = 4,871). Lastly, ‘crowded housing’ has a median value that decreases the total number of cases by 3,755, but it might increase them by between 8,805 and 188,984 depending on the area. Similar patterns and interpretations are seen in model six. However, it is worth noting that, unlike in model five, the ‘work from home’ feature has a consistently negative coefficient across UTLAs. The statistical significance of the predictors varies greatly across predictors and areas (Fig. 2). ‘Work from Home’ is mostly significant in both models, with some exceptions in the Lincolnshire and Cornwall regions for the fixed bandwidth model. The majority of the results for BME are not statistically significant in either model, and for ‘crowded housing’ just a tiny percentage of the UTLAs are significant. As a result, the coefficients need to be interpreted with caution. Switching between the two bandwidths might either improve or reduce the R-squared, depending on the area. For example, North East Lincolnshire and East Yorkshire have a lower R-squared with the adaptive kernel.
However, in the South East, Norfolk and the North East, it is higher than with the fixed bandwidth. Model five’s fixed bandwidth covers roughly 40% of the nearest neighboring UTLAs from a given point, while the adaptive bandwidth covers only 2%. This difference might explain the gap in predictive performance.
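The significance rule used for the maps, a t-value compared against a threshold of 2, can be sketched as below. The function names and the local estimates are hypothetical, and the sketch assumes the rule is applied to the absolute t-value.

```python
def t_value(coefficient, standard_error):
    """t statistic of a local coefficient estimate."""
    return coefficient / standard_error


def is_significant(coefficient, standard_error, threshold=2.0):
    """Flag an estimate whose absolute t-value exceeds the threshold."""
    return abs(t_value(coefficient, standard_error)) > threshold


def share_significant(local_estimates, threshold=2.0):
    """Fraction of (coefficient, standard_error) pairs flagged as significant."""
    flags = [is_significant(c, se, threshold) for c, se in local_estimates]
    return sum(flags) / len(flags)


# Hypothetical local estimates for one predictor across four areas.
estimates = [(-5.0, 2.0), (1.0, 1.0), (3.0, 1.0), (0.5, 1.0)]
print(share_significant(estimates))  # 0.5
```

Mapping the per-area flag, rather than the overall share, gives the kind of significance maps described for Fig. 2.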

Part IV

Methods: The spread of a virus is a chain of linked events placed in time and space: a person is infected by another person over a certain period of time. The serial interval is “the time interval for which the infector and infectee show the symptoms” and it is an essential premise of epidemiological modeling (Rai et al., 2021). The COVID-19 serial interval estimate is slightly higher than 5 days (Rai et al., 2021). As a result, the cases are grouped by this time interval and multiple maps are built to show the variation of the infection rate across time. Once the time step is defined, a spatio-temporal (ST) model is built to study how the infection rate evolves across space and time in the UK. Its results are compared to the GWR models. To visualize the temporal variations, a Hovmöller plot is created with UTLAs larger than 300,000 citizens (Fig. 3).
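Grouping daily cases into 5-day windows and arranging them for a Hovmöller-style plot can be sketched as follows. This is a minimal illustration: the start date is the first day of the dataset as stated in the Introduction, while the function names, the record layout and the case counts are hypothetical.

```python
from collections import defaultdict
from datetime import date

START = date(2020, 1, 31)  # first day of the dataset per the Introduction
WINDOW_DAYS = 5            # window width matching the serial interval


def window_index(day, start=START, width=WINDOW_DAYS):
    """Zero-based index of the 5-day window that contains `day`."""
    return (day - start).days // width


def cases_by_window(daily_cases):
    """Sum (area, date, new_cases) records into {(area, window): total}."""
    totals = defaultdict(int)
    for area, day, new_cases in daily_cases:
        totals[(area, window_index(day))] += new_cases
    return dict(totals)


def hovmoller_matrix(totals, areas):
    """Rows are time windows, columns are areas; missing cells are 0."""
    n_windows = max(w for _, w in totals) + 1
    return [[totals.get((a, w), 0) for a in areas] for w in range(n_windows)]


# Hypothetical daily records for two areas.
records = [
    ("A", date(2020, 1, 31), 1),
    ("A", date(2020, 2, 4), 2),
    ("A", date(2020, 2, 5), 4),
    ("B", date(2020, 2, 6), 3),
]
totals = cases_by_window(records)
print(hovmoller_matrix(totals, ["A", "B"]))  # [[3, 0], [4, 3]]
```

The resulting matrix can be passed to any heatmap routine, with time on one axis and areas on the other.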

Results and interpretation: COVID-19 cases started to emerge in the UK from mid-March. As Fig. 3 shows, the key areas were in major cities such as London (e.g. Tower Hamlets, Redbridge), Manchester and Liverpool. The ‘three waves’ structure is fairly visible, particularly the difference in intensity between the new spread around October 2020 and the one in January 2021 (Fig. 3). In October, the areas of Nottingham, Liverpool and Manchester were hit first and hardest, while the London area was hit in January. Both GWR models have a better R-squared score than the ST model (adjusted R-squared = 0.37). However, while in the previous two models the ‘crowded housing’ and ‘BME’ predictors are largely not statistically significant, in the ST model they are. The coefficients for the percentage of people working from home and living in crowded housing are still negative: a 1% increase decreases the 5-day registered cases per 100,000 people by 319 and 32 respectively. An increase in the percentage of BME residents increases the cases by 12. Looking at the temporal variable (‘by five days’), the model suggests that the cases would slightly increase across a 5-day window (0.17), in line with previous similar studies (Davies et al., 2021). Spatially, as the lens of analysis moves West the cases decrease (and vice versa), while moving South they should increase; however, the latitude predictor is not statistically significant. Overall, only 3 out of 15 predictors are not statistically significant. One limitation compared to the GWR models is that the interpretation of the basis functions is less straightforward, and some of them are not statistically significant.
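To make the role of the basis functions more concrete, the sketch below evaluates a bisquare basis function, one common choice in spatio-temporal regression. The text does not say which basis was used, so the bisquare form, the centres, the radius and the function names are all assumptions made for illustration.

```python
import math


def bisquare(point, centre, radius):
    """Bisquare basis: 1 at the centre, decaying to 0 at `radius` and beyond."""
    d = math.dist(point, centre)
    if d > radius:
        return 0.0
    return (1 - (d / radius) ** 2) ** 2


def basis_row(point, centres, radius):
    """Basis-function covariates for one location, one value per centre."""
    return [bisquare(point, c, radius) for c in centres]


# Hypothetical centres on a small grid; the coordinates are arbitrary.
centres = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(basis_row((0.5, 0.0), centres, radius=1.0))
```

Each basis function becomes one extra column in the regression, which is why its coefficient describes a local bump in the surface rather than a directly interpretable effect.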

Conclusions

The study suggests how COVID-19 infections might change across geographical space. ‘Work from home’ policies seem particularly effective in reducing the spread, while a higher presence of people from BME groups is associated with more cases, probably because they are disproportionately represented in the ‘essential workers’ category (e.g. groceries, public transportation) and as a result more exposed to risk (The Health Foundation, 2020; Robinson et al., 2020). Less clear is the role of overcrowded housing, especially since the existing literature suggests a positive relationship with the infection rate. The models might require further tuning to address the statistical insignificance of this variable. Further research could study how remote working affected infection rates across professions and areas, potentially suggesting areas for public policy improvement.
