Main_LM_Analysis.Rmd

---
title: "Main Linear Model Analysis"
author: "Sho"
date: "11/26/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```


# Loading in Data 

Loading as done by Seokjun. 

```{r}
data <- read.csv("MH_survey_only_higher_index.csv",
    na.strings = "NA")
names(data)
#only relevant for Seokjun's PC <- thanks :D
names(data)[1] <- "gender"

data <- data[, -c(17:22)]


# Assigning factors
data$gender <- factor(data$gender)
data$age_group <- factor(data$age_group)
data$country_lockdown <- factor(data$country_lockdown)
data$marital <- factor(data$marital)
data$smoking <- factor(data$smoking)
data$fivfruitveg <- factor(data$fivfruitveg)
data$shielded <- factor(data$shielded)
data$week_soc_distancing <- factor(data$week_soc_distancing)
data$athlete <- factor(data$athlete)
```


# Step-wise Regression 

Here we conduct the step-wise regression using step. 
Note for Seokjun: `step` actually terminates at a different point if you take out missing data. 
This is an issue. 

The final model is pretty different than what we had before. 

```{r}
data_without_na <- data[complete.cases(data), ]
dim(data_without_na)

model_full <- lm(MHC_SF_OVERALL ~ ., data = data[complete.cases(data), ])
summary(model_full)

step(model_full, direction = "backward")
```


This is the model that results through the `step` function. 

```{r}
model_step9 = lm(formula = MHC_SF_OVERALL ~ gender + age_group + week_soc_distancing + 
    athlete + HADS_OVERALL + RES_TOTAL + LONE_TOTAL, data = data[complete.cases(data), ])
model_step9  %>% summary()
```


# Removing variables with Collinearity Issues: RES & HADS 
Determine which variable to remove (due to collinearity issue: RES or HADS):

```{r}
no_HADS = lm(formula = MHC_SF_OVERALL ~ gender + age_group + week_soc_distancing + 
    athlete+ RES_TOTAL + LONE_TOTAL, data = data[complete.cases(data), ])

no_RES = lm(formula = MHC_SF_OVERALL ~ gender + age_group + week_soc_distancing + 
    athlete + HADS_OVERALL + LONE_TOTAL, data = data[complete.cases(data), ])

paste0("AIC without HADS: ", c(no_HADS %>% extractAIC())[2] %>% round(2))
paste0("AIC without RES: ", c(no_RES %>% extractAIC())[2] %>% round(2))
```

AIC is significantly lower without RES. For the final model, we will remove RES. 


# Final Model 

```{r}
final_model = lm(formula = MHC_SF_OVERALL ~ gender + age_group + week_soc_distancing + 
    athlete + HADS_OVERALL + LONE_TOTAL, data = data[complete.cases(data), ])
final_model %>% summary()
```


# collapse trials
```{r}
as.numeric(data$age)
age_cut <- ifelse(as.numeric(data$age_group) >= 6, 1, 0)
data$age_cut <- factor(age_cut)

soc_dist_cut <- ifelse(as.numeric(data$week_soc_distancing)-1 >= 7, 1, 0) #warning! added 1 to original factor value
data$soc_dist_cut <- factor(soc_dist_cut)

final_model_cut = lm(formula = MHC_SF_OVERALL ~ gender + age_cut + soc_dist_cut +
    athlete + HADS_OVERALL + LONE_TOTAL, data = data[complete.cases(data), ])
summary(final_model_cut)

```


```{r fig.width=8, fig.height=5}
par(mfrow = c(2, 2))
plot(final_model)
```

Residual plots look almost perfect!  We will look at the outliers below. 
Potential homoskedasticity? 


# Testing All Interactions

Here we run the F-Test on all secondary interactions.  It returned a small F statistic and high p-value. 
This is good news!  Less LaTeX work. 

```{r}
model_interaction = lm(formula = MHC_SF_OVERALL ~ (gender + age_group + week_soc_distancing + 
    athlete + HADS_OVERALL + + LONE_TOTAL)^2, data = data[complete.cases(data), ])
anova(final_model, model_interaction)
```


# F-Test for Model Significance + T-test for continuous/two-category variables

This can be done just via the `summary` function. 

```{r}
final_model %>% summary()
anova(final_model) #type 1
# explanation: https://stats.stackexchange.com/questions/20452/how-to-interpret-type-i-type-ii-and-type-iii-anova-and-manova/20455#20455
library(car)
Anova(final_model) #type 2
```


```{r}
w.o.gender = lm(formula = MHC_SF_OVERALL ~ age_group + week_soc_distancing + 
    athlete + HADS_OVERALL + LONE_TOTAL, data = data[complete.cases(data), ])
anova(w.o.gender, final_model)
```


Gender & Athlete are both significant at 0.05. 
HADS, LONE are also significant.  We will now do F-test on `age_group` and `week_soc_distancing`. 

First with age_group 
```{r}
w.o.age_group = lm(formula = MHC_SF_OVERALL ~ gender + week_soc_distancing + 
    athlete + HADS_OVERALL + LONE_TOTAL, data = data[complete.cases(data), ])
anova(w.o.age_group, final_model)
```

Significant at 0.05

Now with `week_soc_distancing`:


```{r}
w.o.week_soc_dist = lm(formula = MHC_SF_OVERALL ~ gender + age_group + 
    athlete + HADS_OVERALL + LONE_TOTAL, data = data[complete.cases(data), ])
anova(final_model, w.o.week_soc_dist)
```


Significant at 0.001!    


# visualize fitting result

```{r fig.height=5, fig.width=8}
library(car)
# ?avPlots
avPlots(final_model, terms=~HADS_OVERALL + LONE_TOTAL)
```


# diagnosis !
# Outlier Analysis

1. see residual
```{r}
par(mfrow = c(2, 2))
plot(final_model)

par(mfrow = c(1, 1))
plot(final_model, which = 1) #delete 364? (index is 321, after deleting NAs)
which.max(final_model$residuals); final_model$residuals[321]
data[364, ]

par(mfrow = c(3, 2))
for(i in (names(final_model$model)[-1])) {
    print(i)
    boxplot(data[, i], xlab = i)
    points(1, data[364, i], pch = 19, col = "red", cex = 2)
}
```


```{r}
par(mfrow = c(1, 1))
# vs fitted
plot(final_model$residuals ~ final_model$fitted.values)
abline(h = 0)
which.max(final_model$residuals)
final_model$residuals[321] #364 before deleting NA, 321 after deleting NA

#vs predictors
n_lab_to_boxplot <- function(factor_var, y=1) {
    lev <- levels(factor_var)
    for(i in seq_along(lev)) {
        text(i, y, paste("n=", sum(factor_var == lev[i]), sep = ""))
    }
}


plot(final_model$residuals ~ data_without_na$gender)
plot(final_model$residuals ~ data_without_na$age_group)
plot(final_model$residuals ~ data_without_na$week_soc_distancing)  # <- hmm. but boxes are similar
n_lab_to_boxplot(data_without_na$week_soc_distancing, y = 25) #we can guess what makes the problem

plot(final_model$residuals ~ data_without_na$athlete)
plot(final_model$residuals ~ data_without_na$HADS_OVERALL)
abline(h = 0)
plot(final_model$residuals ~ data_without_na$LONE_TOTAL)
abline(h = 0)

```


2. see leverage
```{r}
par(mfrow = c(1, 1)); plot(final_model, which = 4)
par(mfrow = c(2, 2)); plot(final_model)

final_model$model[c(296, 341, 405), ] #not these
final_model$model[c(266, 301, 354), ] #but these
#or
data[c(296, 341, 405),]


par(mfrow = c(4,4))
for(i in 1:16){
    boxplot(data[, i], xlab = colnames(data)[i])
    points(1, data[296, i], pch = 19, col = "red", cex = 2)
    points(1, data[341, i], pch = 19, col = "blue", cex = 2)
    points(1, data[405, i], pch = 19, col = "green", cex = 2)
}
```

The outlier point is unusual in terms of weeks social distancing and the maximum LONE score (very lonely).  
No Cook's distance is above 1 (or 0.5) so even the large points (296, 341, and 405) are not concerning. 


Compare this to the overall data: 

```{r}
final_model$model %>% summary()
```

Comment: not obvious to me why these are outliers (but it is hard when there are so many variables).
Seokjun: see boxplots above (they seem ok)


comment (nov 27 3:20)
issues:

1. delete high-residual point? (364th)

2. I think variance problem seems not big. (see above residual plots, verse predictors)

3. visualize fitting result

more partial-regression plot?


(optional)
7. if we want: make prediction example?
    
    Sho: Not yet
    nov 27: later (if our paper seems too short)
    

comment (nov 27 1:30 am): 

1. delete HAD or RES (because of Collinearity) <- we can do this right after 'step' function result

    Sho: Done, compared using AIC. Chose HAD. 

2. Draw diagnosis plot (residual, normal qq, leverage, ... using 'plot(lm_fit_object)', 
    Sho: Done. 
    
    Delete outlier/other annoying points.
    
    Sho: Did not do yet .
    
    Please make a figure to show an outlier if it exists.
    
    Sho: Done?
    
    And, if there are variance problem (heteroskedasticity or ...), stop and rest (let's discuss together)

3. re-fit lm (after deleting things in 1 and 2)
    
    Sho: Did not remove any outliers but removed HADS to create `final_model`. 
    
4. basic test: F test for model significance(first!) -> 
    t test for continuous variable, model comparison F-test (for each categorical variable group)
    
    Sho: Done.
    
5. visualize fitting result (if we can. if it is too high dimention, just make a table). See R^2/AIC/BIC,...
    
    Sho: Tried... 

6. residual analysis (same as 2)
    
    Sho: not yet?

7. if we want: make prediction example?
    
    Sho: Not yet
    nov 27: later (if our paper seems too short)
    
8. Tested all interaction terms (secondary) <---- Added 

   Sho: no terms were significant.  

(I think) By Tuesday, we don't have to finish writing everything, but
just 4,5,6(+7) steps table and figure is enough.

Let's write together. Don't write it all alone! I'll help you

And, please give me an idea to visualize the Lasso fit. 
After making 2 plots, I don't know what to do.