---
title: "Principal Components Regression"
output:
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
set.seed(2)
source("install_packages.r")
require(knitr)
require(plotly)
require(ggplot2)
require(pls)
require(ISLR)
require(psych)
```

## Data

```{r}
Hitters <- Hitters
names(Hitters)
summary(Hitters)
sum(is.na(Hitters$Salary))
Hitters <- na.omit(Hitters)
```

| Variable  | Description                                                                      |
|-----------|----------------------------------------------------------------------------------|
| AtBat     | Number of times at bat in 1986                                                   |
| Hits      | Number of hits in 1986                                                           |
| HmRun     | Number of home runs in 1986                                                      |
| Runs      | Number of runs in 1986                                                           |
| RBI       | Number of runs batted in in 1986                                                 |
| Walks     | Number of walks in 1986                                                          |
| Years     | Number of years in the major leagues                                             |
| CAtBat    | Number of times at bat during his career                                         |
| CHits     | Number of hits during his career                                                 |
| CHmRun    | Number of home runs during his career                                            |
| CRuns     | Number of runs during his career                                                 |
| CRBI      | Number of runs batted in during his career                                       |
| CWalks    | Number of walks during his career                                                |
| League    | A factor with levels A and N indicating player’s league at the end of 1986       |
| Division  | A factor with levels E and W indicating player’s division at the end of 1986     |
| PutOuts   | Number of put outs in 1986                                                       |
| Assists   | Number of assists in 1986                                                        |
| Errors    | Number of errors in 1986                                                         |
| Salary    | 1987 annual salary on opening day in thousands of dollars                        |
| NewLeague | A factor with levels A and N indicating player’s league at the beginning of 1987 |


## OLS Prediction
```{r}
ols.fit <- lm(Salary~., data=Hitters)
summary(ols.fit)
ols.fitted <- predict(ols.fit)
ols.fitted
MSE.ols <- mean((ols.fitted - Hitters$Salary)^2)
MSE.ols
```

## Principal Component Analysis

```{r}
mat <- Hitters[,-19] #exclude salary
unique(mat$Division) #E W
unique(mat$NewLeague) #A N
unique(mat$League) #A N
#encoding factors 
mat$Division <- as.numeric(mat$Division=="E")
mat$NewLeague <- as.numeric(mat$NewLeague=="A")
mat$League <- as.numeric(mat$League=="A")
mat <- as.matrix(mat)
dim(mat)
```

Kaiser-Meyer-Olkin (KMO) test of sampling adequacy.
Rule of thumb: KMO values in the .90s are marvelous, in the .80s meritorious, in the .70s middling, in the .60s mediocre, in the .50s miserable, and below .5 unacceptable.
```{r}
KMO(mat)
```
The overall KMO value is 0.71, which falls in the "middling" range.
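
To make the KMO statistic less of a black box, the following is a minimal base-R sketch of the overall measure: the sum of squared correlations divided by the sum of squared correlations plus squared partial correlations. The helper name `kmo_overall` is hypothetical, and the built-in `mtcars` data stands in for `mat`. It is meant as an illustration of the formula, not as a description of how `psych::KMO()` is implemented.

```r
# Hypothetical helper: overall KMO from numeric data, base R only.
kmo_overall <- function(x) {
  r <- cor(x)                        # correlation matrix
  r_inv <- solve(r)                  # inverse (errors if r is singular)
  # partial correlations of each pair given all remaining variables
  d <- diag(1 / sqrt(diag(r_inv)))
  q <- -d %*% r_inv %*% d
  off <- row(r) != col(r)            # off-diagonal mask
  sum(r[off]^2) / (sum(r[off]^2) + sum(q[off]^2))
}

# mtcars is only a stand-in data set (assumption, not the Hitters data)
kmo_value <- kmo_overall(mtcars)
kmo_value
```

Small partial correlations relative to the raw correlations push the value towards 1, which is what makes the data suitable for a component analysis.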

```{r}
pca <- princomp(mat,cor=T)
screeplot(pca, type="lines")
summary(pca)
pca$loadings[,1:3]
```
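
As a cross-check of what `princomp(..., cor = TRUE)` computes, the sketch below compares its component variances with the eigenvalues of the correlation matrix. It uses the built-in `mtcars` data as a stand-in (an assumption, so the chunk runs without `ISLR`); all variable names are illustrative.

```r
# Stand-in data (assumption): any numeric matrix with more rows than columns
x <- as.matrix(mtcars)
pc <- princomp(x, cor = TRUE)
eig <- eigen(cor(x))

# Component variances equal the eigenvalues of the correlation matrix
var_princomp <- unname(pc$sdev^2)
var_eigen <- eig$values
all.equal(var_princomp, var_eigen)

# Cumulative share of total variance, as reported by summary(pc)
cum_share <- cumsum(var_eigen) / sum(var_eigen)
round(cum_share[1:3], 3)
```

Because the correlation matrix has ones on its diagonal, the eigenvalues sum to the number of variables, so each eigenvalue divided by that number is the share of variance explained.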

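Rotation for better interpretation of the components, using PCA from the `psych` package.
`covar = FALSE` makes `principal()` work on the correlation matrix instead of the covariance matrix, which is needed here because the columns of `mat` are not scaled.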

```{r}
scree(cor(mat))
pca2 <- principal(mat, nfactors=3, covar=F, rotate="varimax")
pca2
```
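
The effect of a varimax rotation can be sketched with base R's `stats::varimax()`. The example below is an illustration under assumptions: it uses `mtcars` instead of `mat`, keeps three components, and is not claimed to reproduce the internals of `psych::principal()`. It shows that the rotation is orthogonal and therefore leaves each variable's communality unchanged; only the split across components changes.

```r
# Stand-in data (assumption): standardized mtcars instead of the Hitters matrix
x <- scale(as.matrix(mtcars))
pc <- prcomp(x)
k <- 3

# Loadings scaled by the component standard deviations
# (correlations between variables and components)
raw_load <- pc$rotation[, 1:k] %*% diag(pc$sdev[1:k])

rot <- varimax(raw_load)
rot_load <- unclass(rot$loadings)

# Communalities (row sums of squared loadings) before and after rotation
comm_raw <- rowSums(raw_load^2)
comm_rot <- rowSums(rot_load^2)
all.equal(unname(comm_raw), unname(comm_rot))
```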

Interpretation:

- RC1: seniority of the player
- RC2: player's performance in 1986
- RC3:


```{r}
gdta <- data.frame(Salary=Hitters$Salary,RC1=pca2$scores[,1],RC2=pca2$scores[,2])
gdta$Name <- row.names(Hitters)
ggplotly(
  ggplot(gdta)+
    geom_point(aes(RC1,RC2,color=Salary, name=Name))+
    scale_colour_gradient(low="yellow",high="red")+
    xlab("RC1")+
    ylab("RC2"),
  tooltip = c("Name", "Salary")
  )

```

```{r}
Hitters["-Mike Schmidt",]
Hitters["-Don Mattingly",]
```


```{r}
summary(lm(Hitters$Salary~pca2$scores))
```


## Principal Component Regression

```{r}
# help(package=pls)  # interactive use only: package index
# ?pcr               # interactive use only: help page for pcr()
pcr.fit <- pcr(Salary~., data=Hitters,scale=TRUE,validation="CV")
summary(pcr.fit)
validationplot(pcr.fit, val.type="RMSEP")
```

```{r}
pcr.fit.5 <- pcr(Salary~., data=Hitters, scale=TRUE, ncomp=5)
summary(pcr.fit.5)
```

```{r}
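# ncomp=5 predicts from components 1 to 5; comps=5 would use the 5th component alone
pcr.fitted <- predict(pcr.fit.5, ncomp=5)
#pcr.fitted
# drop() turns the n x 1 x 1 prediction array into a plain vector
MSE.pcr <- mean((drop(pcr.fitted)-Hitters$Salary)^2)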
```

```{r}
MSE.ols
MSE.pcr
```

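On fitted values (in-sample, not outside the training sample), the `lm()` regression of Salary on all 19 regressors outperforms the 5-component PCR in terms of MSE. This is expected: the 5-component PCR fit is a restricted version of the full OLS fit, so its in-sample MSE cannot be lower.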
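
The in-sample comparison can be reproduced by hand, which also shows what PCR does: regress the response on the first few principal component scores of the standardized predictors. The sketch below uses `mtcars` with `mpg` as the response and `k = 3` components. These choices are assumptions made so the example runs with base R only.

```r
# Stand-in data (assumption): mpg as response, the other 10 columns as predictors
y <- mtcars$mpg
x <- scale(as.matrix(mtcars[, -1]))    # standardized predictors
scores <- prcomp(x)$x                   # principal component scores

k <- 3
fit_ols  <- lm(y ~ x)                   # all 10 predictors
fit_pcr  <- lm(y ~ scores[, 1:k])       # first k components only
fit_full <- lm(y ~ scores)              # all components

mse_ols  <- mean(residuals(fit_ols)^2)
mse_pcr  <- mean(residuals(fit_pcr)^2)
mse_full <- mean(residuals(fit_full)^2)
c(OLS = mse_ols, PCR_k = mse_pcr, PCR_all = mse_full)
```

With all components retained, PCR spans the same space as the original predictors and reproduces the OLS fit. With fewer components the in-sample MSE can only stay the same or increase.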


To assess true out-of-sample prediction performance, we split the `Hitters` data frame into a train sample (used for model estimation) and a test sample (used to calculate and compare Salary predictions).

```{r}
set.seed(1)
train <- sample(c(TRUE,FALSE), nrow(Hitters),rep=TRUE)
test <- (!train)
```

```{r}
ols.fit2 <- lm(Salary~., data=Hitters, subset=train)
summary(ols.fit2)
ols.fitted2 <- predict(ols.fit2, newdata=Hitters[test==T,])
#ols.fitted2
(MSE.ols2 <- mean((ols.fitted2 - Hitters$Salary[test==T])^2))
```


```{r}
set.seed(1)
pcr.fit2 <- pcr(Salary~., data=Hitters, subset=train, scale=TRUE, validation="CV")
validationplot(pcr.fit2,val.type="RMSEP") # choose ncomp = 5
# ?predict.mvr  # interactive use only: help page for predict() on pcr fits
pcr.pred2 <- predict(pcr.fit2,Hitters[test==T, ], ncomp=5, type="response")
MSE.pcr2 <- mean((pcr.pred2 - Hitters$Salary[test==T])^2)
```

```{r}
MSE.ols2
MSE.pcr2
```


As far as true predictions (test sample predictions) are concerned, the 5-component PCR outperforms OLS in terms of test MSE.

```{r}
BIC(ols.fit2)
try(BIC(pcr.fit2))
```


BIC is not applicable to PCR objects, as they carry no usable information on model "complexity". OLS and PCR models therefore cannot be compared using information criteria.

### Note 2
The train sample / test sample split used above (a single random split drawn with `sample()`) is arbitrary and potentially not representative.
k-fold cross-validation may be used instead.
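
A k-fold comparison of OLS and PCR could look roughly like the sketch below. It is written in base R on the `mtcars` data, with `n_folds = 5` and `k = 3` components. All of these are illustrative assumptions rather than choices made in this document. Centering, scaling and the component rotation are estimated on the training folds only, so the held-out fold does not leak into the model.

```r
set.seed(1)
y <- mtcars$mpg                          # stand-in response (assumption)
x <- as.matrix(mtcars[, -1])             # stand-in predictors (assumption)
n <- nrow(x)
n_folds <- 5
k <- 3                                   # number of components (assumption)
fold <- sample(rep(1:n_folds, length.out = n))

sq_err <- matrix(NA_real_, n, 2, dimnames = list(NULL, c("OLS", "PCR")))
for (f in 1:n_folds) {
  tr <- fold != f
  te <- !tr

  # OLS fitted on the training folds, evaluated on the held-out fold
  d_tr <- data.frame(y = y[tr], x[tr, , drop = FALSE])
  d_te <- data.frame(x[te, , drop = FALSE])
  ols <- lm(y ~ ., data = d_tr)
  sq_err[te, "OLS"] <- (y[te] - predict(ols, newdata = d_te))^2

  # PCR: scaling and rotation come from the training folds only
  pc <- prcomp(x[tr, ], center = TRUE, scale. = TRUE)
  z_tr <- data.frame(y = y[tr], pc$x[, 1:k])
  z_te <- data.frame(predict(pc, newdata = x[te, , drop = FALSE])[, 1:k, drop = FALSE])
  pcr_lm <- lm(y ~ ., data = z_tr)
  sq_err[te, "PCR"] <- (y[te] - predict(pcr_lm, newdata = z_te))^2
}

# Cross-validated MSE: every observation is predicted exactly once
cv_mse <- colMeans(sq_err)
cv_mse
```

Averaging over folds makes the comparison less dependent on one particular random split than the single train/test division.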


 See ISLR, Chapters 6.3 and 10.2 for a general discussion of PCR and PCA.