diff --git a/RworkshopI.Rmd b/RworkshopI.Rmd index e689ea3..b79942f 100644 --- a/RworkshopI.Rmd +++ b/RworkshopI.Rmd @@ -3,31 +3,24 @@ title: "Hello, R!" author: "Yue Hu's R Workshop Series I" output: ioslides_presentation: - incremental: yes + incremental: true logo: image/logo.gif self_contained: yes slidy_presentation: null transition: faster widescreen: yes --- - - # Preface ## What are covered + +*Fall* + * A overview of R * Data manipulation (input/output, row/column selections, etc.) -* Descriptive and binary hypotheses (summary, correlation, t-test, etc.) -* Multiple regression (OLS, GLS, MLM, etc.) -* Presentation (table, graph) -* Version control (if we have time) +* Quantitative Analysis +* Basic Data Visualization + # An Overview of R @@ -40,46 +33,47 @@ You can use R to: * Create presentation slides in pdf (as LaTex beamer) or html (as Markdown) * Create webpages * Write academic articles and save them in html, pdf, or word. -* Write a book (see e.g., [bookdown](https://bookdown.org/home/).) +* Write an academic artical or book (see e.g., [bookdown](https://bookdown.org/home/).) + +## Why R rather than the others?? -## Why R rather than the others?? {.columns-2 .build} +
* It's free!! * It's developing! + R is very compatible with new techniques + e.g., Network analysis, spatial analysis with GIS, and text analysis with big data. -* It's multi-lingual! +* It's multi-lingual!: `"Hello 你好 здравствуйте"` -```{r eval = F} -"Hello 你好 안녕하세요 здравствуйте" -``` -

+
+
* It's popular! + [Magoulas & King, 2014, *Data Science Salary Survey*.](http://www.oreilly.com/data/free/stratasurvey.csp) +
## A Trade-Off of the Great Power
- +
Source: NYU Data Services. ## Software installations -* Software installation (Tip) - + [![R](https://www.r-project.org/Rlogo.png)](https://www.r-project.org/) - + [![Rstudio](https://www.rstudio.com/wp-content/uploads/2014/03/blue-250.png)](https://www.rstudio.com/products/rstudio/download/preview/) +* Software installation (Tip) + + [![R](image/Rlogo.png)](https://www.r-project.org/) + + [![Rstudio](image/rStudioLogo.png)](https://www.rstudio.com/products/rstudio/download/preview/)
Install R before Rstudio; so does in updating. Using the [Rstudio preview](https://www.rstudio.com/products/rstudio/download/preview/).
-## Package installation and loading -* Packages are "Apps" for R. +## Package installation and loading{.build} +* Packages are "Apps" for R. + `install.packages()` + `install_github("")` * Find instructions from package repositories: @@ -96,12 +90,17 @@ Using the [Rstudio preview](https://www.rstudio.com/products/rstudio/download/pr # Math and Basic Statistics with R ## Set where to locate the data and store the results -* Always check or set the working directory first +
+* Always check or set the working directory first + `getwd()` + `setwd("E:/R workshop/rworkshop")` +
+ +
* Or click, click, and click in Rstudio - + - + + +
+ ## Terms of R in plain English {.columns-2} * **Object**: packing things together and naming it @@ -118,10 +117,10 @@ Using the [Rstudio preview](https://www.rstudio.com/products/rstudio/download/pr

* **Array**: a multi-dimension matrix - + + + + one-dimension array == vector + two-dimension array == matrix -* **Function**: a process to handle the object +* **Function**: a process to handle the object (`functionName(varLabel = varData)`) ## Do math with R: Basic Functions @@ -136,8 +135,8 @@ x^2;sqrt(x);log(x);exp(x) # matrix algebra z <- matrix(1:4, ncol = 2) z + z - z -z %*% z # inner mulTiplication -z %o% z # outter mulTiplication +z %*% z # inner mulTiplication +z %o% z # outter mulTiplication # logical evaluation x == z; x != Z @@ -146,16 +145,17 @@ x > z; x <= z ``` -## Commen Data Type: Vector{.smaller} +## Common Data Type: Vector{.smaller} ```{r} 1:10 # numeric (integer/double) c("R", "workshop") # character 3 == 5 # logical factor(1:3, levels = 1:3, labels = c("low", "medium", "high")) # factor ``` -(Tip) +(Tip) +
-The `factor` is a R *function*. Ususally, the first component ("`1:3`") of a R function is the object, the target this function is going to work on. The rest components ("`levels = 1:3, labels = c("low", "medium", "high")`") are arguments, with which setting the special conditions the object is dealt. +The `factor` is a R *function*. Ususally, the first component ("`1:3`") of a R function is the object, the target this function is going to work on. The rest components ("`levels = 1:3, labels = c("low", "medium", "high")`") are arguments, with which setting the special conditions the object is dealt. If you are not sure about the utility of certain arguments, ask R for help by `?`, e.g., @@ -165,7 +165,7 @@ If you are not sure about the utility of certain arguments, ask R for help by `?
-## Commen Data Type: Dataset {.smaller} +## Common Data Type: Dataset {.smaller} ```{r} matrix(1:4, ncol = 2) # matrix @@ -206,7 +206,7 @@ Basic rules for object name: str(df) ``` -* Unique values (Tip) +* Unique values (Tip) ```{r} unique(df$x) @@ -267,228 +267,13 @@ is.na(x) # detect if x includes missing values * Four types of data: numeric, character, logical, factor * Four types of datasets: matrix, data.frame, list, array * Save the things into an object by `<-` -
-
-
-
-
-
- -

- -* Next: Data input - + - - - -# Data Input and Manipulation -## Input default data types{.build} - -* Default data types: .Rds, .Rdata(.Rda) - -```{r eval=FALSE} -load(".RData") - -df_txt <- read.table(".txt") -df_csv <- read.csv(".csv") - -``` - -* Some data are already embedded in R. To call them, use `data()`, e.g. - -```{r eval=FALSE} -data(mtcars) -``` - - -## Input data with packages -```{r eval=FALSE} -# SPSS, Stata, SAS -library(haven) -df_spss <- read_spss(".sav") -df_stata <- read_dta(".dta") -df_sas <- read_sas(".sas7bdat") - -# Excel sheets -library(readxl) -df_excel <- read_excel(".xls");read_excel(".xlsx") - -# JavaScript Object Notation -library(rjson) -df_json <- fromJSON(file = ".json" ) - -# XML/Html -df_xml <- xmlTreeParse("") -df_html <- readHTMLTable(url, which=3) - -``` - - -## Output data{.build} - -* Save in a R dataset (`.RData`) - -```{r eval = F} -save(object, file = "./Data/mydata.Rdata") -``` - -* Save as `.csv` - -```{r eval = F} -write.csv(object, file = "mydata.csv") -``` - -* Save as `.feather` (Tip) - -```{r eval=F} -feather::write_feather(mydata, "mydata.feather") -``` - -
-Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames, which can be read by both R and Python. -See more details in [Feather](https://blog.rstudio.org/2016/03/29/feather/). -
- - -## Manipulate the data{.build} -* let's call a dataset first, - -```{r} -data(mtcars) -``` - -* Variable numbers and Observations - -```{r} -ncol(mtcars);names(mtcars) -nrow(mtcars) -``` - - -## Have a glimpse. -```{r} -dplyr::glimpse(mtcars) -``` - ----- -```{r} -head(mtcars) # show the first six lines of mtcars -``` - -## Let's zoom in!{.build} -* locate a specific row, column, or cell of data: `data[row#, col#]` or `data["rowName","colName"]`. - -```{r} -mtcars[1:2,3:4] # show first and the second rows of the third and fourth columns -``` - -```{r eval=FALSE} -mtcars[ ,"mpg"] # show the column "mpg" -mtcars[ ,"mpg"][3] -``` - ----- - -Select with special conditions - -```{r} -mtcars[mtcars$mpg < 20,][1,] # show the first rows which mpg are below 5. -``` - -Create new rows/columns - -```{r} -mtcars$id <- seq(1:nrow(mtcars)) -``` - - -## Let's generalize!{.build} -* Summarise vector in categories - -```{r} -unique(mtcars$cyl) -table(mtcars$cyl) -``` - ----- - -For a dataset or a numeric vector - -```{r} -summary(mtcars$cyl) -``` - -One can use `mean`, `sd`, `max`, `min`, etc. to extract specific descriptive statistics. - -```{r} -mean(mtcars$cyl) -``` - -## Let's create!{.build} -* Create a variable into the dataset (Tip) - -```{r} -mtcars$newvar <- c(1:nrow(mtcars)) # create an "ID" variable -mtcars$newvar -``` - -
-Obviously, variables can be immediately overwrite without any specific setting. - -It is convenient but also risky. -
- -* Remove a variable from the dataset - -```{r} -mtcars$newvar <- NULL -mtcars$newvar -``` - ----- - -Remove variable, result, function, or data from the environment - -```{r eval=FALSE} -rm(x) -``` - -Recode a variable: e.g., numeric to binary, mpg > mean, 1, otherwise 0 - -```{r eval=FALSE} -# Method I -mtcars$newvar[mtcars$mpg > mean(mtcars$mpg)] <- 1 -mtcars$newvar[mtcars$mpg <= mean(mtcars$mpg)] <- 0 - -# Method II -mtcars$newvar <- ifelse(mtcars$mpg > mean(mtcars$mpg), 1, 0) # overwrite the NAs -``` - - -## Wrap Up -* Input/output: `load()`/`read.`series and `save()`/`write.`series -* A glimpse of data: `head()` or `dplyr::glimpse` -* Description: `summary()`, `table()` - + More specific: `mean`, `sd`, `max`, `min`, etc. -* Manipulation: - + create: `mtcars$newvar <- c(1:nrow(mtcars))` - + Remove: `mtcars$newvar <- NULL`; `rm()` - + Recode: `recodevar[] <- ` -* There are also [`apply` family](http://www.r-bloggers.com/r-tutorial-on-the-apply-family-of-functions/) functions for with batching management of data. - - -## Next lecture: Hypothsis test - -
-![](http://mathsupport.mas.ncl.ac.uk/images/d/d0/95contint.gif) -
## See you then ~ -
-![](http://rescuethepresent.net/tomandjerry/files/2016/05/16-thanks.gif) +
+
diff --git a/RworkshopII.Rmd b/RworkshopII.Rmd index 08b2047..aae6f78 100644 --- a/RworkshopII.Rmd +++ b/RworkshopII.Rmd @@ -1,325 +1,496 @@ --- title: "Hello, R!" -author: "Yue Hu's R Workshop Series II" +author: "Yue Hu's R Workshop Series I" output: ioslides_presentation: - self_contained: yes + incremental: true logo: image/logo.gif + self_contained: yes + slidy_presentation: null transition: faster widescreen: yes - slidy_presentation: - incremental: yes --- + # Preface -## What Are Covered in This Workshop Series -* [A overview of R](https://rpubs.com/sammo3182/Rintro) -* [Data manipulation (input/output, row/column selections, etc.)](https://rpubs.com/sammo3182/Rintro) -* **Descriptive and binary hypotheses (summary, correlation, t-test, etc.)** -* **Multiple regression (OLS, GLS, MLM, etc.)** -* Multilevel Regression -* Presentation (table, graph) +## What are covered +*Fall* +* A overview of R +* Data manipulation (input/output, row/column selections, etc.) +* Quantitative Analysis +* Basic Data Visualization -# Hypothesis Tests -## package loading -You want the `pacman` package to load multiple packages. -```{r} -pacman::p_load(dplyr) -``` +# An Overview of R +## Why using R in your research? +You can use R to: -## Data Glimpse -```{r} -data("mtcars") -dplyr::glimpse(mtcars) -``` +* Do statistics and solve math problems +* Edit codes in Excel, Python, C++, ... +* Scrape data from texts, websites, databases, pdf... +* Create presentation slides in pdf (as LaTex beamer) or html (as Markdown) +* Create webpages +* Write academic articles and save them in html, pdf, or word. +* Write an academic artical or book (see e.g., [bookdown](https://bookdown.org/home/).) + + + +## Why R rather than the others?? + +
+* It's free!! +* It's developing! + + R is very compatible with new techniques + + e.g., Network analysis, spatial analysis with GIS, and text analysis with big data. +* It's multi-lingual!: `"Hello 你好 здравствуйте"` + + +
+ +
+* It's popular! + + +[Magoulas & King, 2014, *Data Science Salary Survey*.](http://www.oreilly.com/data/free/stratasurvey.csp) +
+ + +## A Trade-Off of the Great Power + +
+ +
Source: NYU Data Services. + +## Software installations +* Software installation (Tip) + + [![R](image/Rlogo.png)](https://www.r-project.org/) + + [![Rstudio](image/rStudioLogo.png)](https://www.rstudio.com/products/rstudio/download/preview/) + +
+Install R before Rstudio; so does in updating. +Using the [Rstudio preview](https://www.rstudio.com/products/rstudio/download/preview/). +
+ +## Package installation and loading{.build} +* Packages are "Apps" for R. + + `install.packages()` + + `install_github("")` +* Find instructions from package repositories: + + [An example](https://github.com/sammo3182/interplot) +* Click the apps: Load the package + + `library()` + + `require()` + +## RStudio{.flexbox .vcenter} + + + + +# Math and Basic Statistics with R +## Set where to locate the data and store the results + +
+* Always check or set the working directory first + + `getwd()` + + `setwd("E:/R workshop/rworkshop")` +
+ +
+* Or click, click, and click in Rstudio + + +
-## Binary Tests: Difference in mean +## Terms of R in plain English {.columns-2} +* **Object**: packing things together and naming it +* **Vector**: + + Mathematics: a one-column matrix + + Practice: a single variable +* **Factor**: + + A special vector + + Special for ordinal or mulinomial variable +* **Matrix** vis-a-vis **Data frame** + + Matrix is a pile of numbers + + Data frame is a dataset + +

+ +* **Array**: a multi-dimension matrix + + + + one-dimension array == vector + + two-dimension array == matrix +* **Function**: a process to handle the object (`functionName(varLabel = varData)`) + + +## Do math with R: Basic Functions + +```{r eval=FALSE} +# basic math +x + (1 - 2) * 3 / 4 + +# advanced math +x^2;sqrt(x);log(x);exp(x) + +# matrix algebra +z <- matrix(1:4, ncol = 2) +z + z - z +z %*% z # inner mulTiplication +z %o% z # outter mulTiplication + +# logical evaluation +x == z; x != Z +x & z; x | z +x > z; x <= z +``` -$H_{0}: \bar{cylinders} = \bar{gears},\ \alpha = .05$ +## Common Data Type: Vector{.smaller} ```{r} -t.test(mtcars$cyl, mtcars$gears) +1:10 # numeric (integer/double) +c("R", "workshop") # character +3 == 5 # logical +factor(1:3, levels = 1:3, labels = c("low", "medium", "high")) # factor ``` +(Tip) ----- +
+The `factor` is a R *function*. Ususally, the first component ("`1:3`") of a R function is the object, the target this function is going to work on. The rest components ("`levels = 1:3, labels = c("low", "medium", "high")`") are arguments, with which setting the special conditions the object is dealt. -`t.test` offers arguments `alternative`, `mu`, `paired`, and `conf.level` for users to change in two-tail/one-tail test, parameter mean, independent/paired comparison, and $\alpha$. +If you are not sure about the utility of certain arguments, ask R for help by `?`, e.g., ```{r eval=FALSE} -# one side, cyl > gear, alpha = .01 -t.test(mtcars$cyl, mtcars$gear, - alternative = "greater", conf.level = .99)) +?factor +``` +
+ -# comparing with the parameter (true value) -t.test(mtcars$cyl, mu = 6) # the true mean is 6. +## Common Data Type: Dataset {.smaller} +```{r} +matrix(1:4, ncol = 2) # matrix +data.frame(x = 1:2, y = 3:4) # data.frame +list(c("one", "two"), c(3, 4)) # 2-D list ``` +---- +```{r} +array(c(1:8), dim = c(2, 2, 3)) # 3-D or n-D "list"" +``` -## Binary Tests: Correlation -$H_{0}: \rho_{(cyl,gear)} = 0,\ \alpha = .05$ +## Save data to an object ```{r} -cor.test(mtcars$cyl, mtcars$gear) +x <- rep(c(.01, .05, .1), times = 2) # repeat 1:5 for twice +df <- data.frame(x = 1:1, y = 3:4) +list <- list(x, df) + +list # == print(list) ``` ---- -`cor.test` offers various arguments as in `t.test` for more specific settings. Moreover, users can use the `method` argument to set the method to calculate the correlations, "Pearson", "Kendall", or "Spearman." (Tip) +Basic rules for object name: + +* Don't start with numbers (WRONG: `1stday`) +* No special signs except for `.` and `-` (WRONG: `M&M`) +* Case sensitivity (`X != x`) + + + + +## Attributes of an object {.smaller .build} +* Structure ```{r} -cor.test(mtcars$cyl, mtcars$gear, method = "kendall") +str(df) ``` +* Unique values (Tip) + +```{r} +unique(df$x) +```
-Do I have to type the `mtcars$` every time? +What is the `$`? It is used to call specific columns a data.frame. -* No you don't. - + It offers a potential for cross-dataset operation, though. - + Use `within()`: e.g., `within(mtcars, cor.test(cyl, gear))` - + Use `attach()` (not recommonded) -
+To call the components in a vector, we use "[]" +```{r} +x[3] +``` ----- +To call the components in a list, we use "[[]]" -We can get the correlation matrix, too: ```{r} -cor(mtcars[,1:4]) +list[[2]] ``` -## Present the correlations -You want the `corrplot` package. -```{r fig.height=4} -cor(mtcars) %>% corrplot::corrplot() -``` +
----- +* Names -Or a mixed format: ```{r} -cor(mtcars) %>% corrplot::corrplot.mixed() +names(df) ``` +---- -## Binary Tests: ANOVA {.smaller} -One way or two way ANOVA: +Length ```{r} -aov_one <- aov(cyl ~ gear, data = mtcars) #one-way +length(x) +``` -aov_two <- aov(cyl ~ gear + am, data = mtcars) #two-way +Class -summary(aov_one); summary(aov_two) +```{r} +class(x);typeof(x) # ; is used to write two commands in one line ``` +## Detect the attributes +Using `is.` -## Wrap up -* T-test: `t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, conf.level = 0.95, ...)` +```{r} +x <-c(1, 2, NA, 4) +is.numeric(x) +is.na(x) # detect if x includes missing values +``` -* Correlation: `cor.test(x, y, alternative = c("two.sided", "less", "greater"), method = c("pearson", "kendall", "spearman"), conf.level = 0.95, continuity = FALSE, ...) ` -* ANOVA: `aov(formula, data = NULL, ...)` ----- -Next: Multiple regression +## Wrap up {.columns-2} -
-![core](http://www.math.yorku.ca/SCS/spida/lm/mreganim3.gif) -
+* Set the working directory first: `setwd()` +* Four types of data: numeric, character, logical, factor +* Four types of datasets: matrix, data.frame, list, array +* Save the things into an object by `<-` +
+
+
+
+
+
+

+ +* Next: Data input + + -# Multiple Regression -## Ordinary Linear Regression -$Mileage = \beta_0cylinders + \beta_1horsepower + \beta_3weight + \varepsilon$ -```{r} -lm_ols <- lm(mpg ~ cyl + hp + wt, data = mtcars) -``` -* `lm_ols`: Object name -* `mpg`: Dependent variable -* `cyl + hp + wt`: Independent variables -* `data = mtcars`: Where the variables are stored +# Data Input and Manipulation +## Input default data types{.build} +* Default data types: .Rds, .Rdata(.Rda) -## Result{.smaller} +```{r eval=FALSE} +load(".RData") + +df_txt <- read.table(".txt") +df_csv <- read.csv(".csv") -```{r} -summary(lm_ols) ``` -## Nonlinear transition -ln, square, exponential, or inverse +* Some data are already embedded in R. To call them, use `data()`, e.g. -```{r} -lm_tran <- lm(log(mpg) ~ I(cyl^2) + exp(hp) + I(1/wt), data = mtcars) +```{r eval=FALSE} +data(mtcars) ``` -* `log(mpg)`: logistic -* `I(cyl^2), I(1/wt)`: square, inverse -* `exp(hp)`: exponential -## The result {.smaller} +## Input data with packages +```{r eval=FALSE} +# SPSS, Stata, SAS +library(haven) +df_spss <- read_spss(".sav") +df_stata <- read_dta(".dta") +df_sas <- read_sas(".sas7bdat") -```{r} -summary(lm_tran) -``` +# Excel sheets +library(readxl) +df_excel <- read_excel(".xls");read_excel(".xlsx") -## Adding binary variables +# JavaScript Object Notation +library(rjson) +df_json <- fromJSON(file = ".json" ) -When the model including binary variables based on a factor +# XML/Html +df_xml <- xmlTreeParse("") +df_html <- readHTMLTable(url, which=3) -```{r} -mtcars$gear_f <- factor(mtcars$gear, levels = 3:5, labels = c("3-gear", "4-gear", "5-gear")) -table(mtcars$gear) -table(mtcars$gear_f); class(mtcars$gear_f) ``` -## The result {.smaller} -```{r} -lm_f <- lm(mpg ~ cyl + hp + wt + gear_f, data = mtcars) -summary(lm_f) +## Output data{.build} + +* Save in a R dataset (`.RData`) + +```{r eval = F} +save(object, file = "./Data/mydata.Rdata") ``` +* Save as `.csv` -## Interaction -Two-way interaction: horsepower * Weight +```{r eval = F} +write.csv(object, file = "mydata.csv") +``` -```{r} -lm_in <- lm(mpg ~ cyl + hp * wt, data = mtcars) +* Save as `.feather` (Tip) +```{r eval=F} +feather::write_feather(mydata, "mydata.feather") ``` -Equivalent to `lm_in2 <- lm(mpg ~ cyl + hp + wt + hp:wt, data = mtcars)` +
+Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames, which can be read by both R and Python. +See more details in [Feather](https://blog.rstudio.org/2016/03/29/feather/). +
+ +## Manipulate the data{.build} +* let's call a dataset first, -## The result {.smaller} ```{r} -summary(lm_in) +data(mtcars) ``` +* Variable numbers and Observations - -## Post-estimate diagnoses: Residural - -```{r fig.height=3.5, fig.align="center"} -res <- resid(lm_ols); res[1:4] -plot(lm_ols, which = 1) # residural vs. fitted plot +```{r} +ncol(mtcars);names(mtcars) +nrow(mtcars) ``` -## Post-estimate diagnoses: Outliers + +## Have a glimpse. ```{r} -car::outlierTest(lm_ols) # Bonferonni p-value for most extreme obs +dplyr::glimpse(mtcars) ``` ---- ```{r} -car::qqPlot(lm_ols) #qq plot for studentized resid +head(mtcars) # show the first six lines of mtcars +``` + +## Let's zoom in!{.build} +* locate a specific row, column, or cell of data: `data[row#, col#]` or `data["rowName","colName"]`. + +```{r} +mtcars[1:2,3:4] # show first and the second rows of the third and fourth columns ``` +```{r eval=FALSE} +mtcars[ ,"mpg"] # show the column "mpg" +mtcars[ ,"mpg"][3] +``` + +---- + +Select with special conditions + +```{r} +mtcars[mtcars$mpg < 20,][1,] # show the first rows which mpg are below 5. +``` -## Post-estimate diagnoses: CLRM Properties{.build} -* Heteroscedasticity +Create new rows/columns ```{r} -car::ncvTest(lm_ols) +mtcars$id <- seq(1:nrow(mtcars)) ``` -* Multicollinearity + +## Let's generalize!{.build} +* Summarise vector in categories ```{r} -car::vif(lm_ols) +unique(mtcars$cyl) +table(mtcars$cyl) ``` ---- -Autocorrelation +For a dataset or a numeric vector ```{r} -car::durbinWatsonTest(lm_ols) +summary(mtcars$cyl) ``` +One can use `mean`, `sd`, `max`, `min`, etc. to extract specific descriptive statistics. + +```{r} +mean(mtcars$cyl) +``` -## Logit -$vs = \frac{1}{1 + e^{-(\beta_0 + \beta_1cylinder + \beta_2horsepower + \beta_3weight + \varepsilon)}}$ +## Let's create!{.build} +* Create a variable into the dataset (Tip) ```{r} -logit <- glm(vs ~ cyl + hp + wt, data = mtcars, family = "binomial") +mtcars$newvar <- c(1:nrow(mtcars)) # create an "ID" variable +mtcars$newvar ``` -MLE on other distributions: change the value of the argument `family` to `Gamma`, `poisson`, `gaussian`, etc. +
+Obviously, variables can be immediately overwrite without any specific setting. + +It is convenient but also risky. +
-## The result{.smaller} +* Remove a variable from the dataset ```{r} -summary(logit) +mtcars$newvar <- NULL +mtcars$newvar +``` + +---- + +Remove variable, result, function, or data from the environment + +```{r eval=FALSE} +rm(x) ``` +Recode a variable: e.g., numeric to binary, mpg > mean, 1, otherwise 0 -## Interpretation: Margin +```{r eval=FALSE} +# Method I +mtcars$newvar[mtcars$mpg > mean(mtcars$mpg)] <- 1 +mtcars$newvar[mtcars$mpg <= mean(mtcars$mpg)] <- 0 -```{r message=FALSE} -library(mfx) -logit_m <- logitmfx(vs ~ cyl + hp + wt, data = mtcars) -logit_m +# Method II +mtcars$newvar <- ifelse(mtcars$mpg > mean(mtcars$mpg), 1, 0) # overwrite the NAs ``` -## Interpretation: Predicted probability -Predicted Probability when `cyl` changes from 4 to 6. -```{r} -# Step 1: creat an aggregate data -mtcars_fake <- with(mtcars, data.frame(cyl = 4:6, hp = mean(hp), wt = mean(wt))) -# Step 2: predict based on the new data -logit_pp4 <- cbind(mtcars_fake,predict(logit, newdata = mtcars_fake, type = "link", se = TRUE)) -# Step 3: convert to probability -logit_pp4 <- within(logit_pp4, {pp <- plogis(fit) - lb <- plogis(fit - 1.96 * se.fit) - ub <- plogis(fit + 1.96 * se.fit)}) -logit_pp4[,7:9] -``` - - -## Wrap Up -* OLS: `lm(Y ~ X, data = data)` - + Non-linear transformations: `I(X^2)`, `exp(X)`, `log(X)`. - + Using factor variable: R will handle that for you. - + Interaction: `lm(Y ~ X * Z, data = data)`. - + Post-estimate diagnoses: `resid()`, `outlierTest()`, `qqPlot()`, `ncvTest()`, `vif()`, `durbinWatsonTest()` -* Logit: `glm(Y ~ X, data = data, family = "binomial")` - + Margins: using `mfx::logitmfx` - + Predict probabilty: - + Step 1: create an aggregate data - + Step 2: predict the log odds - + Step 3: transfer to probability - ----- +## Wrap Up +* Input/output: `load()`/`read.`series and `save()`/`write.`series +* A glimpse of data: `head()` or `dplyr::glimpse` +* Description: `summary()`, `table()` + + More specific: `mean`, `sd`, `max`, `min`, etc. +* Manipulation: + + create: `mtcars$newvar <- c(1:nrow(mtcars))` + + Remove: `mtcars$newvar <- NULL`; `rm()` + + Recode: `recodevar[] <- ` +* There are also [`apply` family](http://www.r-bloggers.com/r-tutorial-on-the-apply-family-of-functions/) functions for with batching management of data. -Next: Presenting with R + +## Next lecture: Hypothsis test
+![](http://mathsupport.mas.ncl.ac.uk/images/d/d0/95contint.gif) +
- - - - ## See you then ~ -
- - +
+![](http://rescuethepresent.net/tomandjerry/files/2016/05/16-thanks.gif) +
-
## External Sources * My email: [yue-hu-1@uiowa.edu](mailto: yue-hu-1@uiowa.edu) @@ -337,3 +508,6 @@ Next: Presenting with R + http://shiny.stat.ubc.ca/r-graph-catalog/ + + + diff --git a/RworkshopIII.Rmd b/RworkshopIII.Rmd index 1a5e38d..08b2047 100644 --- a/RworkshopIII.Rmd +++ b/RworkshopIII.Rmd @@ -1,289 +1,331 @@ --- title: "Hello, R!" -author: "Yue Hu's R Workshop Series III" +author: "Yue Hu's R Workshop Series II" output: ioslides_presentation: - incremental: yes + self_contained: yes logo: image/logo.gif - slidy_presentation: null transition: faster widescreen: yes + slidy_presentation: + incremental: yes --- +# Preface +## What Are Covered in This Workshop Series +* [A overview of R](https://rpubs.com/sammo3182/Rintro) +* [Data manipulation (input/output, row/column selections, etc.)](https://rpubs.com/sammo3182/Rintro) +* **Descriptive and binary hypotheses (summary, correlation, t-test, etc.)** +* **Multiple regression (OLS, GLS, MLM, etc.)** +* Multilevel Regression +* Presentation (table, graph) -## Tabling -There are over twenty packages for [table presentation](http://conjugateprior.org/2013/03/r-to-latex-packages-coverage/) in R. My favoriate three are `stargazer`, `xtable`, and `texreg`. -(Sorry, but all of them are for **Latex** output) -* `stargazer`: good for summary table and regular regression results -* `texreg`: when some results can't be presented by `stargazer`, try `texreg` (e.g., MLM results.) -* `xtable`: the most extensively compatible package, but need more settings to get a pretty output, most of which `stargazer` and `texreg` can automatically do for you. +# Hypothesis Tests -## An example {.smaller .columns-2} +## package loading +You want the `pacman` package to load multiple packages. -```{r message = F} -lm_ols <- lm(mpg ~ cyl + hp + wt, data = mtcars) -stargazer::stargazer(lm_ols, type = "text", align = T) +```{r} +pacman::p_load(dplyr) ``` ----- +## Data Glimpse +```{r} +data("mtcars") +dplyr::glimpse(mtcars) +``` -Present in PDF -
- -
- -* For the users of MS Word, click [here](http://www.r-statistics.com/2010/05/exporting-r-output-to-ms-word-with-r2wd-an-example-session/). +## Binary Tests: Difference in mean +$H_{0}: \bar{cylinders} = \bar{gears},\ \alpha = .05$ -## But...why tabulating if you can plot? -Three types of graphic presenting approaches in R: +```{r} +t.test(mtcars$cyl, mtcars$gears) +``` + +---- -* Basic plots: `plot()`. -* Lattice plots: e.g., `ggplot()`. -* Interactive plots: `shiny()`. (save for later) - +
- -
+`t.test` offers arguments `alternative`, `mu`, `paired`, and `conf.level` for users to change in two-tail/one-tail test, parameter mean, independent/paired comparison, and $\alpha$. -## Basic plot -Pro: +```{r eval=FALSE} +# one side, cyl > gear, alpha = .01 +t.test(mtcars$cyl, mtcars$gear, + alternative = "greater", conf.level = .99)) -* Embedded in R -* Good tool for data exploration. -* Spatial analysis and 3-D plots. +# comparing with the parameter (true value) +t.test(mtcars$cyl, mu = 6) # the true mean is 6. -Con: +``` -* Not very pretty -* Not very flexible -## An example: create a histogram +## Binary Tests: Correlation +$H_{0}: \rho_{(cyl,gear)} = 0,\ \alpha = .05$ -```{r fig.align="center"} -hist(mtcars$mpg) +```{r} +cor.test(mtcars$cyl, mtcars$gear) ``` -## Saving the plot{.build} -* Compatible format:`.jpg`, `.png`, `.wmf`, `.pdf`, `.bmp`, and `postscript`. -* Process: - 1. call the graphic device - 2. plot - 3. close the device - -```{r eval = F} -jpeg("histgraph.jpg") -hist -dev.off() +---- + +`cor.test` offers various arguments as in `t.test` for more specific settings. Moreover, users can use the `method` argument to set the method to calculate the correlations, "Pearson", "Kendall", or "Spearman." (Tip) + +```{r} +cor.test(mtcars$cyl, mtcars$gear, method = "kendall") ``` -Tip
-Sometimes, RStudio may distort the graphic output. In this situation, try to zoom or use `windows()` function. +Do I have to type the `mtcars$` every time? + +* No you don't. + + It offers a potential for cross-dataset operation, though. + + Use `within()`: e.g., `within(mtcars, cor.test(cyl, gear))` + + Use `attach()` (not recommonded)
+ ---- -The device list: +We can get the correlation matrix, too: +```{r} +cor(mtcars[,1:4]) +``` + +## Present the correlations +You want the `corrplot` package. +```{r fig.height=4} +cor(mtcars) %>% corrplot::corrplot() +``` -| Function | Output to | -|----------------------------- |------------------ | -| pdf("mygraph.pdf") | pdf file | -| win.metafile("mygraph.wmf") | windows metafile | -| png("mygraph.png") | png file | -| jpeg("mygraph.jpg") | jpeg file | -| bmp("mygraph.bmp") | bmp file | -| postscript("mygraph.ps") | postscript file | +---- +Or a mixed format: +```{r} +cor(mtcars) %>% corrplot::corrplot.mixed() +``` -## `ggplot`: the most popular graphic engine in R {.build} -+ Built by Hadley Wickham based on Leland Wilkinson's *Grammar of Graphics*. -+ It breaks the plot into components as scales and layers---increase the flexibility. -+ To use `ggplot`, one needs to install the package `ggplot2` first. +## Binary Tests: ANOVA {.smaller} +One way or two way ANOVA: -```{r message=FALSE} -library(ggplot2) -``` +```{r} +aov_one <- aov(cyl ~ gear, data = mtcars) #one-way +aov_two <- aov(cyl ~ gear + am, data = mtcars) #two-way -## Histogram in `ggplot` -```{r fig.align="center", fig.height=2.7} -ggplot(mtcars, aes(x=mpg)) + - geom_histogram(aes(y=..density..), binwidth=2, colour="black") +summary(aov_one); summary(aov_two) ``` -## Decoration -```{r fig.align="center", fig.height=2.7} -ggplot(mtcars, aes(x=mpg)) + - geom_histogram(aes(y=..density..), binwidth=2, colour="black", fill="purple") + - geom_density(alpha=.2, fill="blue") + # Overlay with transparent density plot - theme_bw() + ggtitle("histogram with a Normal Curve") + - xlab("Miles Per Gallon") + ylab("Density") -``` +## Wrap up +* T-test: `t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, conf.level = 0.95, ...)` -## Break in Parts:{.smaller} +* Correlation: `cor.test(x, y, alternative = c("two.sided", "less", "greater"), method = c("pearson", "kendall", "spearman"), conf.level = 0.95, continuity = FALSE, ...) ` -```{r eval=FALSE} -ggplot(data = mtcars, aes(x=mpg)) + - geom_histogram(aes(y=..density..), binwidth=2, colour="black", fill="purple") + - geom_density(alpha=.2, fill="blue") + # Overlay with transparent density plot - theme_bw() + ggtitle("histogram with a Normal Curve") + - xlab("Miles Per Gallon") + ylab("Density") -``` -* `data`: The data that you want to visualise - -* `aes`: Aesthetic mappings -describing how variables in the data are mapped to aesthetic attributes - + horizontal position (`x`) - + vertical position (`y`) - + colour - + size -* `geoms`: Geometric objects that represent what you actually see on -the plot - + points - + lines - + polygons - + bars +* ANOVA: `aov(formula, data = NULL, ...)` ---- -* `theme`, `ggtitle`, `xlab`, `ylab`: decorations. -* Other parts you may see in some developed template - + `stats`: Statistics transformations - + `scales`: relate the data to the aesthetic - + `coord`: a coordinate system that describes how data coordinates are -mapped to the plane of the graphic. - + `facet`: a faceting specification describes how to break up the data into sets. +Next: Multiple regression +
+![core](http://www.math.yorku.ca/SCS/spida/lm/mreganim3.gif) +
-## Save `ggplot` -* `ggsave(, "")`: - + When the `` is omitted, R will save the last presented plot. - + There are additional arguments which users can use to adjust the size, path, scale, etc. +# Multiple Regression +## Ordinary Linear Regression +$Mileage = \beta_0cylinders + \beta_1horsepower + \beta_3weight + \varepsilon$ +```{r} +lm_ols <- lm(mpg ~ cyl + hp + wt, data = mtcars) +``` -## Plotting with packages: Map +* `lm_ols`: Object name +* `mpg`: Dependent variable +* `cyl + hp + wt`: Independent variables +* `data = mtcars`: Where the variables are stored -```{r eval=FALSE} -starbucks <- read.csv("https://opendata.socrata.com/api/views/ddym-zvjk/rows.csv?accessType=DOWNLOAD") +## Result{.smaller} -library(leaflet) -leaflet() %>% addTiles() %>% - setView(-91.535632, 41.660965, zoom = 16) %>% - addMarkers(data = starbucks, lat = ~Latitude, lng = ~Longitude, popup = starbucks$Name) +```{r} +summary(lm_ols) ``` ----- +## Nonlinear transition +ln, square, exponential, or inverse +```{r} +lm_tran <- lm(log(mpg) ~ I(cyl^2) + exp(hp) + I(1/wt), data = mtcars) +``` -```{r two-column, echo=FALSE, results = 'asis', out.extra = '', cache=TRUE} -starbucks <- read.csv("https://opendata.socrata.com/api/views/ddym-zvjk/rows.csv?accessType=DOWNLOAD") +* `log(mpg)`: logistic +* `I(cyl^2), I(1/wt)`: square, inverse +* `exp(hp)`: exponential +## The result {.smaller} -library(leaflet) -leaflet() %>% addTiles() %>% - setView(-91.535632, 41.660965, zoom = 16) %>% - addMarkers(data = starbucks, lat = ~Latitude, lng = ~Longitude, popup = starbucks$Name) +```{r} +summary(lm_tran) ``` +## Adding binary variables -## Plotting with packages: `dotwhisker`{.smaller} -Plot the comparable coefficients or other estimates (margins, predicted probabilities, etc.). +When the model including binary variables based on a factor -```{r message=FALSE} -library(dotwhisker) -library(broom) -lm_df <- tidy(lm_ols) -lm_df +```{r} +mtcars$gear_f <- factor(mtcars$gear, levels = 3:5, labels = c("3-gear", "4-gear", "5-gear")) +table(mtcars$gear) +table(mtcars$gear_f); class(mtcars$gear_f) ``` ----- +## The result {.smaller} -```{r message=F, fig.align="center", fig.height=4} -dwplot(lm_df) +```{r} +lm_f <- lm(mpg ~ cyl + hp + wt + gear_f, data = mtcars) +summary(lm_f) ``` -## Plotting with packages: `interplot`{.smaller} +## Interaction +Two-way interaction: horsepower * Weight - -```{r message=FALSE} -library(interplot) +```{r} lm_in <- lm(mpg ~ cyl + hp * wt, data = mtcars) + +``` + +Equivalent to `lm_in2 <- lm(mpg ~ cyl + hp + wt + hp:wt, data = mtcars)` + + +## The result {.smaller} +```{r} summary(lm_in) +``` + + +## Post-estimate diagnoses: Residural + +```{r fig.height=3.5, fig.align="center"} +res <- resid(lm_ols); res[1:4] +plot(lm_ols, which = 1) # residural vs. fitted plot +``` + +## Post-estimate diagnoses: Outliers +```{r} +car::outlierTest(lm_ols) # Bonferonni p-value for most extreme obs ``` ---- -```{r fig.align="center"} -interplot(m = lm_in, var1 = "hp", var2 = "wt") + - xlab("Automobile Weight (thousands lbs)") + - ylab("Estimated Coefficient for \nGross horsepower") +```{r} +car::qqPlot(lm_ols) #qq plot for studentized resid ``` -## Wrap Up -* R has a bunch of packages for creating publishing-like tables, e.g., `stargazer`, `xtable`, and `texreg` -* There are three ways to visualize statistics in R: basic, lattice (`ggplot`), and interactive. - + basic: e.g., `hist()` - + `ggplot`: /n e.g., `ggplot(, aes(x=)) + geom_histogram()`. +## Post-estimate diagnoses: CLRM Properties{.build} +* Heteroscedasticity + +```{r} +car::ncvTest(lm_ols) +``` -* Two special types of plot: - + Estimate plot with [`dotwhisker`](https://cran.r-project.org/web/packages/interplot/vignettes/interplot-vignette.html). - + Interaction plot with [`interplot`](https://cran.r-project.org/web/packages/dotwhisker/vignettes/dwplot-vignette.html). +* Multicollinearity +```{r} +car::vif(lm_ols) +``` -## Almost the end: one topic left +---- -
-[![present](http://conservatives4palin.com/wp-content/uploads/2013/06/snob.gif)] -
+Autocorrelation +```{r} +car::durbinWatsonTest(lm_ols) +``` -# Version Control -## Just a brief introduction{.columns-2 .build} -
- -
+## Logit +$vs = \frac{1}{1 + e^{-(\beta_0 + \beta_1cylinder + \beta_2horsepower + \beta_3weight + \varepsilon)}}$ + +```{r} +logit <- glm(vs ~ cyl + hp + wt, data = mtcars, family = "binomial") +``` + +MLE on other distributions: change the value of the argument `family` to `Gamma`, `poisson`, `gaussian`, etc. +## The result{.smaller} +```{r} +summary(logit) +``` + + +## Interpretation: Margin + +```{r message=FALSE} +library(mfx) +logit_m <- logitmfx(vs ~ cyl + hp + wt, data = mtcars) +logit_m +``` +## Interpretation: Predicted probability +Predicted Probability when `cyl` changes from 4 to 6. + +```{r} +# Step 1: creat an aggregate data +mtcars_fake <- with(mtcars, data.frame(cyl = 4:6, hp = mean(hp), wt = mean(wt))) +# Step 2: predict based on the new data +logit_pp4 <- cbind(mtcars_fake,predict(logit, newdata = mtcars_fake, type = "link", se = TRUE)) +# Step 3: convert to probability +logit_pp4 <- within(logit_pp4, {pp <- plogis(fit) + lb <- plogis(fit - 1.96 * se.fit) + ub <- plogis(fit + 1.96 * se.fit)}) +logit_pp4[,7:9] +``` +## Wrap Up +* OLS: `lm(Y ~ X, data = data)` + + Non-linear transformations: `I(X^2)`, `exp(X)`, `log(X)`. + + Using factor variable: R will handle that for you. + + Interaction: `lm(Y ~ X * Z, data = data)`. + + Post-estimate diagnoses: `resid()`, `outlierTest()`, `qqPlot()`, `ncvTest()`, `vif()`, `durbinWatsonTest()` +* Logit: `glm(Y ~ X, data = data, family = "binomial")` + + Margins: using `mfx::logitmfx` + + Predict probabilty: + + Step 1: create an aggregate data + + Step 2: predict the log odds + + Step 3: transfer to probability + +---- +Next: Presenting with R -* Tried to recall the deleted codes? -* Tried to figure out what changes? -* Saved a lot of replication files? -* Version control can help you. +
----- + -
-
+ +## See you then ~ -## Using Git with RStudio +
-* RStudio has associate with the Git and SVN very well. -* Process to use git: - + Register a user account in https://github.com. - + Connect your account with RStudio following [this instruction](http://www.molecularecologist.com/2013/11/using-github-with-r-and-rstudio/). - + Create a version-control project in RStudio - + - + Commit, Pull and Push + +
## External Sources +* My email: [yue-hu-1@uiowa.edu](mailto: yue-hu-1@uiowa.edu) + +* Workshops: http://ppc.uiowa.edu/node/3608 +* Consulting service: http://ppc.uiowa.edu/node/3385/ * Q&A Blogs: + http://stackoverflow.com/questions/tagged/r + https://stat.ethz.ch/mailman/listinfo/r-help @@ -294,12 +336,4 @@ interplot(m = lm_in, var1 = "hp", var2 = "wt") + + http://www.cookbook-r.com/Graphs/ + http://shiny.stat.ubc.ca/r-graph-catalog/ -* Workshops: http://ppc.uiowa.edu/node/3608 -* Consulting service: http://ppc.uiowa.edu/node/3385/ - ----- - -
-[![end](http://rescuethepresent.net/tomandjerry/files/2016/05/16-thanks.gif)] -
diff --git a/RworkshopIV.Rmd b/RworkshopIV.Rmd index 6ebfcf5..0e8193c 100644 --- a/RworkshopIV.Rmd +++ b/RworkshopIV.Rmd @@ -1,166 +1,351 @@ --- title: "Hello, R!" -author: "Yue Hu's R Workshop Series IV" +author: "Yue Hu's R Workshop Series III" output: ioslides_presentation: + self_contained: yes incremental: yes logo: image/logo.gif slidy_presentation: null transition: faster widescreen: yes --- +```{r setup, include=FALSE} +knitr::opts_chunk$set(message = FALSE, warning = FALSE) +``` -# Preface -## What Are Covered in This Workshop Series -* [A overview of R](https://rpubs.com/sammo3182/Rintro) -* [Data manipulation (input/output, row/column selections, etc.)](https://rpubs.com/sammo3182/Rintro) -* [Descriptive and binary hypotheses (summary, correlation, t-test, etc.)](http://rpubs.com/sammo3182/Rstat) -* [Multiple regression (OLS, GLS, MLM, etc.)](http://rpubs.com/sammo3182/Rstat) -* **Multilevel Regression** -* [Presentation (table, graph)](https://rpubs.com/sammo3182/Rpresent) +## Tabling +There are over twenty packages for [table presentation](http://conjugateprior.org/2013/03/r-to-latex-packages-coverage/) in R. My favoriate three are `stargazer`, `xtable`, and `texreg`. -# What's Multilevel Effects?{.hcenter} -## An Example about Pizza -How do cost/fuel affect pizza quality? +(Sorry, but all of them are for **Latex** output) - +* `stargazer`: good for summary table and regular regression results +* `texreg`: when some results can't be presented by `stargazer`, try `texreg` (e.g., MLM results.) +* `xtable`: the most extensively compatible package, but need more settings to get a pretty output, most of which `stargazer` and `texreg` can automatically do for you. +## An example {.smaller .columns-2} -## An Example about Pizza{.hcenter} -How do those factors vary by neighborhood? +```{r message = F} +lm_ols <- lm(mpg ~ cyl + hp + wt, data = mtcars) +stargazer::stargazer(lm_ols, type = "text", align = T) +``` - +* For Word users, click [here](http://www.r-statistics.com/2010/05/exporting-r-output-to-ms-word-with-r2wd-an-example-session/). +## Print out directly in the website or the manuscript{.smaler} +```{r results='asis'} +stargazer::stargazer(lm_ols, type = "html", align = T) +``` -## Data -(Based on Harris \& Lander's ["Predicting Pizza in Chinatown: An Intro to Multilevel Regression"(2010)](http://www.jaredlander.com/wordpress/wordpress-2.9.2/wordpress/wp-content/uploads/2010/10/NYC-PA-Meetup-Multilevel-Models.ppt)) +# But...why tabulating the results if you can plot it? +## How do R plots look like +
+ +
-```{r message=FALSE} -library(RCurl);library(dplyr) # load package for reading url and manipulate data -path <- getURL("https://raw.githubusercontent.com/HarlanH/nyc-pa-meetup-multilevel-pizza/master/Fake%20Pizza%20Data.csv") -pizza <- read.csv(text = path) # read the csv data -glimpse(pizza) +---- +
+ +
+ +---- +
+ +
+ +## Too "fancy" for your research? Then... +*
+ +
+ +---- +
+ +
+ +---- +
+ +
+ +## Let's Start! + +* Basic plots: `plot()`. +* Lattice plots: e.g., `ggplot()`. +* Interactive plots: `shiny()`. (save for later) + +
+ +
+ +## Basic plot +Pro: + +* Embedded in R +* Good tool for data exploration. +* Spatial analysis and 3-D plots. + +Con: + +* Not very pretty +* Not very flexible + +## An example: create a histogram + +```{r fig.align="center"} +hist(mtcars$mpg) +``` + +## Saving the plot{.build} +* Compatible format:`.jpg`, `.png`, `.wmf`, `.pdf`, `.bmp`, and `postscript`. +* Process: + 1. call the graphic device + 2. plot + 3. close the device + +```{r eval = F} +jpeg("histgraph.jpg") +hist +dev.off() ``` +Tip +
+Sometimes, RStudio may distort the graphic output. In this situation, try to zoom or use `windows()` function. +
+ +---- + +The device list: + +| Function | Output to | +|----------------------------- |------------------ | +| pdf("mygraph.pdf") | pdf file | +| win.metafile("mygraph.wmf") | windows metafile | +| png("mygraph.png") | png file | +| jpeg("mygraph.jpg") | jpeg file | +| bmp("mygraph.bmp") | bmp file | +| postscript("mygraph.ps") | postscript file | + -## Neighborhood Variance{.smaller} -(Multilevel Effects!!) +## `ggplot`: the most popular graphic engine in R {.build} -```{r message=FALSE, fig.align="center", fig.height = 3} ++ Built by Hadley Wickham based on Leland Wilkinson's *Grammar of Graphics*. ++ It breaks the plot into components as scales and layers---increase the flexibility. ++ To use `ggplot`, one needs to install the package `ggplot2` first. + +```{r message=FALSE} library(ggplot2) -lm_nei <- lm(Rating ~ CostPerSlice * Neighborhood, data=pizza) -pizza$pre_nei <- predict(lm_nei) -ggplot(pizza, aes(CostPerSlice, Rating, color=Neighborhood)) + - geom_point() + theme_bw() + - geom_smooth(aes(y = pre_nei), method='lm',se=FALSE) + - xlab("Cost per Slice") + ylab("Quality") ``` -## Dig in -```{r echo=FALSE, fig.align="center"} -lm_sour <- lm(Rating ~ CostPerSlice * HeatSource, data=pizza) -pizza$pre_sour <- predict(lm_sour) -ggplot(pizza, aes(CostPerSlice, Rating, color=HeatSource)) + - geom_point() + facet_wrap(~ Neighborhood) + theme_bw() + - xlab("Cost per Slice") + ylab("Quality") + - geom_smooth(aes(y=pre_sour), method='lm', se=FALSE) +## Histogram in `ggplot` +```{r fig.align="center", fig.height=2.7} +ggplot(mtcars, aes(x=mpg)) + + geom_histogram(aes(y=..density..), binwidth=2, colour="black") ``` +## Decoration + +```{r fig.align="center", fig.height=2.7} +ggplot(mtcars, aes(x=mpg)) + + geom_histogram(aes(y=..density..), binwidth=2, colour="black", fill="purple") + + geom_density(alpha=.2, fill="blue") + # Overlay with transparent density plot + theme_bw() + ggtitle("histogram with a Normal Curve") + + xlab("Miles Per Gallon") + ylab("Density") +``` -## Multilevel Model (MLM) -```{r message = F} -library(lme4) # package for multilevel model -# Allow intercept varying -mlm_fix <- lmer(Rating ~ HeatSource + (1 | Neighborhood),data=pizza) -# Allow slope varing -mlm_ran <- lmer(Rating ~ HeatSource + CostPerSlice + - (CostPerSlice | Neighborhood),data=pizza) +## Break in Parts:{.smaller} -# Slop varying but not correlate to intercept -mlm_ur <- lmer(Rating ~ HeatSource + CostPerSlice + - (CostPerSlice || Neighborhood),data=pizza) -# Just for the purpose of instruction +```{r eval=FALSE} +ggplot(data = mtcars, aes(x=mpg)) + + geom_histogram(aes(y=..density..), binwidth=2, colour="black", fill="purple") + + geom_density(alpha=.2, fill="blue") + # Overlay with transparent density plot + theme_bw() + ggtitle("histogram with a Normal Curve") + + xlab("Miles Per Gallon") + ylab("Density") ``` +* `data`: The data that you want to visualise + +* `aes`: Aesthetic mappings +describing how variables in the data are mapped to aesthetic attributes + + horizontal position (`x`) + + vertical position (`y`) + + colour + + size +* `geoms`: Geometric objects that represent what you actually see on +the plot + + points + + lines + + polygons + + bars -## Result: Fixed Effect {.smaller .columns-2} -```{r size = "tiny"} -summary(mlm_fix) +---- + +* `theme`, `ggtitle`, `xlab`, `ylab`: decorations. +* Other parts you may see in some developed template + + `stats`: Statistics transformations + + `scales`: relate the data to the aesthetic + + `coord`: a coordinate system that describes how data coordinates are +mapped to the plane of the graphic. + + `facet`: a faceting specification describes how to break up the data into sets. + +## An advanced version: +```{r fig.height=3} +library(dplyr) +df_desc <- select(mtcars, am, carb, cyl, gear,vs) %>% # select the variables + tidyr::gather(var, value) # reshape the wide data to long data + +ggplot(data = df_desc, aes(x = as.factor(value))) + geom_bar() + + facet_wrap(~ var, scales = "free", ncol = 5) + xlab("") +``` + +## Save `ggplot` +* `ggsave(, "")`: + + When the `` is omitted, R will save the last presented plot. + + There are additional arguments which users can use to adjust the size, path, scale, etc. + + + +## Plotting with packages: `dotwhisker`{.smaller} +Plot the comparable coefficients or other estimates (margins, predicted probabilities, etc.). + +```{r message=FALSE} +library(dotwhisker) +library(broom) +m1 <- lm(mpg ~ wt + cyl + disp + gear, data = mtcars) ``` -## Result: Fixed + Random Effect {.smaller .columns-2} +---- ```{r} -summary(mlm_ran) +summary(m1) ``` -## Result: Uncorrelated Random Effect {.smaller .columns-2} +---- + ```{r} -summary(mlm_ur) # Shouldn't do since cor between CostPerSlice and Interaction was - .3 +dwplot(m1) ``` -## About Covariance Matrix -* Using Cholesky parameterization (which requires exchange matrix.): - + Avoid uneccessary rising of asymptotically flat surface warning---**Easier to converge**. - + Benefit **small- to medium-sized** data sets and complex variance-covariance models. -* If you want to use log-Cholesky (unconstrained) parameterization, you want to use `nlme` package +---- + +```{r message=F, fig.align="center", fig.height=4} +m2 <- update(m1, . ~ . + hp) # add another predictor +m3 <- update(m2, . ~ . + am) # and another -## Diagnosis -Fitted vs. residual plot -```{r, fig.align="center", fig.height=4} -plot(mlm_ran, type = c("p", "smooth")) +dwplot(list(m1, m2, m3)) ``` ---- -Quantile-Quantile plots -```{r fig.align="center"} -lattice::qqmath(mlm_ran) +```{r eval = F} +dwplot(list(m1, m2, m3)) + + relabel_y_axis(c("Weight", "Cylinders", "Displacement", + "Gears", "Horsepower", "Manual")) + + theme_bw() + xlab("Coefficient Estimate") + ylab("") + + geom_vline(xintercept = 0, colour = "grey60", linetype = 2) + + ggtitle("Predicting Gas Mileage") + + theme(plot.title = element_text(face="bold"), + legend.justification=c(0, 0), legend.position=c(0, 0), + legend.background = element_rect(colour="grey80"), + legend.title = element_blank()) ``` -## Diagnosis: Posterior Predictive Simulation +---- -```{r} -iqrvec <- sapply(simulate(mlm_ran, 1000), IQR) -obsval <- IQR(pizza$Rating) -post_pred_p <- mean(obsval >= c(obsval, iqrvec)) -post_pred_p +```{r echo = F} +dwplot(list(m1, m2, m3)) + + relabel_y_axis(c("Weight", "Cylinders", "Displacement", + "Gears", "Horsepower", "Manual")) + + theme_bw() + xlab("Coefficient Estimate") + ylab("") + + geom_vline(xintercept = 0, colour = "grey60", linetype = 2) + + ggtitle("Predicting Gas Mileage") + + theme(plot.title = element_text(face="bold"), + legend.justification=c(0, 0), legend.position=c(0, 0), + legend.background = element_rect(colour="grey80"), + legend.title = element_blank()) ``` -Warning: the above method does not allow for the uncertainty in the estimated parameters. -## Present -Fixed effect coefficients: `dotwhisker` +## Plotting with packages: `interplot`{.smaller} + + ```{r message=FALSE} -library(broom);library(dotwhisker) -mlm_coef <- tidy(mlm_ran) -delete <- grep("\\bsd_.*|cor_.*\\b", mlm_coef$term, value = T) -mlm_sub <- filter(mlm_coef, term != delete) %>% filter(term != "(Intercept)") +library(interplot) +lm_in <- lm(mpg ~ cyl + hp * wt, data = mtcars) ``` -Only keep the substantive variables. ---- +```{r} +summary(lm_in) +``` + +---- ```{r fig.align="center"} -dwplot(mlm_sub) + ylab("Fixed Effect") + xlab("Coefficient") + - geom_vline(xintercept = 0, colour = "red", linetype = 2) +interplot(m = lm_in, var1 = "hp", var2 = "wt", hist = TRUE) + + xlab("Automobile Weight (thousands lbs)") + + ylab("Estimated Coefficient for \nGross horsepower") ``` +## Wrap Up +* R has a bunch of packages for creating publishing-like tables, e.g., `stargazer`, `xtable`, and `texreg` -## Interaction{.smaller} -Use `interplot` package: -```{r message=FALSE, fig.align="center", fig.height=3.5, warning=FALSE} -mlm_int <- lmer(Rating ~ HeatSource * CostPerSlice + (CostPerSlice | Neighborhood),data=pizza) -library(interplot) -interplot(mlm_int, var1 = "HeatSource", var2 = "CostPerSlice", hist = T) + - xlab("Cost Per Slice") + ylab("Estimated Coefficient for Heat Source") -``` +* There are three ways to visualize statistics in R: basic, lattice (`ggplot`), and interactive. + + basic: e.g., `hist()` + + `ggplot`: /n e.g., `ggplot(, aes(x=)) + geom_histogram()`. + +* Two special types of plot: + + Estimate plot with [`dotwhisker`](https://cran.r-project.org/web/packages/interplot/vignettes/interplot-vignette.html). + + Interaction plot with [`interplot`](https://cran.r-project.org/web/packages/dotwhisker/vignettes/dwplot-vignette.html). + + +## Almost the end: one topic left + +
+[![present](http://conservatives4palin.com/wp-content/uploads/2013/06/snob.gif)] +
+ + +# Version Control +## Just a brief introduction{.columns-2 .build} +
+ +
+ + + + + + + + +* Tried to recall the deleted codes? +* Tried to figure out what changes? +* Saved a lot of replication files? +* Version control can help you. + +---- + +
+ +
+ + +## Using Git with RStudio + +* RStudio has associate with the Git and SVN very well. +* Process to use git: + + Get a user account in https://github.com. + + Connect your account with RStudio following [this instruction](http://www.molecularecologist.com/2013/11/using-github-with-r-and-rstudio/). + + Create a version-control project in RStudio + + + + Commit, Pull and Push ## External Sources @@ -175,11 +360,11 @@ interplot(mlm_int, var1 = "HeatSource", var2 = "CostPerSlice", hist = T) + + http://shiny.stat.ubc.ca/r-graph-catalog/ * Workshops: http://ppc.uiowa.edu/node/3608 -* Consulting service: http://ppc.uiowa.edu/node/3385/ +* Consulting service: http://ppc.uiowa.edu/isrc/methods-consulting + ----
- +[![end](http://rescuethepresent.net/tomandjerry/files/2016/05/16-thanks.gif)]
- diff --git a/RworkshopV.Rmd b/RworkshopV.Rmd new file mode 100644 index 0000000..d71a39a --- /dev/null +++ b/RworkshopV.Rmd @@ -0,0 +1,201 @@ +--- +title: "Hello, R!" +author: "Yue Hu's R Workshop Series IV" +output: + ioslides_presentation: + incremental: yes + logo: image/logo.gif + slidy_presentation: null + transition: faster + widescreen: yes +--- + +# Preface +## What Are Covered in This Workshop Series +* [A overview of R](https://rpubs.com/sammo3182/Rintro) +* [Data manipulation (input/output, row/column selections, etc.)](https://rpubs.com/sammo3182/Rintro) +* [Descriptive and binary hypotheses (summary, correlation, t-test, etc.)](http://rpubs.com/sammo3182/Rstat) +* [Multiple regression (OLS, GLS, MLM, etc.)](http://rpubs.com/sammo3182/Rstat) +* **Multilevel Regression** +* [Presentation (table, graph)](https://rpubs.com/sammo3182/Rpresent) + + +# What's Multilevel Effects? +## An Example about Pizza{.columns-2} +How do cost/fuel affect pizza quality? + + + + + +How do the impact of these factors vary by neighborhood? + + + + + + +## Data +Based on Harris \& Lander's ["Predicting Pizza in Chinatown: An Intro to Multilevel Regression"(2010)](http://www.jaredlander.com/wordpress/wordpress-2.9.2/wordpress/wp-content/uploads/2010/10/NYC-PA-Meetup-Multilevel-Models.ppt) + +```{r message=FALSE} +library(RCurl);library(dplyr) # load package for reading url and manipulate data +path <- getURL("https://raw.githubusercontent.com/HarlanH/nyc-pa-meetup-multilevel-pizza/master/Fake%20Pizza%20Data.csv") +pizza <- read.csv(text = path) # read the csv data +glimpse(pizza) +``` + + +## Neighborhood Variance{.smaller} +(Multilevel Effects!!) + +```{r message=FALSE, fig.align="center", fig.height = 3} +library(ggplot2) +lm_nei <- lm(Rating ~ CostPerSlice * Neighborhood, data=pizza) +pizza$pre_nei <- predict(lm_nei) +ggplot(pizza, aes(CostPerSlice, Rating, color=Neighborhood)) + + geom_point() + theme_bw() + + geom_smooth(aes(y = pre_nei), method='lm',se=FALSE) + + xlab("Cost per Slice") + ylab("Quality") +``` + +## Dig in + +```{r echo=FALSE, fig.align="center"} +lm_sour <- lm(Rating ~ CostPerSlice * HeatSource, data=pizza) +pizza$pre_sour <- predict(lm_sour) +ggplot(pizza, aes(CostPerSlice, Rating, color=HeatSource)) + + geom_point() + facet_wrap(~ Neighborhood) + theme_bw() + + xlab("Cost per Slice") + ylab("Quality") + + geom_smooth(aes(y=pre_sour), method='lm', se=FALSE) +``` + + +## Multilevel Model: Fixed Effect {.smaller .columns-2} +```{r message = F} +library(lme4) # package for multilevel model +# Allow intercept varying +mlm_fix <- lmer(Rating ~ HeatSource + + (1 | Neighborhood), data = pizza) +summary(mlm_fix) +``` + +## Result: Fixed + Random Effect {.smaller .columns-2} +```{r} +# Allow slope varing +mlm_ran <- lmer(Rating ~ HeatSource + CostPerSlice + + (CostPerSlice | Neighborhood), + data = pizza) +summary(mlm_ran) +``` + +## Result: Uncorrelated Random Effect {.smaller .columns-2} +```{r} +# Slop varying but not correlate to intercept +mlm_ur <- lmer(Rating ~ HeatSource + CostPerSlice + + (CostPerSlice || Neighborhood), + data=pizza) +# Just for the purpose of instruction +summary(mlm_ur) # Shouldn't do since cor between CostPerSlice and Interaction was - .3 +``` + +## About Covariance Matrix +* Using Cholesky parameterization (which requires exchange matrix.): + + Avoid uneccessary rising of asymptotically flat surface warning---**Easier to converge**. + + Benefit **small- to medium-sized** data sets and complex variance-covariance models. +* If you want to use log-Cholesky (unconstrained) parameterization, you want to use `nlme` package + + +## Presentation +Fixed effect coefficients: `dotwhisker` +```{r message=FALSE} +library(broom);library(dotwhisker) +mlm_coef <- tidy(mlm_ran) +mlm_coef +``` + +---- + +```{r} +delete <- grep("\\bsd_.*|cor_.*\\b", mlm_coef$term, value = T) +mlm_sub <- filter(mlm_coef, term != delete) %>% filter(term != "(Intercept)") +mlm_sub +``` + +Only keep the substantive variables. + +---- + + +```{r fig.align="center"} +dwplot(mlm_sub) + ylab("Fixed Effect") + xlab("Coefficient") + + geom_vline(xintercept = 0, colour = "red", linetype = 2) +``` + + +## Interaction{.smaller} +Use `interplot` package: +```{r message=FALSE, fig.align="center", fig.height=3.5, warning=FALSE} +mlm_int <- lmer(Rating ~ HeatSource * CostPerSlice + (CostPerSlice | Neighborhood),data=pizza) +library(interplot) +interplot(mlm_int, var1 = "HeatSource", var2 = "CostPerSlice", hist = T) + + xlab("Cost Per Slice") + ylab("Estimated Coefficient for Heat Source") +``` + +## Bonus +### Categorical DV: Ordinal +```{r} +pizza$Rate_o <- cut(pizza$Rating, quantile(pizza$Rating), include.lowest = T, + labels = c(1:4)) %>% + as.ordered() +table(pizza$Rate_o) +``` + +```{r message = FALSE} +library(ordinal) +pizza$Neighborhood_fa <- as.factor(pizza$Neighborhood) +mlm_ord <- clmm(Rate_o ~ HeatSource + (1|Neighborhood_fa), data=pizza) +``` + +## Output + +```{r} +summary(mlm_ord) +``` + + +## Categorical DV: Nominal +```{r warning=FALSE} +pizza$Rate_f <- cut(pizza$Rating, quantile(pizza$Rating), + include.lowest = T, labels = c(1:4)) +mlm_nom <- clmm2(Rate_f ~ 1, nominal = ~ HeatSource, + random = Neighborhood_fa, data=pizza, + nAGQ = 15, Hess = TRUE) #nAGQ set the optimizer, not necessary. +``` + +## Output{.columns-2} +```{r} +summary(mlm_nom) +``` + + +## External Sources +* Q&A Blogs: + + http://stackoverflow.com/questions/tagged/r + + https://stat.ethz.ch/mailman/listinfo/r-help + +* Blog for new stuffs: http://www.r-bloggers.com/ + +* Graph Blogs: + + http://www.cookbook-r.com/Graphs/ + + http://shiny.stat.ubc.ca/r-graph-catalog/ + +* Workshops: http://ppc.uiowa.edu/node/3608 +* Consulting service: http://ppc.uiowa.edu/node/3385/ + +---- + +
+ +
+