tutorial.Rmd


---
title: R meetup - Oslo
output:
  html_document:
    toc: true
    theme: united
  pdf_document:
    toc: true
    highlight: zenburn
---


---------

About me
--------

My name is Leon du Toit and I'm from South Africa.
I tweet under [lcdutoit](https://twitter.com/lcdutoit), [blog](http://www.leonomics.com/) about technology, economics and the world.
Work-wise I really enjoy making sense of data with open source tools.


--------------


Setup
-----

* you need [git](http://git-scm.com/downloads)
* you also need a C++ compiler (on Mac this means XCode command line tools, on windows and Linux probably Clang or a recent version of gcc)
* [download](http://cran.uib.no/) and install R (make sure you get version >= 3.1.0)
* make a workshop directory `$ mkdir rworkshop && cd rworkshop`
* clone workshop package `$ git clone git@github.com:leondutoit/rmeetupdemo.git` into this directory
* you should have a directory structure like this `rworkshop/rmeetupdemo`
* install dependencies `$ cd rmeetupdemo && chmod 755 ./dependencies.R && sudo ./dependencies.R && cd ..`
* install rmeetupdemo: `$ R -e "devtools::install('rmeetupdemo')"`
* lastly, for serving the dashboard, [download](http://www.rstudio.com/products/RStudio/) and install Rstudio IDE

Let me know if you have any issues :)

---------

A bit of R history and context
------------------------------

R is a GNU open source continuation of the S language and environment developed at Bell Laboratories (formerly AT&T) by Rick Becker, John
Chambers and Allan Wilks. From the very beginning it was designed with interactive data oriented work in mind.

It was then reimplemented as "R" by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Aukland, New Zealand. R 1.0.0 was released on 29 February 2000 and 2.0.0 on 4 October 2004.
R 3.0.0 came out on 3 April 2013.

GNU R's implementation consists of 40% C, 36% R, 23 % FORTRAN another 1%. You can browse the code on [this](https://github.com/wch/r-source) github read-only mirror or check it out from [svn](https://svn.r-project.org/R/).

To describe R as a programming language I am going to begin with this quote from [the README](https://github.com/wch/r-source/blob/trunk/README): "The core of R is an interpreted computer language with a syntax superficially similar to C, but which is actually a "functional programming language" with capabilities similar to Scheme." My guess, and this is only my guess, is that the functional programming language is in inverted commas because of mutable state.

R is an excellent language for practical data science - data manipulation, modelling and visualization. It allows you to address a wide variety of common and important use cases with little programming effort. Other reasons are: It is not difficult to run code in production or produce production quality code; the community is vibrant; there is plenty of innovation in package development; there is a low barrier to interoperability with other languages (e.g. C, C++, Java); documentation, examples and help is easy to find and getting started takes little effort. 


---------


Data structures
---------------

Before we begin note two important things:
* R indexes start from 1 not 0.
* idiomatic assignment is done with the `<-` operator.

Core data structures: vector, matrix, list, dataframe.

```{r, message = FALSE, warning = FALSE}

# vector (homogenous)
vec <- c(1, 2, 3)
vec

str(vec)

# matrix (homogenous, I very seldom use them explicitly)
m <- matrix(1:6, ncol = 3, nrow = 2)
m

# lists (heterogenuous, often used)
a_list <- list(elem1 = 1, elem2 = c(1, 2, 3))
a_list

str(a_list)

# data frame (the workhorse tabular data structure)
# think of it as a list of equal length named vectors
dat <- data.frame(x = 1:5, y = c('leon', 'charl', 'du', 'toit', 'ble'), stringsAsFactors = FALSE)
dat

str(dat)
```

Subsetting and indexing into data structures. There are two types of subsetting: 1) preserving which returns a subset of the data structure as the same class and 2) simplifying, which returns a subset as a class of what that element is. Let's make it concrete. Preserving looks like `[]` while simplifying looks like `[[]]` or `$`.

```{r, message = FALSE, warning = FALSE}
# we now have some objects in the environment
# let's have a look what they are
ls()

# a vector is atomic - no simplification possible
vec[1]

class(vec[1])

# matrices are vectors with a dimension attribute - attributes are metadata
attributes(m)

# or
dim(m)

# lists are not atomic and can be subsetted without simplification
a_list[1]

# or with
a_list[[1]]

class(a_list[1])

class(a_list[[1]])

# the same goes for dataframes
dat[1]

dat[[1]]

class(dat[1])

class(dat[[1]])

# other attributes of dataframes
names(dat)

dim(dat)

# the `$` operator is a shorthand for `[[]]` with fuzzy matching
dat$mynewcol <- rep(5, 5)
dat

dat$my

# One of the biggest gotchas...
# when passing a dataframe into a function
# and using column names for access I always use `[[]]`
# this is why

col <- "mynewcol"

dat[[col]]

#but
dat$col


```

-----

Functions
---------

```{r, message = FALSE, warning = FALSE}

# simple example
my_func <- function() {
  print("Hello there")
}
my_func()

# multiply numbers
mult <- function(a, b) {
  a * b
}
mult(3, 3)

# keyword args
mult2 <- function(a, b = 2) {
  a * b
}
mult2(3)

# variable args
mult_all <- function(...) {
  Reduce(mult, c(...))
}
mult_all(1, 2, 3, 4, 5)

# lambdas / anonymous functions
mult_all2 <- function(...) {
  Reduce(function(a, b) { a * b }, c(...))
}
mult_all2(1, 2, 3, 4, 5)

# Iffys - immediately invoked function expressions
(function() {
  d <- data.frame(x = 1:10)
  subset(d, d$x < 5)
})()

my_cleaner <- function(vector) {
  # vector cleaner
  vector[!is.na(vector)]
}

# first class functions
clean <- function(data, clean_func = my_cleaner) {
  clean_func(data)
}
clean(c(NA, 4, 5, 6))

```

These are the basics of functions in R. It is possible to do OO-style programming in R (there is more than one system to do so with) but I have not yet needed it.

-----


Exploring the data with dplyr
-----------------------------

[dplyr](https://github.com/hadley/dplyr) is a package specialised for data manipulation in data analysis. It has three main goals (in the words of the authors): 1) make the most important data manipulation verbs easily available in R; 2) performance for in-memory data; and 3) provide the same API for different in-memory and out-of-memomry data stores. Practially speaking, therefore, you could sample your data, figure it out in memory and execute the same code on your distributed cluster.


Let's manipulate immigration data from the Statistisk sentralbyrå and then talk about what the code is doing.

```{r, eval = FALSE, message = FALSE, warning = FALSE}

library(dplyr)
library(rmeetupdemo)

imm_data <- create_immigration_df()

# let's have a look at the data
basic_plot(imm_data)

# a more in-depth look (remove noise)
elaborate_plot(clean_data(imm_data))

# let's look at growth rates instead
# we'll calculate them using dplyr
# this will show some neat features
imm_growth <- imm_data %>%
  group_by(background, sex) %>%
  mutate(percentage_change = round((value - lag(value))/value*100, 2))

imm_growth %>%
  select(time, background, sex, value, percentage_change) %>%
  glimpse()
```

First we load `dplyr` and `rmeetupdemo` packages. Then we create the immigration data frame using a function from the `rmeetupdemo` package. We can use two plotting functions from `rmeetupdemo` to have a look at the data. In the second plot we use another function from the `rmeetupdemo` package to remove noise from the data. Then we use the data manipulation capabilities of `dplyr` to add another column to the data frame. We calculate the year-on-year percentage change of immigration grouped by background and sex. We assign this modified data.frame to another variable (R and dplyr are clever enough not to copy data in this case).

Let's step through the `dplyr` code. First though, a pause on the `%>%` operator is in order. In R one can define arbitrary operators by placing any character or set of characters between two `%` signs. We could, therefore, make our own multiply operator as such: ``` `%mult%` <- function(a, b) { a * b } ``` and use it in the familiar infix way as such: `4 %mult% 4`. 

The `%>%` operator was defined in this way by the author of the [magrittr](https://github.com/smbache/magrittr) package. It is similar to F#'s pipe-forward operator `|>`, or Clojure's threading macro `->>`. In general it allows you to write ` g(f(x, y), z)` as `x %>% f(y) %>% g(z)`. In our example above the following statements are therefore equivalent:

```{r, eval = FALSE, label = magrittr_example}
# without the pipe
mutate(group_by(df, background, sex), percentage_change = round((value - lag(value))/value*100, 2))

# with the pipe
df %>% group_by(background, sex) %>% mutate(percentage_change = round((value - lag(value))/value*100, 2))
```

The motivation is that it makes code more readable. By avoiding deep nesting it can help promote clear code especially in the context of data manipulation. Stepping through the `dplyr` code is easy now. First we group the data according to `background` and `sex`, and then we calculate the percentage change per group. The call to `mutate` means that we want to change the data frame by adding another column.


-----

Visualise the growth rate with ggvis
------------------------------------

[ggvis](https://github.com/rstudio/ggvis) is an evolution of [ggplot2](http://ggplot2.org/) designed for the web. It is an attempt to implement an interactive grammar of graphics. The [grammar of graphics]() is a concept developed by Leland Wilkinson in his book, [The Grammar of Graphics](http://www.springer.com/statistics/computational+statistics/book/978-0-387-24544-7). In the book he outlines this grammar as a way to describe what your statistical graphic should look like. In this sense it is supposed to be declarative: you say what you want to see and the system figures out the details for you. His implementation was for SPSS but `ggplot2` implemented this in R.

`ggvis` is still in its infancy and lacks many features but is already good enough for simple visualisation use cases.

```{r, eval = FALSE, message = FALSE, warning = FALSE}

library(ggvis)

imm_growth %>% 
  ggvis(~background, ~value) %>% 
  layer_boxplots()

imm_growth %>% 
  ggvis(~time, ~percentage_change, fill = ~factor(background)) %>% 
  layer_points()

imm_growth[complete.cases(imm_growth$percentage_change),] %>%
  group_by(time) %>%
  summarise(total_immigration = sum(value)) %>%
  ggvis(~total_immigration) %>%
  layer_densities(adjust = input_slider(.1, 2, value = 1, step = .1, label = "Bandwidth adjustment"))

```

-----


Make an interactive Rmd document
--------------------------------

```{r setup, echo = FALSE, message = FALSE}
library(knitr)
knit_hooks$set(wrapper = function(before, options, envir) {
  if (before) {
    sprintf('    ```{r %s}\n', options$params.src)
  } else '    ```\n'
})
```

We can now use what we have learnt in combination with RMarkdown to create a dashboard. A basic RMarkdown code block is constructed like this:

```{r, eval = FALSE, echo = TRUE, message = FALSE, warning = FALSE, label = mylabel, wrapper = TRUE}
# code goes here...
```

The `r wrapper = TRUE` option is set to display the code chunk in addition to evaluating it. To make our interactive dashboard we will combine what we have seen so far into one `Rmd` file. 

The code to produce the dashboard is contained in the file in `rmeetupdemo/answers/dashboard1.Rmd`. Rewrite that into your own file.

We will use the [Rstudio IDE](http://www.rstudio.com/products/RStudio/) to run this locally - [download](http://www.rstudio.com/products/rstudio/download/) it; open the IDE, browse to the `dashboard1.Rmd` file and click `run document`. This will serve the `Rmd` file with shiny-server from the IDE. It can also be viewed in the browser.

For production deployments one would use shiny server - an R websocket server. For more information you can have a look at the shiny-server [repo](https://github.com/rstudio/shiny-server) and its [documentation](http://www.rstudio.com/products/shiny/shiny-server/).


-----

Use devtools and roxygen2 to package our code (extra)
-----------------------------------------------------

[devtools](https://github.com/hadley/devtools) is a package for package development. I consider it best practice to use it for development, even though it is not stictly necessary. To see the intended outcome of this section you can look in `rmeetupdemo/answers/rworkshop_package`.

In the interactive R session, in the `rworkshop` directory do the following: 

```{r, eval = FALSE, label = create_new_package}
library(devtools)
create("rworkshop_package")
```

Now create a file in the `R` directory named `immigration_manip.R`. This will hold our package code. Put this into the file:

```{r, eval = FALSE, label = R_package_code}

#' @import dplyr

#' @export
add_growth_rates <- function(df) {
  df %>%
    group_by(background, sex) %>%
    mutate(percentage_change = round((value - lag(value))/value*100, 2))
}

```
Let's step through the few lines of code from top to bottom. In this file we use annotated comments to produce documentation and to generate the necessary code in the `NAMESPACE` file - this will handle the dependency on `dplyr`, by making sure it will be loaded when we load our own package. Using this style of commenting we say that we want to import `dplyr` when loading this code. We also annotate the function `add_growth_rates` to the data with `#' @export`. Here we are saying that we want this function to be available to package users. It is equivalent to declaring a class method public. We then define the function.

Now let's build documentation and install this package into R locally:

```{r, eval = FALSE, label = R_package_install}
library(roxygen2)
roxygenise("rworkshop_package") # use your package name
install("rworkshop_package")
```

Check the contents of the `NAMESPACE` file - you should see and instruction to export the function. Next we can use the package with the `rmeetupdemo` package.

```{r, eval = FALSE, label = R_package_use}
library(rmeetupdemo)
library(rworkshop_package)
df <- clean_data(create_immigration_df())
add_growth_rates(df)
```

We should of course also test our function. To do that we will use the [testthat](https://github.com/hadley/testthat) package. First create the test directory in your package: `$ mkdir tests`. Now create a file called `test_growth.R`. Put this into the file:

```{r, eval = FALSE, label = R_package_testing}
library(testthat)

context("Growth calculations")

test_that("add_growth_rates calculates yearly growth correctly", {
  testdf <- data.frame(
    background = c("a", "a", "b", "b", "b", "b"),
    sex = c("male", "male", "female", "female", "male", "male"),
    value = c(10, 12, 5, 6, 3, 4),
    stringsAsFactors = FALSE)
  outdf <- add_growth_rates(testdf)
  correctdf <- data.frame(
    background = c("a", "a", "b", "b", "b", "b"),
    sex = c("male", "male", "female", "female", "male", "male"),
    value = c(10, 12, 5, 6, 3, 4),
    percentage_change = c(NA, 16.67, NA, 16.67, NA, 25.00),
    stringsAsFactors = FALSE)
  expect_equivalent(outdf, correctdf)
  # no numeric tolerance with identical expectations
  expect_identical(outdf$background, correctdf$background) 
  expect_identical(outdf$sex, correctdf$sex)
  expect_identical(outdf$value, correctdf$value)
  expect_identical(outdf$percentage_change, correctdf$percentage_change)
})

```

Lastly, for the package to be valid, we need to replace the values in the `DESCRIPTION` file with the relavant information:

```{r, eval = FALSE, label = DESCRIPTION_file}
Package: rworkshop_package
Title: What the package does (short line)
Version: 0.1
Authors@R: "First Last <first.last@example.com> [aut, cre]"
Description: What the package does (paragraph)
Depends: R (>= 3.1.0)
License: What license is it under?
LazyData: true
```

We can now refactor the dashboard to use our package. This is, in my experience, a good design pattern: keep data manipulation code in one place, and the interactive visualization elsewhere. The refactored dashboard is in `rmeetupdemo/answers/dashboard2.Rmd`.


-----

Recommended further reading
---------------------------

The possible reading list is vast but I will try to give some pointers to things I consider both important and interesting. 

* I actively work on [a data science wiki](https://github.com/leondutoit/data-centric-programming/wiki) where I show how to do practical work with R and keep a log of interesting articles and books
* [Rstudio](http://www.rstudio.com/) - a leading company that produces a high quality R IDE and many excellent open source R packages

Performance

* [data.table](https://github.com/Rdatatable/data.table) - for highly performant in-memory tabular data manipulation and file reading; this package is very well tested, has excellent documentation and is under active development; it is a go-to if memory footprint and speed are real concerns
* [Rcpp](https://github.com/RcppCore/Rcpp/) - established package for R and C++ interop
* [Rcpp11](https://github.com/Rcpp11/Rcpp11) - R and C++ for the C++11 standard
* [RcppParallel](https://github.com/RcppCore/RcppParallel) - parallel programming with Rcpp
* [RcppEigen](https://github.com/RcppCore/RcppEigen) - Rcpp integration for the Eigen templated linear algebra library
* [RcppArmadillo](https://github.com/RcppCore/RcppArmadillo) - Rcpp integration for Armadillo templated linear algebra library
* [Rllvm](https://github.com/duncantl/Rllvm) - R interface to LLVM C++ API
* [pqr](https://github.com/radfordneal/pqR) - a fork of the GNU R interpreter with innovations for speed and memory efficiency

Databases

* [DBI](https://github.com/rstats-db/DBI) - A database interface (DBI) definition for communication between R and RDBMSs; DB specific packages build on this
* [RPostgreSQL](http://cran.r-project.org/web/packages/RPostgreSQL/index.html) - for PostgreSQL
* [RSQlite](https://github.com/rstats-db/RSQLite) - for sqlite
* [RMySQL](http://cran.r-project.org/web/packages/RMySQL/index.html) - for mysql

Other good packages 

* [ggplot2](https://github.com/hadley/ggplot2) - A grammar of graphics for R
* [shiny](https://github.com/rstudio/shiny) - Easy interactive web applications with R
* [rCharts](https://github.com/ramnathv/rCharts/) - Interactive JS Charts from R
* [pryr](https://github.com/hadley/pryr) - inspect R internals interactively
* [evaluate](https://github.com/hadley/evaluate) - A version of eval for R that returns more information about what happened
* [memoise](https://github.com/hadley/memoise) - memoisation
* [lazyeval](https://github.com/hadley/lazyeval) - lazy evaluation
* [caret](https://github.com/topepo/caret) - predictive modeling
* [knitr](https://github.com/yihui/knitr) - A general-purpose tool for dynamic report generation in R
* [tidyr](https://github.com/hadley/tidyr) - Easily tidy data
* [rlogging](https://github.com/mjkallen/rlogging) - simple logging
* [httr](https://github.com/hadley/httr) - easy HTTP
* [RCurl](http://cran.r-project.org/web/packages/RCurl/index.html) - curl from R (low-level)
* [jsonlite](https://github.com/jeroenooms/jsonlite) - A Robust, High Performance JSON Parser and Generator for R
* [opencpu](https://github.com/jeroenooms/opencpu) - OpenCPU system for embedded scientific computation and reproducible research (or a way to expose your R package through HTTP)

Books

* [Advanced R](http://adv-r.had.co.nz/) - _the_ book for learning R programming
* [R Packages](http://r-pkgs.had.co.nz/) - learn how to write R packages
* [ggplot2: Elegant Graphics for Data Analysis](http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403) - a good introduction to the ideas behind the grammar of graphics as implemented in R; a bit old now but still a very good read
* [Seamless R and C++ Integration](http://www.springer.com/statistics/computational+statistics/book/978-1-4614-6867-7) - RCpp

Interesting projects

* [ropensci](http://ropensci.org/) - access to open scientific data from R
* [revolutionR](http://www.revolutionanalytics.com/) - enterprise version of R 
* [renjin](http://www.renjin.org/) - R on the JVM