---
title: R meetup - Oslo
output:
html_document:
toc: true
theme: united
pdf_document:
toc: true
highlight: zenburn
---
---------
About me
--------
My name is Leon du Toit and I'm from South Africa.
I tweet as [lcdutoit](https://twitter.com/lcdutoit) and [blog](http://www.leonomics.com/) about technology, economics and the world.
Work-wise I really enjoy making sense of data with open source tools.
--------------
Setup
-----
* you need [git](http://git-scm.com/downloads)
* you also need a C++ compiler (on Mac this means the Xcode command line tools; on Windows and Linux probably Clang or a recent version of gcc)
* [download](http://cran.uib.no/) and install R (make sure you get version >= 3.1.0)
* make a workshop directory `$ mkdir rworkshop && cd rworkshop`
* clone workshop package `$ git clone git@github.com:leondutoit/rmeetupdemo.git` into this directory
* you should have a directory structure like this `rworkshop/rmeetupdemo`
* install dependencies `$ cd rmeetupdemo && chmod 755 ./dependencies.R && sudo ./dependencies.R && cd ..`
* install rmeetupdemo: `$ R -e "devtools::install('rmeetupdemo')"`
* lastly, for serving the dashboard, [download](http://www.rstudio.com/products/RStudio/) and install the RStudio IDE
Let me know if you have any issues :)
---------
A bit of R history and context
------------------------------
R is a GNU open source continuation of the S language and environment, which was developed at Bell Laboratories (formerly AT&T) by Rick Becker, John Chambers and Allan Wilks. From the very beginning S was designed with interactive, data-oriented work in mind.
It was then reimplemented as "R" by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand. R 1.0.0 was released on 29 February 2000 and 2.0.0 on 4 October 2004.
R 3.0.0 came out on 3 April 2013.
GNU R's implementation consists of 40% C, 36% R, 23% Fortran and 1% other languages. You can browse the code on [this](https://github.com/wch/r-source) read-only GitHub mirror or check it out from [svn](https://svn.r-project.org/R/).
To describe R as a programming language I am going to begin with this quote from [the README](https://github.com/wch/r-source/blob/trunk/README): "The core of R is an interpreted computer language with a syntax superficially similar to C, but which is actually a "functional programming language" with capabilities similar to Scheme." My guess, and this is only my guess, is that the functional programming language is in inverted commas because of mutable state.
R is an excellent language for practical data science - data manipulation, modelling and visualization. It allows you to address a wide variety of common and important use cases with little programming effort. Other reasons are: it is not difficult to run code in production or to produce production quality code; the community is vibrant; there is plenty of innovation in package development; there is a low barrier to interoperability with other languages (e.g. C, C++, Java); documentation, examples and help are easy to find and getting started takes little effort.
---------
Data structures
---------------
Before we begin note two important things:
* R indexes start from 1 not 0.
* idiomatic assignment is done with the `<-` operator.
Core data structures: vector, matrix, list, dataframe.
```{r, message = FALSE, warning = FALSE}
# vector (homogeneous)
vec <- c(1, 2, 3)
vec
str(vec)
# matrix (homogeneous, I very seldom use them explicitly)
m <- matrix(1:6, ncol = 3, nrow = 2)
m
# lists (heterogeneous, often used)
a_list <- list(elem1 = 1, elem2 = c(1, 2, 3))
a_list
str(a_list)
# data frame (the workhorse tabular data structure)
# think of it as a list of equal length named vectors
dat <- data.frame(x = 1:5, y = c('leon', 'charl', 'du', 'toit', 'ble'), stringsAsFactors = FALSE)
dat
str(dat)
```
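A short sketch of what "homogeneous" means in practice: when values of different types are combined with `c()`, R coerces them to one common type, whereas a list keeps each element as it is. The object names below are made up for illustration.

```{r, message = FALSE, warning = FALSE}
# c() coerces mixed input to a single common type (here: character)
mixed_vec <- c(1, "a", TRUE)
class(mixed_vec)
mixed_vec

# logicals are converted to numbers when combined with numbers
num_vec <- c(1, TRUE, FALSE)
class(num_vec)
num_vec

# a list keeps the type of each element
mixed_list <- list(1, "a", TRUE)
sapply(mixed_list, class)
```

This is why a column of numbers that contains a single stray string ends up as a character vector.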
Subsetting and indexing into data structures. There are two types of subsetting: 1) preserving, which returns a subset with the same class as the original data structure, and 2) simplifying, which returns the element itself, with whatever class that element has. Let's make it concrete. Preserving looks like `[]` while simplifying looks like `[[]]` or `$`.
```{r, message = FALSE, warning = FALSE}
# we now have some objects in the environment
# let's have a look what they are
ls()
# a vector is atomic - no simplification possible
vec[1]
class(vec[1])
# matrices are vectors with a dimension attribute - attributes are metadata
attributes(m)
# or
dim(m)
# lists are not atomic and can be subsetted without simplification
a_list[1]
# or with
a_list[[1]]
class(a_list[1])
class(a_list[[1]])
# the same goes for dataframes
dat[1]
dat[[1]]
class(dat[1])
class(dat[[1]])
# other attributes of dataframes
names(dat)
dim(dat)
# the `$` operator is a shorthand for `[[]]` with partial matching of names
dat$mynewcol <- rep(5, 5)
dat
dat$my
# One of the biggest gotchas...
# when passing a dataframe into a function
# and using column names for access I always use `[[]]`
# this is why
col <- "mynewcol"
dat[[col]]
# but `$` looks for a column literally named "col" and returns NULL
dat$col
```
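The gotcha above can be reproduced inside a function. This is a minimal sketch with hypothetical helper names (`get_preserving`, `get_simplifying`, `get_dollar`) and a made-up data frame; it also shows that matrix-style indexing of a single column simplifies unless `drop = FALSE` is given.

```{r, message = FALSE, warning = FALSE}
people <- data.frame(id = 1:3, name = c("ann", "bob", "cat"), stringsAsFactors = FALSE)

# preserving: the result is still a data frame
get_preserving <- function(df, col_name) df[col_name]
# simplifying: the result is the column vector itself
get_simplifying <- function(df, col_name) df[[col_name]]
# `$` ignores the variable and looks for a column literally called "col_name"
get_dollar <- function(df, col_name) df$col_name

class(get_preserving(people, "name"))
class(get_simplifying(people, "name"))
is.null(get_dollar(people, "name"))

# matrix-style indexing simplifies a single column unless drop = FALSE
class(people[, "name"])
class(people[, "name", drop = FALSE])
```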
-----
Functions
---------
```{r, message = FALSE, warning = FALSE}
# simple example
my_func <- function() {
  print("Hello there")
}
my_func()
# multiply numbers
mult <- function(a, b) {
  a * b
}
mult(3, 3)
# default argument values (b falls back to 2 when not supplied)
mult2 <- function(a, b = 2) {
  a * b
}
mult2(3)
# variable args
mult_all <- function(...) {
  Reduce(mult, c(...))
}
mult_all(1, 2, 3, 4, 5)
# lambdas / anonymous functions
mult_all2 <- function(...) {
  Reduce(function(a, b) { a * b }, c(...))
}
mult_all2(1, 2, 3, 4, 5)
# IIFEs - immediately invoked function expressions
(function() {
  d <- data.frame(x = 1:10)
  subset(d, d$x < 5)
})()
# first class functions: a cleaner that drops missing values...
my_cleaner <- function(vector) {
  vector[!is.na(vector)]
}
# ...is passed to another function as a default argument
clean <- function(data, clean_func = my_cleaner) {
  clean_func(data)
}
clean(c(NA, 4, 5, 6))
```
```
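Because functions are first class values, they can also be returned from other functions and stored in data structures. The sketch below uses made-up names (`make_multiplier`, `double_it`, `triple_it`, `steps`) and is only meant to illustrate the idea.

```{r, message = FALSE, warning = FALSE}
# a function that builds and returns another function (a closure)
make_multiplier <- function(factor) {
  function(x) x * factor
}
double_it <- make_multiplier(2)
triple_it <- make_multiplier(3)
double_it(5)
triple_it(5)

# functions can be stored in a list and applied in turn,
# starting from the initial value 1
steps <- list(double_it, triple_it)
Reduce(function(value, f) f(value), steps, 1)
```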
These are the basics of functions in R. It is possible to do OO-style programming in R (there is more than one system to do so with) but I have not yet needed it.
-----
Exploring the data with dplyr
-----------------------------
[dplyr](https://github.com/hadley/dplyr) is a package specialised for data manipulation in data analysis. It has three main goals (in the words of the authors): 1) make the most important data manipulation verbs easily available in R; 2) performance for in-memory data; and 3) provide the same API for different in-memory and out-of-memory data stores. Practically speaking, therefore, you could sample your data, figure it out in memory and then execute the same code on your distributed cluster.
Let's manipulate immigration data from Statistisk sentralbyrå (Statistics Norway) and then talk about what the code is doing.
```{r, eval = FALSE, message = FALSE, warning = FALSE}
library(dplyr)
library(rmeetupdemo)
imm_data <- create_immigration_df()
# let's have a look at the data
basic_plot(imm_data)
# a more in-depth look (remove noise)
elaborate_plot(clean_data(imm_data))
# let's look at growth rates instead
# we'll calculate them using dplyr
# this will show some neat features
imm_growth <- imm_data %>%
  group_by(background, sex) %>%
  mutate(percentage_change = round((value - lag(value))/value*100, 2))
imm_growth %>%
  select(time, background, sex, value, percentage_change) %>%
  glimpse()
```
First we load `dplyr` and `rmeetupdemo` packages. Then we create the immigration data frame using a function from the `rmeetupdemo` package. We can use two plotting functions from `rmeetupdemo` to have a look at the data. In the second plot we use another function from the `rmeetupdemo` package to remove noise from the data. Then we use the data manipulation capabilities of `dplyr` to add another column to the data frame. We calculate the year-on-year percentage change of immigration grouped by background and sex. We assign this modified data.frame to another variable (R and dplyr are clever enough not to copy data in this case).
Let's step through the `dplyr` code. First though, a pause on the `%>%` operator is in order. In R one can define arbitrary operators by placing any character or set of characters between two `%` signs. We could, therefore, make our own multiply operator as such: ``` `%mult%` <- function(a, b) { a * b } ``` and use it in the familiar infix way as such: `4 %mult% 4`.
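The custom operator described above can be tried directly. The second operator, `%then%`, is a made-up toy that passes its left-hand value to the function on its right; it is only a sketch of the general idea and much simpler than a real pipe operator.

```{r, message = FALSE, warning = FALSE}
# the multiply operator from the text
`%mult%` <- function(a, b) { a * b }
4 %mult% 4

# a toy forward pipe: hand the left-hand value to the function on the right
`%then%` <- function(x, f) f(x)
# reads left to right: take square roots, then sum them
c(4, 9) %then% sqrt %then% sum
```

Operators defined between `%` signs are evaluated left to right, which is what makes chaining them readable.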
The `%>%` operator was defined in this way by the author of the [magrittr](https://github.com/smbache/magrittr) package. It is similar to F#'s pipe-forward operator `|>`, or Clojure's thread-first macro `->`. In general it allows you to write `g(f(x, y), z)` as `x %>% f(y) %>% g(z)`. In our example above the following statements are therefore equivalent:
```{r, eval = FALSE, label = magrittr_example}
# without the pipe
mutate(group_by(df, background, sex), percentage_change = round((value - lag(value))/value*100, 2))
# with the pipe
df %>% group_by(background, sex) %>% mutate(percentage_change = round((value - lag(value))/value*100, 2))
```
The motivation is that it makes code more readable. By avoiding deep nesting it can help promote clear code especially in the context of data manipulation. Stepping through the `dplyr` code is easy now. First we group the data according to `background` and `sex`, and then we calculate the percentage change per group. The call to `mutate` means that we want to change the data frame by adding another column.
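To see what the grouped `mutate` computes, here is a base R sketch of the same per-group calculation on a made-up data frame (`toy_imm` and `pct_change` are hypothetical names). It uses the same formula as the code above and assumes the rows are already ordered by time within each group.

```{r, message = FALSE, warning = FALSE}
toy_imm <- data.frame(
  background = rep(c("a", "b"), each = 4),
  sex = rep(rep(c("male", "female"), each = 2), times = 2),
  value = c(10, 12, 3, 4, 5, 6, 8, 10),
  stringsAsFactors = FALSE)

# same formula as in the dplyr code; c(NA, head(v, -1)) plays the role of lag()
pct_change <- function(v) round((v - c(NA, head(v, -1))) / v * 100, 2)

# ave() applies the function separately within each background/sex group
toy_imm$percentage_change <- ave(
  toy_imm$value, toy_imm$background, toy_imm$sex, FUN = pct_change)
toy_imm
```

The first row of every group is `NA` because there is no previous value to compare with, which is also what `lag` produces in the `dplyr` version.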
-----
Visualise the growth rate with ggvis
------------------------------------
[ggvis](https://github.com/rstudio/ggvis) is an evolution of [ggplot2](http://ggplot2.org/) designed for the web. It is an attempt to implement an interactive grammar of graphics. The grammar of graphics is a concept developed by Leland Wilkinson in his book, [The Grammar of Graphics](http://www.springer.com/statistics/computational+statistics/book/978-0-387-24544-7). In the book he outlines this grammar as a way to describe what your statistical graphic should look like. In this sense it is supposed to be declarative: you say what you want to see and the system figures out the details for you. His implementation was for SPSS but `ggplot2` implemented this in R.
`ggvis` is still in its infancy and lacks many features but is already good enough for simple visualisation use cases.
```{r, eval = FALSE, message = FALSE, warning = FALSE}
library(ggvis)
imm_growth %>%
  ggvis(~background, ~value) %>%
  layer_boxplots()
imm_growth %>%
  ggvis(~time, ~percentage_change, fill = ~factor(background)) %>%
  layer_points()
imm_growth[complete.cases(imm_growth$percentage_change),] %>%
  group_by(time) %>%
  summarise(total_immigration = sum(value)) %>%
  ggvis(~total_immigration) %>%
  layer_densities(adjust = input_slider(.1, 2, value = 1, step = .1, label = "Bandwidth adjustment"))
```
-----
Make an interactive Rmd document
--------------------------------
```{r setup, echo = FALSE, message = FALSE}
library(knitr)
knit_hooks$set(wrapper = function(before, options, envir) {
if (before) {
sprintf(' ```{r %s}\n', options$params.src)
} else ' ```\n'
})
```
We can now use what we have learnt in combination with RMarkdown to create a dashboard. A basic RMarkdown code block is constructed like this:
```{r, eval = FALSE, echo = TRUE, message = FALSE, warning = FALSE, label = mylabel, wrapper = TRUE}
# code goes here...
```
The `wrapper = TRUE` option is set to display the chunk header and fence along with the code. To make our interactive dashboard we will combine what we have seen so far into one `Rmd` file.
The code to produce the dashboard is contained in the file in `rmeetupdemo/answers/dashboard1.Rmd`. Rewrite that into your own file.
We will use the [RStudio IDE](http://www.rstudio.com/products/RStudio/) to run this locally - [download](http://www.rstudio.com/products/rstudio/download/) it, open the IDE, browse to the `dashboard1.Rmd` file and click `Run Document`. This will serve the `Rmd` file with shiny from the IDE. It can also be viewed in the browser.
For production deployments one would use shiny server - an R websocket server. For more information you can have a look at the shiny-server [repo](https://github.com/rstudio/shiny-server) and its [documentation](http://www.rstudio.com/products/shiny/shiny-server/).
-----
Use devtools and roxygen2 to package our code (extra)
-----------------------------------------------------
[devtools](https://github.com/hadley/devtools) is a package for package development. I consider it best practice to use it for development, even though it is not strictly necessary. To see the intended outcome of this section you can look in `rmeetupdemo/answers/rworkshop_package`.
In the interactive R session, in the `rworkshop` directory do the following:
```{r, eval = FALSE, label = create_new_package}
library(devtools)
create("rworkshop_package")
```
Now create a file in the `R` directory named `immigration_manip.R`. This will hold our package code. Put this into the file:
```{r, eval = FALSE, label = R_package_code}
#' @import dplyr
#' @export
add_growth_rates <- function(df) {
  df %>%
    group_by(background, sex) %>%
    mutate(percentage_change = round((value - lag(value))/value*100, 2))
}
```
Let's step through the few lines of code from top to bottom. In this file we use annotated comments (roxygen2 tags) to produce documentation and to generate the necessary code in the `NAMESPACE` file. The `#' @import dplyr` tag says that we want to import `dplyr` when loading this code; this handles the dependency on `dplyr` by making sure it will be loaded when we load our own package. The `#' @export` tag on `add_growth_rates` says that we want this function to be available to package users. It is roughly equivalent to declaring a class method public. We then define the function itself.
Now let's build documentation and install this package into R locally:
```{r, eval = FALSE, label = R_package_install}
library(roxygen2)
roxygenise("rworkshop_package") # use your package name
install("rworkshop_package")
```
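For orientation, the two roxygen2 tags used in `immigration_manip.R` would be expected to end up in the generated `NAMESPACE` file as directives along these lines. This is a sketch of the expected result rather than a copy of an actual file; roxygen2 also writes a generated-by comment at the top, whose wording depends on the version.

```{r, eval = FALSE}
export(add_growth_rates)
import(dplyr)
```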
Check the contents of the `NAMESPACE` file - you should see an instruction to export the function. Next we can use the package with the `rmeetupdemo` package.
```{r, eval = FALSE, label = R_package_use}
library(rmeetupdemo)
library(rworkshop_package)
df <- clean_data(create_immigration_df())
add_growth_rates(df)
```
We should of course also test our function. To do that we will use the [testthat](https://github.com/hadley/testthat) package. First create the test directory in your package: `$ mkdir tests`. Now create a file called `test_growth.R`. Put this into the file:
```{r, eval = FALSE, label = R_package_testing}
library(testthat)
context("Growth calculations")
test_that("add_growth_rates calculates yearly growth correctly", {
  testdf <- data.frame(
    background = c("a", "a", "b", "b", "b", "b"),
    sex = c("male", "male", "female", "female", "male", "male"),
    value = c(10, 12, 5, 6, 3, 4),
    stringsAsFactors = FALSE)
  outdf <- add_growth_rates(testdf)
  correctdf <- data.frame(
    background = c("a", "a", "b", "b", "b", "b"),
    sex = c("male", "male", "female", "female", "male", "male"),
    value = c(10, 12, 5, 6, 3, 4),
    percentage_change = c(NA, 16.67, NA, 16.67, NA, 25.00),
    stringsAsFactors = FALSE)
  expect_equivalent(outdf, correctdf)
  # no numeric tolerance with identical expectations
  expect_identical(outdf$background, correctdf$background)
  expect_identical(outdf$sex, correctdf$sex)
  expect_identical(outdf$value, correctdf$value)
  expect_identical(outdf$percentage_change, correctdf$percentage_change)
})
```
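The comment about numeric tolerance in the test above matters for floating point results. As a small base R sketch (not part of the package code), compare an exact check with one that allows a small tolerance; the variable name is made up.

```{r, message = FALSE, warning = FALSE}
computed <- 0.1 + 0.2

# exact comparison: fails because of floating point representation
identical(computed, 0.3)

# comparison with a small tolerance: succeeds
isTRUE(all.equal(computed, 0.3))
```

Rounding the growth rates to two decimals, as `add_growth_rates` does, is one way to keep exact comparisons manageable, but a tolerance-based expectation is usually the safer default for computed numbers.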
Lastly, for the package to be valid, we need to replace the values in the `DESCRIPTION` file with the relevant information:
```{r, eval = FALSE, label = DESCRIPTION_file}
Package: rworkshop_package
Title: What the package does (short line)
Version: 0.1
Authors@R: "First Last <first.last@example.com> [aut, cre]"
Description: What the package does (paragraph)
Depends: R (>= 3.1.0)
License: What license is it under?
LazyData: true
```
We can now refactor the dashboard to use our package. This is, in my experience, a good design pattern: keep data manipulation code in one place, and the interactive visualization elsewhere. The refactored dashboard is in `rmeetupdemo/answers/dashboard2.Rmd`.
-----
Recommended further reading
---------------------------
The possible reading list is vast but I will try to give some pointers to things I consider both important and interesting.
* I actively work on [a data science wiki](https://github.com/leondutoit/data-centric-programming/wiki) where I show how to do practical work with R and keep a log of interesting articles and books
* [RStudio](http://www.rstudio.com/) - a leading company that produces a high quality R IDE and many excellent open source R packages
Performance
* [data.table](https://github.com/Rdatatable/data.table) - for highly performant in-memory tabular data manipulation and file reading; this package is very well tested, has excellent documentation and is under active development; it is a go-to if memory footprint and speed are real concerns
* [Rcpp](https://github.com/RcppCore/Rcpp/) - established package for R and C++ interop
* [Rcpp11](https://github.com/Rcpp11/Rcpp11) - R and C++ for the C++11 standard
* [RcppParallel](https://github.com/RcppCore/RcppParallel) - parallel programming with Rcpp
* [RcppEigen](https://github.com/RcppCore/RcppEigen) - Rcpp integration for the Eigen templated linear algebra library
* [RcppArmadillo](https://github.com/RcppCore/RcppArmadillo) - Rcpp integration for Armadillo templated linear algebra library
* [Rllvm](https://github.com/duncantl/Rllvm) - R interface to LLVM C++ API
* [pqR](https://github.com/radfordneal/pqR) - a fork of the GNU R interpreter with innovations for speed and memory efficiency
Databases
* [DBI](https://github.com/rstats-db/DBI) - A database interface (DBI) definition for communication between R and RDBMSs; DB specific packages build on this
* [RPostgreSQL](http://cran.r-project.org/web/packages/RPostgreSQL/index.html) - for PostgreSQL
* [RSQLite](https://github.com/rstats-db/RSQLite) - for SQLite
* [RMySQL](http://cran.r-project.org/web/packages/RMySQL/index.html) - for MySQL
Other good packages
* [ggplot2](https://github.com/hadley/ggplot2) - A grammar of graphics for R
* [shiny](https://github.com/rstudio/shiny) - Easy interactive web applications with R
* [rCharts](https://github.com/ramnathv/rCharts/) - Interactive JS Charts from R
* [pryr](https://github.com/hadley/pryr) - inspect R internals interactively
* [evaluate](https://github.com/hadley/evaluate) - A version of eval for R that returns more information about what happened
* [memoise](https://github.com/hadley/memoise) - memoisation
* [lazyeval](https://github.com/hadley/lazyeval) - lazy evaluation
* [caret](https://github.com/topepo/caret) - predictive modeling
* [knitr](https://github.com/yihui/knitr) - A general-purpose tool for dynamic report generation in R
* [tidyr](https://github.com/hadley/tidyr) - Easily tidy data
* [rlogging](https://github.com/mjkallen/rlogging) - simple logging
* [httr](https://github.com/hadley/httr) - easy HTTP
* [RCurl](http://cran.r-project.org/web/packages/RCurl/index.html) - curl from R (low-level)
* [jsonlite](https://github.com/jeroenooms/jsonlite) - A Robust, High Performance JSON Parser and Generator for R
* [opencpu](https://github.com/jeroenooms/opencpu) - OpenCPU system for embedded scientific computation and reproducible research (or a way to expose your R package through HTTP)
Books
* [Advanced R](http://adv-r.had.co.nz/) - _the_ book for learning R programming
* [R Packages](http://r-pkgs.had.co.nz/) - learn how to write R packages
* [ggplot2: Elegant Graphics for Data Analysis](http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403) - a good introduction to the ideas behind the grammar of graphics as implemented in R; a bit old now but still a very good read
* [Seamless R and C++ Integration](http://www.springer.com/statistics/computational+statistics/book/978-1-4614-6867-7) - Rcpp
Interesting projects
* [ropensci](http://ropensci.org/) - access to open scientific data from R
* [revolutionR](http://www.revolutionanalytics.com/) - enterprise version of R
* [renjin](http://www.renjin.org/) - R on the JVM