forked from OHI-Science/data-science-training
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathggplot2.Rmd
486 lines (333 loc) · 21.7 KB
/
ggplot2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
# Visualizing: `ggplot2` {#ggplot2}
Why do we start with data visualization? Not only is data viz a big part of analysis, it’s a way to SEE your progress as you learn to code.
> "ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places." - [R4DS](http://r4ds.had.co.nz/data-visualisation.html)
This lesson borrows heavily from Hadley Wickham's [R for Data Science book](http://r4ds.had.co.nz/data-visualisation.html), and the Data Carpentry [R for Ecology curriculum](http://www.datacarpentry.org/R-ecology-lesson/).
```{r viz ops, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = F, message = F)
library(htmltools)
```
## Objectives & Resources
### Objectives
- install our first package, `ggplot2`, by installing `tidyverse`
- learn ggplot2 with `mpg` dataframe (important to play with other data than your own, you'll learn something.)
- practice writing a script in RMarkdown
- practice the rstudio-github workflow
### Resources
Here are some additional resources for data visualization in R:
- [ggplot2-cheatsheet-2.1.pdf](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf)
- [Interactive Plots and Maps - Environmental Informatics](http://ucsb-bren.github.io/env-info/wk06_widgets.html)
- [Graphs with ggplot2 - Cookbook for R](http://www.cookbook-r.com/Graphs/#graphs-with-ggplot2)
- [ggplot2 Essentials - STHDA](http://www.sthda.com/english/wiki/ggplot2-essentials)
- ["Why I use ggplot2" - David Robinson Blog Post](http://varianceexplained.org/r/why-I-use-ggplot2/)
## Install our first package: `tidyverse`
Packages are bundles of functions, along with help pages and other goodies that make them easier for others to use, (ie. vignettes).
So far we've been using packages that are already included in *base R*. These can be considered *out-of-the-box* packages and include things such as `sum` and `mean`. You can also download and install packages created by the vast and growing R user community. The most traditional place to download packages is from [CRAN, the Comprehensive R Archive Network](https://cran.r-project.org/). This is where you went to download R originally, and will go again to look for updates. You can also install packages directly from GitHub, which we'll do tomorrow.
You don't need to go to CRAN's website to install packages, we can do it from within R with the command `install.packages("package-name-in-quotes")`.
We are going to be using the package `ggplot2`, which is actually bundled into a huge package called `tidyverse`. We will install `tidyverse` now, and use a few functions from the packages within. Also, check out [tidyverse.org/](https://www.tidyverse.org).
```{r, eval=FALSE, messages=FALSE}
## from CRAN:
install.packages("tidyverse") ## do this once only to install the package on your computer.
```
```{r}
library(tidyverse) ## do this every time you restart R and need it
```
When you do this, it will tell you which packages are inside of `tidyverse` that have also been installed. Note that there are a few name conflicts; it is alerting you that we'll be using two functions from dplyr instead of the built-in stats package.
What's the difference between `install.packages()` and `library()`? Why do you need both? Here's an analogy:
- `install.packages()` is setting up electricity for your house. Just need to do this once (let's ignore monthly bills).
- `library()` is turning on the lights. You only turn them on when you need them, otherwise it wouldn't be efficient. And when you quit R, it turns the lights off, but the electricity lines are still there. So when you come back, you'll have to turn them on again with `library()`, but you already have your electricity set up.
You can also install packages by going to the Packages tab in the bottom right pane. You can see the packages that you have installed (listed) and loaded (checkbox). You can also install packages using the install button, or check to see if any of your installed packages have updates available (update button). You can also click on the name of the package to see all the functions inside it — this is a super helpful feature that I use all the time.
## Plotting with **`ggplot2`**
**`ggplot2`** is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.
ggplot likes data in the 'long' format: i.e., a column for every dimension, and a row for every observation. Well structured data will save you lots of time when making figures with ggplot.
ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
<br>
![](img/rstudio-cheatsheet-ggplot.png)
<br>
## Data
We are going to use the `mpg` dataset which provides information on fuel economy data for 38 car models.
This data comes preloaded with the `tidyverse` so it is already loaded into R. Let's take a look at it.
```{r}
mpg
```
This dataframe is already in a *long* format where all rows are an observation and all columns are variables. Among the variables in `mpg` are:
1. `displ`, a car's engine size, in litres.
1. `hwy`, a car's fuel efficiency on the highway, in miles per gallon (mpg).
A car with a low fuel efficiency consumes more fuel than a car with a high
fuel efficiency when they travel the same distance.
To learn more about `mpg`, open its help page by running `?mpg`.
Now we're going to visualize the data that is held in this dataframe.
To build a ggplot, we need to:
- use the `ggplot()` function and bind the plot to a specific data frame using the
`data` argument
```{r, eval=FALSE, purl=FALSE}
ggplot(data = mpg)
```
- define aesthetics (`aes`), by selecting the variables to be plotted and the
variables to define the presentation such as plotting size, shape color, etc.
Run this code to put `displ` on the x-axis and `hwy` on the y-axis:
```{r, eval=FALSE, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy))
```
- add `geoms` -- graphical representation of the data in the plot (points,
lines, bars). **`ggplot2`** offers many different geoms; we will use some
common ones today, including:
* `geom_point()` for scatter plots, dot plots, etc.
* `geom_bar()` for bar charts
* `geom_line()` for trend lines, time-series, etc.
To add a geom to the plot use `+` operator. Because we have two continuous variables,
let's use `geom_point()` first:
```{r first-ggplot, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point()
```
The `+` in the **`ggplot2`** package is particularly useful because it allows you
to modify existing `ggplot` objects. This means you can easily set up plot
"templates" and conveniently explore different types of plots, so the above
plot can also be generated with code like this:
```{r, first-ggplot-with-plus, eval=FALSE, purl=FALSE}
# Assign plot to a variable
car_plot <- ggplot(data = mpg, aes(x = displ, y = hwy))
# Draw the plot
car_plot +
geom_point()
```
Notes:
- Anything you put in the `ggplot()` function can be seen by any geom layers
that you add (i.e., these are universal plot settings). This includes the x and
y axis you set up in `aes()`.
- You can also specify aesthetics for a given geom independently of the
aesthetics defined globally in the `ggplot()` function.
- The `+` sign used to add layers must be placed at the end of each line containing
a layer. If, instead, the `+` sign is added in the line before the other layer,
**`ggplot2`** will not add the new layer and will return an error message.
> **STOP: let's Commit, Pull and Push to GitHub**
## Building your plots iteratively
Building plots with ggplot is typically an iterative process. We start by
defining the dataset we'll use, lay the axes, and choose a geom:
```{r create-ggplot-object, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point()
```
Then, we start modifying this plot to extract more information from it. For
instance, we can add transparency (`alpha`) to avoid overplotting:
```{r adding-transparency, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(alpha = 0.4)
```
Or to color each class of car in the plot differently:
```{r color-by-species, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class))
```
In the above example, we mapped `class` to the color aesthetic, but we could have mapped `class` to the shape aesthetic in the same way. In this case, the shape of each point would reveal its class affiliation.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```
What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.
> ### Exercise
>
> Make a scatterplot of `hwy` vs `cty` with different size points representing each car class and different colors for each fuel type.
>
We get a _warning_ here, because mapping an unordered variable (`class`) to an ordered aesthetic (`size`) is not a good idea.
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy, size = class, color = fl))
```
We can also add colors for all the points. Here, the color doesn't convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes _outside_ of `aes()`. You'll need to pick a value that makes sense for that aesthetic:
* The name of a color as a character string.
* The size of a point in mm.
* The shape of a point as a number.
```{r adding-colors, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(alpha = 0.4, color = "blue")
```
> ### Exercise
> 1. What's gone wrong with this code?
>
```{r, eval=F}
ggplot(data = mpg) +
geom_point(aes(x = displ, y = hwy, color = "blue"))
```
>
>
> 2. Plot highway mpg (`hwy`) vs engine size (`displ`) and have the point color to indicate the city mpg (`cty`).
> 3. Now instead of color, use shape to indicate city mpg (`cty`). Why are these two aesthetics behaving differently?
> 4. What happens if you map an aesthetic to something other than a variable
name, like `aes(colour = displ < 5)`?
```{r, echo=F, eval=F}
ggplot(data = mpg, aes(x = displ, y = hwy, color = cty)) +
geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy, shape = cty)) +
geom_point()
ggplot(data = mpg, aes(x = displ, y = hwy, color = cty < 20)) +
geom_point()
```
>
> **STOP: commit, pull and push to github**
>
## Faceting
ggplot has a special technique called *faceting* that allows the user to split one plot into multiple plots based on a factor included in the dataset. We will use it to make a plot of the highway miles (`hwy`) a car gets by its engine size (`displ`) for each car manufacturer:
```{r first-facet, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ manufacturer)
```
We can now make the faceted plot by splitting further by class using `color` (within a single plot):
```{r facet-by-species-and-sex, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
facet_wrap(~ manufacturer)
```
Usually plots with white background look more readable when printed. We can set
the background to white using the function `theme_bw()`.
```{r facet-by-species-and-sex-white-bg, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
facet_wrap(~ manufacturer) +
theme_bw()
```
## **`ggplot2`** themes
In addition to `theme_bw()`, which changes the plot background to white, **`ggplot2`** comes with several other themes which can be useful to quickly change the look of your visualization.
The [ggthemes](https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html) package provides a wide variety of options (including an Excel 2003 theme). The [**`ggplot2`** extensions website](https://www.ggplot2-exts.org) provides a list of packages that extend the capabilities of **`ggplot2`**, including additional themes.
>
> ### Exercise
>
> Spend a couple minutes trying out your plot with different plot themes. The complete list of themes in `ggplot2` is available at <http://docs.ggplot2.org/current/ggtheme.html>. But for some more interesting themes, try installing the `ggthemes` package and using one of those themes. You can find more information on this package at <https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html>
>
## Geometric objects (geoms)
A __geom__ is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. You can use different geoms to plot the same data. To change the geom in your plot, change the geom function that you add to `ggplot()`. Let's look at a few ways of viewing the distribution of highway mpg (`hwy`) for each driving class (`drv`).
```{r}
ggplot(mpg, aes(x = drv, y = hwy)) +
geom_jitter()
ggplot(mpg, aes(x = drv, y = hwy)) +
geom_boxplot()
ggplot(mpg, aes(x = drv, y = hwy)) +
geom_violin()
```
`geom_smooth` allows you to view a smoothed mean of data. Here we look at the smooth mean of the highway mpg (`hwy`) by engine size (`displ`).
```{r}
ggplot(data = mpg) +
geom_smooth(aes(x = displ, y = hwy))
```
`ggplot2` provides over 30 geoms, and extension packages provide even more (see <https://www.ggplot2-exts.org> for a sampling). The best way to get a comprehensive overview is the [ggplot2 cheatsheet](http://rstudio.com/cheatsheets). To learn more about any single geom, use help: `?geom_smooth`.
To display multiple geoms in the same plot, add multiple geom functions to `ggplot()`:
```{r}
ggplot(data = mpg) +
geom_point(aes(x = displ, y = hwy)) +
geom_smooth(aes(x = displ, y = hwy))
```
Notice that this plot contains two geoms in the same graph!
This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display `cty` instead of `hwy`. You'd need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to `ggplot()`. ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code:
```{r, eval = FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
```
If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings _for that layer only_. This makes it possible to display different aesthetics in different layers.
```{r}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth()
```
## Customization
Take a look at the [**`ggplot2`** cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf), and think of ways you could improve the plot.
Now, let's change names of axes to something more informative than 'hwy' and 'displ' and add a title to the figure:
```{r number-species-year-with-right-labels, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
labs(title = "Relationship between engine size and miles per gallon (mpg)",
x = "Highway MPG",
y = "Engine displacement (liters)") +
theme_bw()
```
The axes have more informative names, but their readability can be improved by increasing the font size:
```{r number-species-year-with-right-labels-xfont-size, purl=FALSE}
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
labs(title = "Relationship between engine size and mpg",
x = "Highway MPG",
y = "Engine displacement (liters)") +
theme_bw() +
theme(text=element_text(size = 16))
```
> ### Challenge
> With all of this information in hand, please take another five minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own. Use the RStudio [**`ggplot2`** cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf) for inspiration.
> Here are some ideas:
> * See if you can change the thickness of the lines.
> * Can you find a way to change the name of the legend? What about its labels?
> * Try using a different color palette (see http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/).
## Bar charts
Next, let's take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with `geom_bar()`. The following chart displays the total number of cars in the `mpg` dataset, grouped by `fl` (fuel type).
```{r}
ggplot(data = mpg, aes(x = fl)) +
geom_bar()
```
On the x-axis, the chart displays `fl`, a variable from `mpg`. On the y-axis, it displays count, but count is not a variable in `mpg`! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:
* bar charts, histograms, and frequency polygons bin your data
and then plot bin counts, the number of points that fall in each bin.
* smoothers fit a model to your data and then plot predictions from the
model.
* boxplots compute a robust summary of the distribution and then display a
specially formatted box.
The algorithm used to calculate new values for a graph is called a __stat__, short for statistical transformation.
You can learn which stat a geom uses by inspecting the default value for the `stat` argument. For example, `?geom_bar` shows that the default value for `stat` is "count", which means that `geom_bar()` uses `stat_count()`. `stat_count()` is documented on the same page as `geom_bar()`, and if you scroll down you can find a section called "Computed variables". That describes how it computes two new variables: `count` and `prop`.
ggplot2 provides over 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. `?stat_bin`. To see a complete list of stats, try the ggplot2 cheatsheet.
### Position adjustments
There's one more piece of magic associated with bar charts. You can colour a bar chart using either the `color` aesthetic, or, more usefully, `fill`:
```{r}
ggplot(data = mpg, aes(x = fl, fill = fl)) +
geom_bar()
```
This isn't particularly useful since both the geom and the color are displaying the same information. Instead, map the fill aesthetic to another variable, like `class`: the bars are automatically stacked. Each colored rectangle represents a combination of `fl` and `class`.
```{r}
ggplot(data = mpg, aes(x = fl, fill = class)) +
geom_bar()
```
The stacking is performed automatically by the __position adjustment__ specified by the `position` argument. If you don't want a stacked bar chart, you can use `"dodge"` or `"fill"`.
* `position = "fill"` works like stacking, but makes each set of stacked bars
the same height. This makes it easier to compare proportions across
groups.
```{r}
ggplot(data = mpg, aes(x = fl, fill = class)) +
geom_bar(position = "fill")
```
* `position = "dodge"` places overlapping objects directly _beside_ one
another. This makes it easier to compare individual values.
```{r}
ggplot(data = mpg, aes(x = fl, fill = class)) +
geom_bar(position = "dodge")
```
> ### Challenge
> With all of this information in hand, please take another five minutes to either
> improve one of the plots generated in this exercise or create a beautiful graph
> of your own. Use the RStudio [**`ggplot2`** cheat sheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf) for
> inspiration. Remember to use the help documentation (e.g. `?geom_bar`)
> Here are some ideas:
> * Plot different variables using `geom_bar()`. For what variables is this geom useful?
> * Flip the x and y axes.
> * Use `scale_x_discrete` to change the x-axis tick labels to "CNG", "Diesel", "Ethanol", "Premium", and "Regular"
```{r}
ggplot(data = mpg) +
geom_bar(aes(x = fl, fill = class), position = "dodge") +
scale_x_discrete(labels=c("CNG", "Diesel", "Ethanol", "Premium", "Regular")) +
xlab("Fuel type")
```
## Arranging and exporting plots
After creating your plot, you can save it to a file in your favorite format. The Export tab in the **Plot** pane in RStudio will save your plots at low resolution, which will not be accepted by many journals and will not scale well for posters.
Instead, use the `ggsave()` function, which allows you easily change the dimension and resolution of your plot by adjusting the appropriate arguments (`width`, `height` and `dpi`):
```{r ggsave-example, eval=FALSE, purl=FALSE}
my_plot <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
labs(title = "Relationship between engine size and mpg",
x = "Highway MPG",
y = "Engine displacement (liters)") +
theme_bw() +
theme(text=element_text(size = 16))
ggsave("name_of_file.png", my_plot, width = 15, height = 10)
```
Note: The parameters `width` and `height` also determine the font size in the saved plot.
## Save and push to GitHub