---
title: "Data visualisation"
author: "Taavi Päll"
date: "2020-09-07"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
We start learning R with visualization: we learn how to make basic plots and how to map variables to point color, size, or shape.
## Graphs
Making a good statistical graph is difficult and usually takes many iterations, where we progressively improve different aspects of our plot, such as scale, transformation of a variable, addition of information, color palette, markers, etc.
Three guiding principles for good graphs:
1. Make the data stand out:
- fill data region, transform, choose an appropriate scale for an axis, avoid chart junk, avoid having graph elements interfering with data, deal with overplotting (jittering, transparency)
2. Facilitate comparison:
- what is the important comparison and how to emphasize it -- different groups, bring out points, add reference markers, juxtapose and superpose groups, color palettes and color blindness
3. Add information by labeling axes and using labels, annotating points/lines
## Sources
- This tutorial is based on the "Data visualization with ggplot2" chapter in [R4DS](http://r4ds.had.co.nz/data-visualisation.html) by G. Grolemund and H. Wickham.
- [lectures/graafilised-lahendused](https://rstats-tartu.github.io/lectures/graafilised-lahendused.html) and [learn-r/ggplot2](https://tpall.github.io/learn-r/#ggplot2) by Ülo Maiväli and Taavi Päll
## ggplot2
**ggplot2** is an R package for producing statistical graphics based on the grammar of graphics (hence the gg!).
Let's start by loading the tidyverse meta-package, which provides the set of packages needed to start with data analysis and visualisation. Importantly, it also loads the **ggplot2** library.
```{r}
library(tidyverse)
library(here)
```
As you can see, running `library(tidyverse)` loads the core tidyverse packages (eight at the time of writing) and warns that some of the functions just loaded (`filter()`, `lag()`) have the same names as functions that were already available.
These new functions mask the old ones. If you want to use a masked function, call it explicitly with the name of the package it comes from, for example `stats::filter()`.
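As a minimal sketch of what masking means in practice, the two `filter()` functions can be called side by side by prefixing each with its package. This assumes **dplyr** is installed; the built-in `mtcars` dataset and the object names are used here only for illustration.

```{r}
# dplyr::filter() keeps the rows of a data frame that match a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() is unrelated: it applies a linear filter to a numeric series,
# here a centred three-point moving average
moving_avg <- stats::filter(1:10, rep(1 / 3, 3))
moving_avg
```

Writing the package prefix works whether or not the package has been attached with `library()`, so it is a safe habit whenever a name clash is possible.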
If you get the error message "there is no package called 'tidyverse'", install the package and then run `library()` again:
```{r, eval=FALSE}
install.packages("tidyverse")
library(tidyverse)
```
## Data to visualise
Let's start with the *mpg* dataset built into **ggplot2**, which contains fuel economy data from 1999 and 2008 for 38 models of car:
```{r}
mpg # aka ggplot2::mpg
```
If you wonder where this dataset comes from, there is no magic: it is bundled with the ggplot2 package and becomes available every time ggplot2 is loaded. Unlike your own R objects, mpg is loaded invisibly, so it will not show up in your Environment panel.
For us, the key variables in the mpg dataset are:
- `displ` -- engine displacement (L)
- `hwy` -- highway miles per gallon
- `class` -- class/type of the car
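The following sketch shows that the dataset is not an object in your workspace, yet it can always be reached through the package namespace. It also pulls out just the three key variables. The object name `key_vars` is an assumption made for this example.

```{r}
library(ggplot2)

# mpg is not an object in your workspace (unless you created one yourself)...
"mpg" %in% ls(globalenv())

# ...but it can always be reached through the package namespace
key_vars <- ggplot2::mpg[, c("displ", "hwy", "class")]
summary(key_vars)
```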
```{r}
library(palmerpenguins)
penguins
```
## Creating a ggplot
A simple scatter plot to explore the relationship between highway fuel economy (hwy) and engine size (displ) is created like this.
Here we put displ on the x-axis and hwy on the y-axis:
```{r}
ggplot()
```
```{r}
ggplot(data = penguins) +
geom_point(mapping = aes(x = bill_length_mm, y = bill_depth_mm))
```
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
Here you can see a negative relationship between engine size and fuel efficiency: highway mileage drops as engine displacement grows. It's probably no news to anyone that cars with big engines consume more fuel.
## Composing a ggplot
+ **ggplot2 works iteratively** -- you start with a layer showing the raw data and then add layers of geoms, annotations, and statistical summaries.
To compose plots, you have to supply at a minimum:
+ **Data** that you want to visualize and
+ **aes**thetic **mappings** -- what's on the x-axis, what's on the y-axis, and how you want to group and color your data. Mapped arguments must be found in your data!
+ **Layers** made up of **geom**etric elements: points, lines, boxes, etc. These are what is shown on the plot.
Written as a ggplot2 code template, these three components look like this:
```
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
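As a hedged illustration of the template and of building a plot layer by layer, the sketch below fills in the placeholders with the mpg data. Unlike the template, it puts the mappings in `ggplot()` so that both layers share them. The object names and the choice of `geom_smooth()` as the statistical summary layer are assumptions made for this example.

```{r}
library(ggplot2)

# Start with the data and the shared aesthetic mappings
p_base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Add a layer showing the raw data
p_points <- p_base + geom_point()

# Add a second layer with a statistical summary (a smoothed trend line)
p_layers <- p_points + geom_smooth()

p_layers
```

Each `+` returns a new plot object, so intermediate steps such as `p_points` can be printed on their own to check the plot as it grows.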
### Exercises
1. Run `ggplot(data = mpg)`. What do you see?
```{r}
ggplot(data = mpg)
```
2. How many rows are in the mpg dataset? How many columns? Run `mpg` to find out:
```{r}
mpg
```
3. What does the drv variable describe? Read the help page with `?mpg` to find out.
```{r}
?mpg
?penguins
```
4. Make a scatterplot of *hwy* vs *cyl* using mpg data. What happens if you add some noise to the points? Use jitter:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = cyl, y = hwy), position = "jitter")
```
There is also `geom_jitter()`, a convenient shortcut for `geom_point(position = "jitter")`:
```{r}
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = cyl, y = hwy))
```
5. What happens when you make a scatterplot of *class* versus *drv* using mpg data?
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
```
Is such a plot useful? When could it be useful?
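One way to look into this question is to count how many cars sit behind each point. Both variables are categorical, so many observations land on exactly the same spot. The sketch below tabulates the combinations and then uses `geom_count()`, which sizes each point by the number of observations. The object name `combos` is an assumption made for this example.

```{r}
library(ggplot2)

# Count how many cars share each class/drv combination
combos <- as.data.frame(table(class = mpg$class, drv = mpg$drv))
combos <- combos[combos$Freq > 0, ]
combos

# geom_count() draws one point per combination, sized by the number of cars
ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
```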
## Aesthetic mappings
The hwy ~ displ scatterplot tells us that there is a roughly linear, negative relationship between engine size and fuel efficiency: bigger engines use more fuel and are therefore less efficient.
Nevertheless, if we look at the cars with huge engines (> 5 L), it's apparent that there are some outliers (plot below) that perform better than cars with engines of this size in general.
What are those cars? Do they have something in common?
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = (hwy > 20 & displ > 5), shape = hwy > 40 & displ < 2))
```
To get a clue about the nature of these outliers, we would like to add more of the information present in the mpg dataset to the plot. To add more variables to a 2D scatterplot, we can use additional aesthetic mappings.
> Aesthetics like color, shape, fill, and size can be used to add additional variables to a plot.
In addition to mapping displ to the x-axis and hwy to the y-axis, let's map the color of the points to the class variable (color = class) in the mpg dataset to reveal the class of each car:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
```
We can see that most of the cars with large engines and better fuel efficiency are sports cars (class `2seater`).
Exercise: make a scatterplot of body mass vs flipper length and assign color according to species:
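The same question can also be checked directly in the data. This minimal sketch lists the cars with large engines and above-average highway mileage, using the same thresholds as the highlighted plot above. The object name `outliers` is an assumption made for this example.

```{r}
library(ggplot2)

# Cars with large engines (> 5 L) that still exceed 20 highway miles per gallon
outliers <- subset(mpg, displ > 5 & hwy > 20,
                   select = c(manufacturer, model, displ, hwy, class))
outliers
```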
```{r}
# Install palmerpenguins only if it is missing
if (!requireNamespace("palmerpenguins", quietly = TRUE)) install.packages("palmerpenguins")
library(palmerpenguins)
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species))
```
Let's recreate the previous plot with class mapped to the size of each point:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
```
Ouch, we get a warning... it seems this is not a good idea. Why?
What's wrong with the next plot, where we have four categories ("tiny", "small", "big", "very big") of, let's say, diameters, and we want to map the size aesthetic to diameter?
Here we create a toy data frame with a categorical variable diameter for four x values; y is the same for all of them.
```{r}
diameter <- tibble(x = 1:4, y = 1, diameter = c("tiny", "small", "big", "very big")) # tibble() replaces the deprecated data_frame()
diameter
```
Now let's create a scatterplot y ~ x and map diameter to point **size**:
```{r}
ggplot(data = diameter) +
geom_point(mapping = aes(x = x, y = y, size = diameter))
```
Maybe this plot explains why it's generally not a good idea to map a categorical variable to the size aesthetic.
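Part of the problem is that a character variable has no inherent order, so its values are arranged alphabetically ("big", "small", "tiny", "very big") and the point sizes do not follow the meaning of the labels. If the categories really are ordered, one possible remedy is to store them as an ordered factor. The sketch below assumes that this is an acceptable way to encode the toy data. The names `size_levels` and `diameter_ordered` are hypothetical.

```{r}
library(ggplot2)
library(tibble)

# Declare the intended order of the categories explicitly
size_levels <- c("tiny", "small", "big", "very big")
diameter_ordered <- tibble(
  x = 1:4,
  y = 1,
  diameter = factor(size_levels, levels = size_levels, ordered = TRUE)
)

# Point size should now follow the declared order, not the alphabet
ggplot(data = diameter_ordered) +
  geom_point(mapping = aes(x = x, y = y, size = diameter))
```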
### Exercises
1. Update the code. Map the *alpha* aesthetic to class:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
```
```{r}
ggplot(data = penguins) +
geom_point(mapping = aes(x = flipper_length_mm, y = body_mass_g, alpha = sex, color = species))
```
2. Update the code. Map the *shape* aesthetic to class. Change the code so that there are 7 shapes instead of the default 6:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
scale_shape_manual(values = 1:7)
```
Is everything OK with shapes?
3. Using the starwars dataset built into **dplyr** (hint: `starwars`), make a scatterplot of mass vs height and map one continuous variable to size:
```{r}
ggplot(data = starwars) +
geom_point(mapping = aes(height, mass, size = birth_year))
```
To do it right, we can map the birth year to color instead and label the extreme outlier with `geom_label_repel()` from the **ggrepel** package, which keeps labels from overlapping the points.
```{r}
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
geom_point(mapping = aes(color = birth_year)) +
ggrepel::geom_label_repel(data = filter(starwars, mass > 1000), aes(label = name))
```
> When you set an aesthetic via mapping, ggplot automatically takes care of the rest: it finds the best scale to display the selected aesthetic and draws a legend. Note that this happens only when you map the aesthetic within the aes() function.
## Set aesthetic manually
You can also change the appearance of your plot manually, for example the color or shape of all the points.
Let's suppose you want to make all points blue in the hwy ~ displ scatterplot using mpg data:
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue") # try HEX code "#0000ff" or rgb(0,0,1)
```
Here, the color is not connected to a variable in your dataset; it just changes the appearance of the plot. For the same reason, a variable name given outside aes() is not looked up in the data:
```{r, eval=FALSE}
# This fails: outside aes(), "class" is read as a color name, not as the class variable
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "class")
```
Therefore, to change the appearance of the plot, you need to assign a value to the aesthetic **outside** aes() in the geom function.
You just have to pick a value that makes sense for that aesthetic:
- the name or code of a color as a character string ("blue", "#0000ff")
- the size of a point in mm
- the shape of a point as a number
Examples!
```{r}
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class), size = 10)
```
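To make the difference between mapping and setting concrete, the sketch below builds both versions of the plot and inspects the colors that ggplot2 computes for the points. It uses `layer_data()` from ggplot2, and the object names are assumptions made for this example.

```{r}
library(ggplot2)

# Mapped: color varies with the class variable, and a legend is drawn
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Set: every point gets the same fixed color, and no legend is drawn
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the values ggplot2 computed for each point
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_set)$colour)
```

The mapped plot yields one color per class, chosen by the color scale, while the manually set plot carries the single value that was passed in.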
### Shape codes
While colors and sizes are intuitive, it seems impossible to remember the available point shape codes in R. The quickest way out is to know how to generate an example plot of the first 25 shapes. The number next to each shape is its R shape code.
```{r}
ggplot(data = tibble(x = rep(1:5, 5), y = rep(5:1, each = 5), shape = 0:24)) +
geom_point(mapping = aes(x = x, y = y, shape = shape), fill = "green", color = "blue", size = 3) +
geom_text(mapping = aes(x = x, y = y, label = shape), hjust = 1.7) +
scale_shape_identity() +
theme(axis.text = element_blank(),
axis.title = element_blank())
```
Update the code:
- map more than 6 shapes to class,
- use fill = "green" and color = "blue" to change the appearance of ALL points, and
- adjust the size of ALL points (size = 3) for better visibility.
```{r}
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy, shape = class), fill = "green", color = "blue", size = 3) +
scale_shape_manual(values = c(16:19, 21:24))
```
Note the differences in how fill and color work on different point shapes! Which is the default point shape in ggplot?
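A smaller side-by-side comparison may make the pattern easier to see. In R, the solid shapes 15 to 20 are drawn entirely in `color`, while shapes 21 to 25 have a border in `color` and an interior in `fill`. The sketch below picks four shapes arbitrarily. The data frame `shape_demo` is a hypothetical name used only here.

```{r}
library(ggplot2)
library(tibble)

# Two solid shapes (16, 19) and two fillable shapes (21, 24)
shape_demo <- tibble(x = 1:4, y = 1, shape = c(16, 19, 21, 24))

# fill only affects shapes 21 and 24; color affects all four
ggplot(data = shape_demo) +
  geom_point(mapping = aes(x = x, y = y, shape = shape),
             fill = "green", color = "blue", size = 5) +
  scale_shape_identity()
```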
### Exercises
1. What's wrong with this plot? Why are the points not blue? Can you fix the code?
```{r}
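# Fixed version: color = "blue" is set outside aes(), so the points are drawn blue.
# Inside aes(), "blue" would be treated as a data value and mapped to a default color.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")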
```
2. Which variables in mpg are categorical? Which variables are continuous? (Type `?mpg` to read the documentation for the dataset.) How can you see this information when you run `mpg`?
```{r}
```