-
Notifications
You must be signed in to change notification settings - Fork 24
/
ggplot2.qmd
832 lines (573 loc) · 36.7 KB
/
ggplot2.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
---
output: html_document
editor_options:
chunk_output_type: console
---
# ggplot2 plotting environment
```{r echo=FALSE}
source("libs/Common.R")
options(width = 1000)
```
```{r echo = FALSE}
pkg_ver(c("dplyr", "ggplot2","forcats","scales"))
```
## Sample data
The data files used in this tutorial can be downloaded from the course's website as follows:
```{r download_data}
load(url("https://github.com/mgimond/ES218/blob/gh-pages/Data/dat1_2.RData?raw=true"))
```
This should load several data frame objects into your R session (note that not all are used in this exercise). The `dat1l` dataframe is a long table version of the crop yield dataset.
```{r}
head(dat1l, 3)
```
`dat1l2` adds `Country` to the `dat1l` dataframe.
```{r}
head(dat1l2, 3)
```
The `dat1w` dataframe is a wide table version of `dat1l`.
```{r}
head(dat1w, 3)
```
The `dat2` dataframe is a wide table representation of income by county and by various income and educational attainment levels. The first few lines and columns are shown next:
```{r}
dat2[1:3, 1:7]
```
`dat2c` is a long version of `dat2`
```{r}
head(dat2c, 3)
```
## The `ggplot2` package
The `ggplot2` package is designed around the idea that statistical graphics can be decomposed into a formal system of grammatical rules. The `ggplot2` learning curve is the steepest of all graphing environments encountered thus far, but once mastered it affords the greatest control over graphical design. For an up-to-date list of `ggplot2` functions, you may want to refer to ggplot2's [website](https://ggplot2.tidyverse.org/reference/).
A plot in `ggplot2` consists of different *layering* components, with the three primary components being:
+ The **dataset** that houses the data to be plotted;
+ The **aesthetics** which describe how data are to be mapped to the geometric elements (color, shape, size, etc..);
+ The **geometric** elements to use in the plot (i.e. points, lines, rectangles, etc...).
Additional (optional) layering components include:
+ **Statistical** elements such as smoothing, binning or transforming the variable
+ **Facets** for conditional or trellis plots
+ **Coordinate systems** for defining the plots shape (i.e. cartesian, polar, spatial map projections, etc...)
To access `ggplot2` functions, you will need to load its package:
```{r}
library(ggplot2)
```
From a grammatical perspective, a scientific graph is the conversion of *data* to **aesthetic** attributes and **geometric** objects. This is an important concept to grasp since it underlies the construction of all graphics in `ggplot2`.
For example, if we want to generate a point plot of crop yield as a function of year using the `dat1l` data frame, we type:
```{r fig.height=3, fig.width=5}
ggplot(dat1l , aes(x = Year, y = Yield)) + geom_point()
```
where the function, `ggplot()`, is passed the data frame name whose contents will be plotted; the `aes()` function is given data-to-geometry mapping instructions (`Year` is mapped to the x-axis and `Yield` is mapped to the y-axis); and `geom_line()` is the geometry type.
![](img/ggplot_1.png)
If we wanted to include a third variable such as crop type (`Crop`) to the map, we would need to map its aesthetics: here we'll map `Crop` to the `color` aesthetic..
```{r fig.height=3, fig.width=5}
ggplot(dat1l , aes(x = Year, y = Yield, color = Crop)) + geom_point()
```
The parameter `color` acts as a grouping parameter whereby the groups are assigned unique colors.
![](img/ggplot_2.png)
If we want to plot lines instead of points, simply substitute the geometry type with the `geom_line()` geometry.
```{r fig.height=3, fig.width=5}
ggplot(dat1l , aes(x = Year, y = Yield, color = Crop)) + geom_line()
```
Note that the aesthetics are still mapped in the same way with `Year` mapped to the x coordinate, `Yield` mapped to the y coordinate and `Crop` mapped to the geom's color.
Also, note that the parameters `x=` and `y=` can be omitted from the syntax reducing the line of code to:
```{r eval = FALSE}
ggplot(dat1l , aes(Year, Yield, color = Crop)) + geom_line()
```
## Geometries
Examples of a few available geometric elements follow.
### `geom_line`
`geom_line` generates line geometries. We'll use data from `dat1w` to generate a simple plot of oat yield as a function of year.
```{r fig.height=3, fig.width=5}
ggplot(dat1w, aes(x = Year, y = Oats)) + geom_line()
```
Parameters such as color and linetype can be passed directly to the `geom_line()` function:
```{r fig.height=3, fig.width=5}
ggplot(dat1w, aes(x = Year, y = Oats)) +
geom_line(linetype = 2, color = "blue", linewidth=0.4)
```
Note the difference in how `color=` is implemented here. It's no longer **mapping** a variable's levels to a range of colors as when it's called inside of the `aes()` function, instead, it's **setting** the line color to `blue`.
### `geom_point`
This generates point geometries. This is often used in generating scatterplots. For example, to plot male income (variable `B20004013`) vs female income (variable `B20004007`), type:
```{r fig.height=3, fig.width=4}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3)
```
We modify the point's transparency by passing the `alpha=0.3` parameter to the `geom_point` function. Other parameters that can be passed to point geoms include `color`, `pch` (point symbol type) and `cex` (point size as a fraction).
```{r fig.height=3, fig.width=4}
ggplot(dat2, aes(x = B20004013, y = B20004007)) +
geom_point(color = "red", pch=3 , alpha = 0.3, cex=0.6)
```
### `geom_hex`
When a bivariate scatter plot has too many overlapping points, it may be helpful to *bin* the observations into regular hexagons. This provides the number of observations per bin.
```{r fig.height=3, fig.width=4}
ggplot(dat2, aes(x = B20004013, y = B20004007)) +
geom_hex(binwidth = c(1000, 1000))
```
The `binwidth` argument defines the width and height of each bin in the variables' axes units.
### `geom_boxplot`
In the following example, a boxplot of `Yield` is generated for each crop type.
```{r fig.height=3, fig.width=4}
ggplot(dat1l, aes(x = Crop, y = Yield)) + geom_boxplot(fill = "bisque")
```
If we want to generate a single boxplot (for example for all yields irrespective of crop type) we need to pass a *dummy* variable to `x=`:
```{r fig.height=3, fig.width=4}
ggplot(dat1l, aes(x = "", y = Yield)) +
geom_boxplot(fill = "bisque") + xlab("All crops")
```
### `geom_violin`
A violin plot is a symmetrical version of a density plot which provides greater detail of a sample's distribution than a boxplot.
```{r fig.height=3, fig.width=4}
ggplot(dat1l, aes(x = "", y = Yield)) + geom_violin(fill = "bisque")
```
### `geom_histogram`
Histograms can only be plotted for single variables (unless faceting is used) as can be noted by the absence of a `y=` parameter in `aes()`:
```{r fig.height=3, fig.width=4}
ggplot(dat1w, aes(x = Oats)) + geom_histogram(fill = "grey50")
```
The bin widths can be specified in terms of the value's units. In our example, the unit is yield of oats (in Hg/Ha). So if we want to generate bin widths that cover 1000 Hg/Ha, we can type,
```{r fig.height=3, fig.width=4}
ggplot(dat1w, aes(x = Oats)) +
geom_histogram(fill = "grey50", binwidth = 1000)
```
If you want to control the number of bins, use the parameter `bins=` instead. For example, to set the number of bins to 8, modify the above code chunk as follows:
```{r fig.height=3, fig.width=4}
ggplot(dat1w, aes(x = Oats)) +
geom_histogram(fill = "grey50", bins = 8)
```
### `geom_bar`
Bar plots are used to summaries the counts of a categorical value. For example, to plot the number of counties in each state (noting that each record in `dat2` is assigned a county):
```{r fig.height=2.5, fig.width=8}
ggplot(dat2, aes(State)) + geom_bar()
```
To sort the bars by length we need to rearrange the `State` factor level order based on the number of counties in each state (which is the number of times a state appears in the data frame). We'll make use of `forcats`'s `fct_infreq` function to reorder the State factor levels based on frequency.
```{r fig.height=2.5, fig.width=8}
library(forcats)
ggplot(dat2, aes(fct_infreq(State,ordered = TRUE))) + geom_bar()
```
If we want to reverse the order (i.e. plot from smallest number of counties to greatest), wrap the `fct_infreq` function with `fct_rev`.
```{r fig.height=2.5, fig.width=8}
ggplot(dat2, aes(fct_rev(fct_infreq(State,ordered = TRUE)))) + geom_bar()
```
The `geom_bar` function can also be used with count values (i.e. variable already summarized by count). First, we'll summaries the number of counties by state using the `dplyr` package. This will generate a data frame with just 51 records: one for each of the 50 states and the District of Columbia.
```{r fig.height=2.5, fig.width=8}
library(dplyr)
dat2.ct <- dat2 %>% group_by(State) %>%
summarize(Counties = n())
head(dat2.ct)
```
When using summarized data, we must pass the parameter `stat="identity"` to the `geom_bar` function. We must also explicitly map the *x* and *y* axes geometries. To order the bar heights in ascending or decending order, we can make use of the generic `reorder` function. This function will be passed two parameters: the variable to be ordered (`State`), the variable whose values will determine the order (`Counties`). Note that this differs from the way the `reorder` function was used in [the base plotting](./base_plots.html) chapter where a third argument, `median`, was passed to the function due to there being more than one value per grouping variable.
```{r fig.height=2.5, fig.width=8}
ggplot(dat2.ct, aes(x = reorder(State, Counties), y = Counties)) +
geom_bar(stat = "identity")
```
If you want to reverse the order, simply add a minus sign, `-`, to the `Counties` variable.
```{r fig.height=2.5, fig.width=8}
ggplot(dat2.ct, aes(x = reorder(State, -Counties), y = Counties)) +
geom_bar(stat = "identity")
```
### dot plot
The dot plot is an alternative way to visualize counts as a function of a categorical variable. Instead of mapping `State` to the x-axis, we'll map it to the y-axis.
```{r fig.height=4.5, fig.width=4}
ggplot(dat2.ct , aes(x = Counties, y = State)) + geom_point()
```
Dot plot graphics benefit from sorting--more so then bar plots.
```{r fig.height=4.5, fig.width=4}
ggplot(dat2.ct , aes(x = Counties, y = reorder(State, Counties))) +
geom_point()
```
### Combining geometries
Geometries can be layered. For example, to overlay a linear regression line to the data we can add the `stat_smooth` layer:
```{r fig.height=3, fig.width=4}
ggplot(dat2, aes(x = B20004013, y = B20004007)) +
geom_point(alpha = 0.3) +
stat_smooth(method = "lm")
```
The `stat_smooth` can be used to fit other *lines* such as a loess:
```{r fig.height=3, fig.width=4}
ggplot(dat2, aes(x = B20004013, y = B20004007)) +
geom_point(alpha = 0.3) +
stat_smooth(method = "loess")
```
The confidence interval can be removed from the smooth geometry by specifying `se = FALSE`.
```{r fig.height=3, fig.width=4}
ggplot(dat2, aes(x = B20004013, y = B20004007)) +
geom_point(alpha = 0.3) +
stat_smooth(method = "loess", se = FALSE)
```
### Combining datasets
You can plot different datasets on the same ggplot canvas. This approach usually requires that each geom be assigned its own dataset and aesthetics. In the following example, we'll create two separate dataframes of grain data: one for barley and the other for oats. We'll then reference these datasets in separate calls to `geom_line`.
```{r fig.height=3, fig.width=4}
grain1 <- select(dat1w, Year, Barley)
grain2 <- select(dat1w, Year, Oats)
ggplot() +
geom_line(data = grain1, aes(Year, Barley), color = "blue", show.legend = TRUE) +
geom_line(data = grain2, aes(Year, Oats), color = "red", show.legend = TRUE)
```
You'll note that the legend is not automatically created since the aesthetics are not specified in the `ggplot()` function. To add a legend, we must first assign a character string to the `color` argument in `aes()`, then we reference that color aesthetic in the `scale_color_manual` function where we assign color to the aesthetic.
```{r fig.height=3, fig.width=4}
ggplot() +
geom_line(data = grain1, aes(Year, Barley, color = "Barley")) +
geom_line(data = grain2, aes(Year, Oats, color = "Oats")) +
scale_color_manual(name = "Grain", values = c("Barley" = "blue", "Oats" = "red"))
```
Note how we removed the `color="blue"/"red"` arguments from each call to `geom_line` and added the `color` aesthetic in the `aes()` functions.
## Tweaking a ggplot2 graph
### Plot title
You can add a plot title using the `ggtitle` function.
```{r fig.height=2.5, fig.width=8}
ggplot(dat2, aes(State)) + geom_bar() + ggtitle("Number of counties by state")
```
### Axes titles
Axes titles can be explicitly defined using the `xlab()` and `ylab()` functions.
```{r fig.height=3, fig.width=4}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
xlab("Female income ($)") + ylab("Male income ($)")
```
To remove axis labels, simply pass `NULL` to the functions as in `xlab(NULL)` and `ylab(NULL)`.
### Axes labels
You can customize an axis' label elements. If you are mapping continuous values along the x and y axes, use the `scale_x_continuous()` and `scale_y_continuous()` functions. For example, to specify where to place the tics and the accompanying labels, type:
```{r fig.height=3, fig.width=4}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
xlab("Female income ($)") + ylab("Male income ($)") +
scale_x_continuous(breaks = c(10000, 30000, 50000),
labels = c("$10,000", "$30,000", "$50,000"))
```
If you want to change the label formats whereby the numbers are truncated to a thousandth of their original value, you can make use of `unit_format()` from the `scales` package:
```{r fig.height=3, fig.width=4}
ggplot(dat2, aes(x=B20004013, y=B20004007)) + geom_point(alpha=0.3) +
xlab("Female income ($)") + ylab("Male income ($)") +
scale_x_continuous(labels=scales::unit_format(suffix="k",
scale=0.001,
sep="")) +
scale_y_continuous(labels=scales::unit_format(suffix="k",
scale=0.001,
sep=""))
```
The `scales` package also has a `comma_format()` function that will add commas to large numbers:
```{r fig.height=3, fig.width=4}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
xlab("Female income ($)") + ylab("Male income ($)") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::comma_format())
```
You can rotate axes labels using the `theme` function.
```{r fig.height=3, fig.width=4, warning=FALSE}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
xlab("Female income ($)") + ylab("Male income ($)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
The `hjust` argument justifies the values horizontally. Its value ranges from `0` to `1` where `0` is completely left-justified and `1` is completely right-justified. Note that the justification is *relative* to the text's orientation and *not* to the axis. So it may be best to first rotate the label values and then to adjust justification based on the plot's look as needed.
If you want the label values rotated 90° you might also need to justify vertically (relative to the text's orientation) using the `vjust` argument where `0` is completely top-justified and `1` is completely bottom-justified.
```{r fig.height=3, fig.width=4, warning=FALSE}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
xlab("Female income ($)") + ylab("Male income ($)") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0))
```
### Axes limits
The axis range can be set using `xlim()` and `ylim()`.
```{r fig.height=3, fig.width=4, warning=FALSE}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
xlab("Female income ($)") + ylab("Male income ($)") +
xlim(10000, 75000) + ylim(10000, 75000)
```
However, if you are calling the `scale_x_continuous()` and `scale_y_continuous()` functions, you do not want to use `xlim` and `ylim` instead, you should add the `limit=` argument to the aforementioned functions. For example,
```{r fig.height=3, fig.width=4, warning=FALSE}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
xlab("Female income ($)") + ylab("Male income ($)") +
scale_x_continuous(limit = c(10000, 75000),
labels = scales::comma_format()) +
scale_y_continuous(limit = c(10000, 75000),
labels = scales::comma_format())
```
### Axes breaks
You can explicitly define the breaks with the `breaks` argument. Continuing with the last example, we get:
```{r fig.height=3, fig.width=4, warning=FALSE}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
xlab("Female income ($)") + ylab("Male income ($)") +
scale_x_continuous(limit = c(10000, 75000),
labels = scales::comma_format(),
breaks = c(10000, 30000, 50000, 70000)) +
scale_y_continuous(limit = c(10000, 75000),
labels = scales::comma_format(),
breaks = c(10000, 30000, 50000, 70000))
```
Note that the `breaks` argument can be used in conjunction with other arguments (as shown in this example), or by itself.
### Axes and data transformations
If you wish to apply a non-linear transformation to either axes (while preserving the *untransformed* axis values) add the `coord_trans()` function as follows:
```{r fig.height=3, fig.width=4, warning=FALSE}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
xlab("Female income ($)") + ylab("Male income ($)") +
coord_trans(x = "log")
```
You can also transform the y-axis by specifying the parameter `y=`. The `log` transformation defaults to the natural log. For a log base 10, use `"log10"` instead. For a square root transformation, use `"sqrt"`. For the inverse use `"reciprocal"`.
Advanced transformations can be called via the `scales` package. For example, to implement the box-cox transformation (with a power of `-0.3`), type:
```{r fig.height=3, fig.width=4, warning=FALSE}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
xlab("Female income ($)") + ylab("Male income ($)") +
coord_trans(x = scales::boxcox_trans(-0.3))
```
Note that any statistical geom (such as the regression line) will be applied to the *un-transformed* data. So a linear model may end up looking non-linear after an axis transformation:
```{r fig.height=3, fig.width=4, warning=FALSE}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
stat_smooth(method = "lm", se = FALSE) +
xlab("Female income ($)") + ylab("Male income ($)") +
coord_trans(x = "log")
```
If a linear fit is to be applied to the transformed data, a better alternative is to transform the values instead of the axes. The transformation can be done on the original data or it can be implemented in ggplot using the `scale_x_continuous` and `scale_y_continuous` functions.
```{r fig.height=3, fig.width=4, warning=FALSE}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
stat_smooth(method = "lm", se = FALSE) +
xlab("Female income ($)") + ylab("Male income ($)") +
scale_x_continuous(trans = "log", breaks = seq(10000,60000,10000))
```
The `scale_x_continuous` and `scale_y_continuous` functions will accept `scales` transformation parameters--e.g. `scale_x_continuous(trans = scales::boxcox_trans(-0.3))`. Note that the parameter `breaks` is not required but is used here to highlight the transformed nature of the axis.
### Aspect ratio
You can impose an aspect ratio to your plot using the `coord_equal()` function. For example, to set the axes units equal (in length) to one another set `ratio=1`:
```{r fig.height=3, fig.width=5}
ggplot(dat2, aes(x = B20004013, y = B20004007)) + geom_point(alpha = 0.3) +
stat_smooth(method = "lm") +
xlab("Female income ($)") + ylab("Male income ($)") +
coord_equal(ratio = 1)
```
### Colors
You can customize geom colors using one of two sets of color schemes: one for continuous values, the other for categorical (discrete) values.
+------------------------+----------------------+
| Continuous | Categorical |
+========================+======================+
|`scale_color_gradient` | `scale_color_hue` |
|`scale_color_gradient2` | `scale_color_grey` |
|`scale_color_distiller` | `scale_color_manual` |
|`scale_fill_gradient2` | `scale_color_brewer` |
|`scale_fill_gradient` | |
|`scale_fill_distiller` | |
+------------------------+----------------------+
A few examples follow.
#### Continuous color schemes
The following chunk of code summarizes `dat2` by tallying the number of counties in each state and by computing the median county income values.
```{r}
dat2.ct2 <- dat2 %>% group_by(State) %>%
summarize(Counties = n(), Income = median(B20004001))
head(dat2.ct2)
```
The following chunk applies a green to red color gradient fill to each bar based on the median county incomes. Note that we are using the summarized count table (and not the original `dat2` table). Recall that when plotting bars from counts that are already tabulated we must specify `stat="identity"` in the `geom_bar` function.
```{r fig.height=2.5, fig.width=8}
ggplot(dat2.ct2, aes(x = fct_reorder(State, Counties), y = Counties, fill = Income)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "green", high = "red")
```
The following chunk applies a divergent color scheme while allowing one to specify the central value of this scheme. Note that the colors are symmetrical about the midpoint which may result in only a partial range of the full possible gradient of colors.
```{r fig.height=2.5, fig.width=8}
ggplot(dat2.ct2, aes(x = fct_reorder(State, Counties), y = Counties, fill = Income)) +
geom_bar(stat = "identity") +
scale_fill_gradient2(low = "darkred", mid = "white", high = "darkgreen",
midpoint = 30892)
```
In the last two code chunks, we *filled* the bars with colors (note the use of functions with the string `_fill_`). When assigning color to point or line symbols, use the function with the `_color_` string. For example:
```{r fig.height=4.5, fig.width=4}
ggplot(dat2.ct2, aes(y = fct_reorder(State, Counties), x = Counties, col = Income)) +
geom_point() +
scale_color_gradient2(low = "darkred", mid = "white", high = "darkgreen",
midpoint = 30892)
```
#### Discrete color schemes
In the following chunk, we assign colors manually to each level in the variable `Yield`. The order of the color names mirror the order of the variable levels.
```{r fig.height=2.5, fig.width=4}
ggplot(dat1l, aes(Year, Yield, col = Crop)) +
geom_line() +
scale_color_manual(values = c("red", "orange", "green", "blue", "yellow"))
```
The following chunk applies a predefined discrete color scheme using one of Brewer's preset qualitative colors, `Dark2`, to each level.
```{r fig.height=2.5, fig.width=4}
ggplot(dat1l, aes(Year, Yield, col = Crop)) +
geom_line() +
scale_color_brewer(palette = "Dark2")
```
You can also apply sequential or divergent Brewer color schemes to variables having an implied order.
Let's assume that there is an implied order to the crop types. For example, we'll reorder the crop types based on their median yield (this creates an ordered factor from the `Crop` variable). We can then use one of Brewer's sequential color schemes such as `Reds`.
```{r fig.height=2.5, fig.width=4}
ggplot(dat1l, aes(Year, Yield, col = reorder(Crop, Yield, median))) +
geom_line() +
guides(color = guide_legend(title = "Crops")) +
scale_color_brewer(palette = "Reds")
```
Note that we've added a `guides()` function to rename the legend title. This is not needed to generate the sequential colors.
To reverse the color scheme, set the `direction` argument to `-1`.
```{r fig.height=2.5, fig.width=4}
ggplot(dat1l, aes(Year, Yield, col = reorder(Crop, Yield, median))) +
geom_line() +
guides(color = guide_legend(title = "Crops")) +
scale_color_brewer(palette = "Reds", direction = -1)
```
You can view a list of predefined Brewer color schemes by typing the following:
```{r fig.height=8, fig.width =10}
RColorBrewer::display.brewer.all()
```
### Adding mathematical symbols to a plot
You can embed math symbols using `plotmath`'s mathematical expressions by wrapping these expressions in an `expression()` function. For example,
```{r fig.height=2.5, fig.width=6}
ggplot(dat2, aes(x = B20004013^0.333, y = sqrt(B20004007))) + geom_point(alpha = 0.3) +
xlab( expression( ("Female income") ^ frac(1,3) ) ) +
ylab( expression( sqrt("Male income") ) )
```
To view the full list of mathematical expressions, type `?plotmath` at a command prompt.
## Faceting
### Faceting by categorical variable
### `facet_wrap`
Faceting (or conditioning on a variable) can be implemented in `ggplot2` using the `facet_wrap()` function.
```{r fig.height=4, fig.width=6}
ggplot(dat1l2, aes(x = Year, y = Yield, color = Crop)) + geom_line() +
facet_wrap( ~ Country, nrow = 1)
```
The parameter ` ~ Country` tells ggplot to condition the plots on country. If we wanted the plots to be stacked, we would set `nrow` to `2`.
We can also condition the plots on two variables such as `crop` and `Country`. (Note that we will also rotate the x-axis labels to prevent overlaps).
```{r fig.height=3, fig.width=8}
ggplot(dat1l2, aes(x = Year, y=Yield)) + geom_line() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
facet_wrap(Crop ~ Country, nrow = 1)
```
#### Wrapping facet headers
If the header names get truncated in the plot header, you can opt to wrap the facet headers using the `label_wrap_gen()` function as an argument value to `labeler`. For example, to wrap the `United States of America` value, we'll specify the maximum number of characters per line using the `width` argument:
```{r fig.height=3, fig.width=8}
ggplot(dat1l2, aes(x = Year, y=Yield)) + geom_line() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
facet_wrap(Crop ~ Country, nrow = 1, labeller = label_wrap_gen(width = 12))
```
#### `facet_grid`
The above `facet_wrap` example generated unique combinations of the variables `Crop` and `Country`. But such plots are usually best represented in a grid structure where one variable is spread along one axis and the other variable is spread along another axis of the plot layout. This can be easily accomplished using the `facet_grid` function:
```{r fig.height=4, fig.width=6}
ggplot(dat1l2, aes(x = Year, y = Yield)) + geom_line() +
facet_grid( Crop ~ Country)
```
### Faceting by continuous variable
In the above examples, we are faceting the plots based on a categorical variable: `Country` and/or `crop`. But what if we want to facet the plots based on a continuous variable? For example, we might be interested in comparing male and female incomes across different female income ranges. This requires that a new categorical field (a factor) be created assigning to each case (row) an income group. We can use the `cut()` function to accomplish this task (we'll also omit all values greater than 100,000):
```{r }
dat2c$incrng <- cut(dat2c$F , breaks = c(0, 25000, 50000, 75000, 100000) )
head(dat2c)
```
In the above code chunk, we create a new variable, `incrng`, which is assigned an income category group depending on which range `dat2c$F` (female income) falls into. The income interval breaks are defined in `breaks=`. In the output, you will note that the factor `incrng` defines a range of incomes (e.g. `(0 , 2.5e+04]`) where the parenthesis `(` indicates that the left-most value is exclusive and the bracket `]` indicates that the right-most value is inclusive.
However, because we did not create categories that covered all income values in `dat2c$F` we ended up with a few `NA`'s in the `incrng` column:
```{r}
summary(dat2c$incrng)
```
We will remove all rows associated with missing `incrng` values:
```{r}
dat2c <- na.omit(dat2c)
summary(dat2c$incrng)
```
We can list all unique levels in our newly created factor using the `levels()` function.
```{r}
levels(dat2c$incrng)
```
The intervals are not meaningful displayed as is (particularly when scientific notation is adopted). So, we will assign more meaningful names to the factor levels as follows:
```{r}
levels(dat2c$incrng) <- c("Under 25k", "25k-50k", "50k-75k", "75k-100k")
head(dat2c)
```
Note that the order in which the names are passed must match that of the original breaks.
Now we can facet male vs. female scatter plots by income ranges. We will also throw in a best fit line to the plots.
```{r fig.height=3, fig.width=8, warning=FALSE}
ggplot(dat2c, aes(x = F, y = M)) + geom_point(alpha=0.2, pch=20) +
stat_smooth(method = "lm", col = "red") +
facet_grid( . ~ incrng)
```
One reason we would want to explore our data across different ranges of value is to assess the consistency in relationship between variables. In our example, this plot helps assess whether the relationship between male and female income is consistent across income groups.
## Adding 45° slope using `geom_abline`
In this next example, we will add a 45° line using `geom_abline` where the intercept will be set to `0` and the slope to 1. This will help visualize the discrepancy between the batches of values. So if a point lies above the 45° line, then the male's income is greater, if the point lies below the line, then the female's income is greater.
To help highlight differences in income, we will make a few changes to the faceted plots. First, we will reduce the y-axis range to $0-$150k (this will remove a few points from the data); we will force the x-axis and y-axis units to match so that a unit of $50k on the x-axis has the same length as that on the y-axis. We will also reduce the number of x tics and assign shorthand notation to income values (such as "50k" instead of "50000"). All this can be accomplished by adding the `scale_x_continuous()` function to the stack of ggplot elements.
```{r fig.height=3, fig.width=8, warning=FALSE}
ggplot(dat2c, aes(x = F, y = M)) + geom_point(alpha = 0.2, pch = 20, cex = 0.8) +
ylim(0, 150000) +
stat_smooth(method = "lm", col = "red") +
facet_grid( . ~ incrng) +
coord_equal(ratio = 1) +
geom_abline(intercept = 0, slope = 1, col = "grey50") +
scale_x_continuous(breaks = c(50000, 100000), labels = c("50k", "100k"))
```
Note the change in regression slope for the last facet. Note that the `stat_smooth` operation is only applied to the data limited to the axis range defined by `ylim`.
Now let's look at the same data but this time conditioned on educational attainment.
```{r fig.height=3, fig.width=10, warning=FALSE}
# Plot M vs F by educational attainment except for Level == All
ggplot(dat2c, aes(x = F, y = M)) + geom_point(alpha = 0.2, pch = 20, cex = 0.8) +
ylim(0, 150000) +
stat_smooth(method = "lm", col = "red") +
facet_grid( . ~ Level) +
coord_equal(ratio = 1) +
geom_abline(intercept = 0, slope = 1, col = "grey50") +
scale_x_continuous(breaks = c(50000, 100000), labels =c("50k", "100k"))
```
We can also condition the plots on two variables. For example: educational attainment and region.
```{r fig.height=8, fig.width=10, warning= FALSE}
ggplot(dat2c, aes(x = F, y = M)) + geom_point(alpha = 0.2, pch = 20, cex = 0.8) +
ylim(0, 150000) +
stat_smooth(method = "lm", col = "red") +
facet_grid( Region ~ Level) +
coord_equal(ratio = 1) +
geom_abline(intercept = 0, slope = 1, col = "grey50") +
scale_x_continuous(breaks = c(50000, 100000), labels = c("50k", "100k"))
```
## Tile plots (heat maps)
You can create so-called *heat maps* by tiling the data. This typically requires the use of three variables--two of which are either categorical or have equally spaced continuous values that define a rectangular grid layout, and the third that defines the grid cells' color. For example, a tile plot can be created showing the median income (for all sexes) as a function of education level and region.
We will first summarize our data to generate a three variable dataset.
```{r}
dat2c.med <- dat2c %>%
filter(Level != "All") %>%
group_by(Level, Region) %>%
summarise(Income = median(All))
head(dat2c.med)
```
Next, we will assign the `Region` variable to the x-axis, the `Level` variable to the y-axis and the `Income` values will be used to define the fill colors.
```{r fig.height=3, fig.width=4}
ggplot(dat2c.med, aes(x = Region, y = Level, fill = Income)) + geom_tile() +
scale_fill_gradient(low = "yellow", high = "red")
```
The above example adopts a continuous color scheme. If you want to bin the color swatches using user defined breaks, swap the `scale_fill_gradient` function with the `scale_fill_binned` function.
```{r fig.height=3, fig.width=4}
ggplot(dat2c.med, aes(x = Region, y = Level, fill = Income)) + geom_tile() +
scale_fill_binned(low = "yellow", high = "red",
breaks = c(21000, 28000, 32000, 42000, 53000))
```
As of `ggplot2` version `3.3`, you can use the `guide_colorsteps` function to control the *look* of your legend. In the last figure, the breaks are not even, yet the legend splits the color swatches into equal length units. Setting the `even.steps` argument to `FALSE` scales the color swatches to match the true interval lengths.
```{r fig.height=3, fig.width=4}
ggplot(dat2c.med, aes(x = Region, y = Level, fill = Income)) + geom_tile() +
scale_fill_binned(low = "yellow", high = "red",
breaks = c(21000, 28000, 32000, 42000, 53000),
guide = guide_colorsteps(even.steps = FALSE))
```
The `scale_fill_binned` function offers additional control over the legend such as its height (`barheight`), width (`barwidth`) and the display of the minimum and maximum values (`show.limits`).
```{r fig.height=3, fig.width=4}
ggplot(dat2c.med, aes(x = Region, y = Level, fill = Income)) + geom_tile() +
scale_fill_binned(low = "yellow", high = "red",
breaks = c(21000, 28000, 32000, 42000, 53000),
guide = guide_colorsteps(even.steps = FALSE,
barheight = unit(2.3, "in"),
barwidth = unit(0.1, "in"),
show.limits = TRUE))
```
## Exporting to an image
You can export a ggplot figure to an image using the `ggsave` function. For example,
```{r}
p1 <- ggplot(dat1l2, aes(x = Year, y = Yield, color = Crop)) + geom_line() +
facet_wrap( ~ Country, nrow = 1) +
scale_y_continuous(labels = scales::comma_format())
ggsave("fig0.png", plot = p1, width = 6, height = 2, units = "in", device = "png")
```
![](fig0.png)
The `width` and `height` arguments are defined in units of inches, `in`. You can also specify these parameters in units of centimeters by setting `units = "cm"`. The `device` argument controls the image file type. Other file types include `"jpeg"`, `"tiff"`, `"bmp"` and `"svg"` just to name a few.
For greater control of the font sizes, you need to make use of the `theme` function when buiding the plot.
```{r}
p1 <- ggplot(dat1l2, aes(x = Year, y = Yield, color = Crop)) + geom_line() +
facet_wrap( ~ Country, nrow = 1) +
scale_y_continuous(labels = scales::comma_format()) +
theme(axis.text = element_text(size = 8, family = "mono"),
axis.title = element_text(size = 11, face = "bold"),
strip.text = element_text(size = 11, face="italic", family = "serif"),
legend.title = element_text(size = 10, family = "sans"),
legend.text = element_text(size = 8, color = "grey40"))
ggsave("fig1.png", plot = p1, width = 6, height = 2, units = "in")
```
![](fig1.png)
The `family` argument controls the font type. It does not automatically access all the fonts in your operating system. The three R fonts accessible by default are `"serif"`, `"sans"` and `"mono"`. These are usually mapped to your system's fonts.
To access other fonts on your operating system, you will need to make use of the `showtext` package. The package is not covered in this tutorial, instead, refer to the package's [website](https://github.com/yixuan/showtext) for instructions on using the package.