forked from noaa-iea/r3-train
-
Notifications
You must be signed in to change notification settings - Fork 0
/
visualize.Rmd
367 lines (262 loc) · 11 KB
/
visualize.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
---
pagetitle: Visualize
output:
html_document:
pandoc_args: [
"--number-offset=2"]
---
```{r, include=FALSE}
knitr::opts_chunk$set(echo = T, warning = F, message = F)
if (!require(librarian)){
install.packages("librarian")
library(librarian)
}
shelf(
htmltools, mitchelloharawild/icons)
```
# Visualize
## Learning Objectives {.unnumbered .objectives}
1. **Read** data table.
1. Get paths using `here::here()`.
2. Use `readr::read_csv()` instead of `read.csv()`. Read CSV directly from URL.
1. **Plot** with `ggplot2` using **g**rammar of **g**raphics:
1. Starting with a simple line plot, feed the **data** as the first argument, set the **aesthetics** `aes(x = time, y = revenue)` and add geometry type of line `+ geom_line()`.
1. Add a smooth layer `+ geom_smooth()` for visualizing trend.
1. Plot a histogram `geom_histogram()`.
1. Plot a series of data using aesthetic of color `aes(color = region)`.
1. Update labels `+ labs()`
1. Generate multiple plots based on a variable with `+ facet_wrap()`.
1. Show variation with a box plot `+ geom_boxplot()`.
1. Show variation with a violin plot `+ geom_violin()`.
1. Change the `theme()`, eg with `theme_classic()`.
1. Create **interactive** online plots using htmlwidgets R libraries:
1. `plotly::ggplotly()` to convert existing ggplot object to interactive plotly visualization.
1. `dygraphs` library for time series plots.
## Read Data
Open your `r3-exercises.Rproj` to launch RStudio into that project and set the working directory.
Create a new Rmarkdown file (RStudio menu File > New file > Rmarkdown...) called `visualize.Rmd`. Insert headers like last time followed by Chunks of R code according to the examples provided below.
I'll be copy/pasting during the demonstration but I encourage you to type out the text to enhance understanding.
Picking up with the table we downloaded last time ([2.1.4 Read table `read.csv()`]( https://noaa-iea.github.io/r3-train/manipulate.html)), let's read the data directly from the URL and use readr's `read_csv()`:
```{r}
# libraries
library(here)
library(readr)
library(DT)
# variables
url_ac <- "https://oceanview.pfeg.noaa.gov/erddap/tabledap/cciea_AC.csv"
# if ERDDAP server down (Error in download.file) with URL above, use this:
# url_ac <- "https://raw.githubusercontent.com/noaa-iea/r3-train/master/data/cciea_AC.csv"
csv_ac <- here("data/cciea_AC.csv")
# download data
if (!file.exists(csv_ac))
download.file(url_ac, csv_ac)
# read data
d_ac <- read_csv(csv_ac, col_names = F, skip = 2)
names(d_ac) <- names(read_csv(csv_ac))
# show data
datatable(d_ac)
```
Note the use of functions in libraries `here` and `readr` that you may need to install from the **Packages** pane in RStudio.
There [`here::here()`](https://here.r-lib.org/) function starts the path based on looking for the `*.Rproj` file in the current working directory or higher level folder. In this case it should be the same folder as your current working directory so seems unnecessary, but it's good practice for other situations in which you start running Rmarkdown files stored in subfolders (in which case the evaluating R Chunks assume the working directory of the `.Rmd`).
I prefer `readr::read_csv()` over `read.csv()` since columns of `character` type are not converted to type `factor` by default. It will also default to being read in as a `tibble` rather than just a `data.frame`.
## Plot statically with `ggplot2`
### Simple line plot `+ geom_line()`
Let's start with a simple line plot of `total_fisheries_revenue_coastwide` (y axis) over `time` (x axis) using the _**grammar of graphics**_ principles by:
1. Feed the **data** as the first argument to `ggplot()`.
1. Set up the **aesthetics** `aes()` as the second argument for specifying the dimensions of the plot (`x` and `y`).
1. Add (`+`) the **geometry**, or plot type.
From the [Data Visualization with ggplot2 Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf) (RStudio menu Help > Cheat Sheets), we have these aesthetics to plot based on the value being continuous
`r img(src='figs/visualize/gg_line.png', width=400)`
```{r}
library(dplyr)
library(ggplot2)
# subset data
d_coast <- d_ac %>%
# select columns
select(time, total_fisheries_revenue_coastwide) %>%
# filter rows
filter(!is.na(total_fisheries_revenue_coastwide))
datatable(d_coast)
# ggplot object
p_coast <- d_coast %>%
# setup aesthetics
ggplot(aes(x = time, y = total_fisheries_revenue_coastwide)) +
# add geometry
geom_line()
# show plot
p_coast
```
### Trend line `+ geom_smooth()`
Add a smooth layer based on a linear model (`method = "lm"`).
`r img(src='figs/visualize/gg_smooth.png', width=400)`
```{r}
p_coast +
geom_smooth(method = "lm")
```
Try changing the `method` argument by looking at the help documentation `?geom_smooth`.
### Distribution of values `+ geom_histogram()`
What if you want to look at a distribution of the values? For instance, you might simulate future revenues by drawing from this distribution, in which case you would want to use `geom_histogram()`.
`r img(src='figs/visualize/gg_hist.png', width=400)`
```{r}
d_coast %>%
# setup aesthetics
ggplot(aes(x = total_fisheries_revenue_coastwide)) +
# add geometry
geom_histogram()
```
Try changing the `binwidth` parameter.
### Series line plot `aes(color = region)`
Next, let's also show the other regional values (`CA`, `OR` and `WA`; not `coastwide`) in the plot as a series with different colors. To do this, we'll want to **tidy** the data into _long_ format so we can have a column for `total_fisheries_revenue` and another `region` column to supply as the `group` and `color` aesthetics based on aesthetics we see are available for `geom_line()`:
`r img(src='figs/visualize/gg_line.png', width=400)`
```{r}
library(stringr)
library(tidyr)
d_rgn <- d_ac %>%
# select columns
select(
time,
starts_with("total_fisheries_revenue")) %>%
# exclude column
select(-total_fisheries_revenue_coastwide) %>%
# pivot longer
pivot_longer(-time) %>%
# mutate region by stripping other
mutate(
region = name %>%
str_replace("total_fisheries_revenue_", "") %>%
str_to_upper()) %>%
# filter for not NA
filter(!is.na(value)) %>%
# select columns
select(time, region, value)
# create plot object
p_rgn <- ggplot(
d_rgn,
# aesthetics
aes(
x = time,
y = value,
group = region,
color = region)) +
# geometry
geom_line()
# show plot
p_rgn
```
### Update labels `+ labs()`
Next, let's update the labels for the title, x and y axes, and the color legend:
```{r}
p_rgn <- p_rgn +
labs(
title = "Fisheries Revenue",
x = "Year",
y = "Millions $ (year 2015)",
color = "Region")
p_rgn
```
### Multiple plots with `facet_wrap()`
When you want to look at similar data one variable at a time, you can use `facet_wrap()` to display based on this variable.
`r img(src='figs/visualize/gg_facet.png', width=400)`
```{r}
p_rgn +
facet_wrap(vars(region))
```
The example above is not a very good one since you'd typically show facets based on a variable not already plotted.
### Bar plot `+ geom_col()`
Another common visualization is a bar plot. How many variables does `geom_bar()` use versus `geom_col()`?
`r img(src='figs/visualize/gg_bar.png', width=400)`
`r img(src='figs/visualize/gg_col.png', width=400)`
```{r}
library(glue)
library(lubridate)
yr_max <- year(max(d_rgn$time))
d_rgn %>%
# filter by most recent time
filter(year(time) == yr_max) %>%
# setup aesthetics
ggplot(aes(x = region, y = value, fill = region)) +
# add geometry
geom_col() +
# add labels
labs(
title = glue("Fisheries Revenue for {yr_max}"),
x = "Region",
y = "Millions $ (year 2015)",
fill = "Region")
```
Try using `color` instead of `fill` within the aesthetic `aes()`. What's the difference?
### Variation of series with `+ geom_boxplot()`
```{r}
d_rgn %>%
# setup aesthetics
ggplot(aes(x = region, y = value, fill = region)) +
# add geometry
geom_boxplot() +
# add labels
labs(
title = "Fisheries Revenue Variability",
x = "Region",
y = "Millions $ (year 2015)") +
# drop legend since redundant with x axis
theme(
legend.position = "none")
```
### Variation of series with `+ geom_violin()`
```{r}
p_rgn_violin <- d_rgn %>%
# setup aesthetics
ggplot(aes(x = region, y = value, fill = region)) +
# add geometry
geom_violin() +
# add labels
labs(
title = "Fisheries Revenue Variability",
x = "Region",
y = "Millions $ (year 2015)") +
# drop legend since redundant with x axis
theme(
legend.position = "none")
p_rgn_violin
```
### Change Theme `theme()`
We've already manipulated the `theme()` in dropping the legend. You can create your own theme or use some of the existing.
```{r}
p_rgn_violin +
theme_classic()
```
## Plot interactively with `plotly` or `dygraphs`
### Make ggplot interactive with `plotly::ggplotly()`
When rendering to HTML, you can render most `ggplot` objects interactively with [`plotly::ggplotly()`](https://plotly.com/ggplot2). The `plotly` library is an R [htmlwidget](http://www.htmlwidgets.org) providing simple R functions to render interactive JavaScript visualizations.
```{r}
plotly::ggplotly(p_rgn)
```
**Interactivity**. Notice how now you can see a tooltip on hover of the data for any point of data. You can also use plotly's toolbar to zoom in/out, turn any series on/off by clicking on item in legend, and download a png.
### Create interactive time series with `dygraphs::dygraph()`
Another htmlwidget plotting library written more specifically for time series data is [`dygraphs`](https://rstudio.github.io/dygraphs). Unlike the ggplot2 data input, a series is expected in _wide_ (not tidy _long_) format. So we use tidyr's `pivot_wider()` first.
```{r}
library(dygraphs)
d_rgn_wide <- d_rgn %>%
mutate(
Year = year(time)) %>%
select(Year, region, value) %>%
pivot_wider(
names_from = region,
values_from = value)
datatable(d_rgn_wide)
d_rgn_wide %>%
dygraph() %>%
dyRangeSelector()
```
## Further Reading {-}
Introductory `ggplot2` topics not yet covered above are:
1. Other plot types: scatter, area, polar, ....
1. Changing scales of axes, color, shape and size with `scale_*()` functions.
1. Transforming coordinate system, eg `coord_flip()` to swap x and y axes for different orientation.
1. Adding text annotations.
1. Changing margins.
1. Summarization methods with `stat_*()` functions.
Here are further resources:
- [Learning ggplot2 | ggplot2](https://ggplot2.tidyverse.org/#learning-ggplot2)
- [ggplot2: Elegant Graphics for Data Analysis](https://ggplot2-book.org/index.html): online book by Hadley Wickham
- [3. Data visualisation | R for Data Science](https://r4ds.had.co.nz/data-visualisation.html?q=ggplo#data-visualisation): chapter from online book by Hadley Wickham and Garrett Grolemund
- [R Graphics Cookbook, 2nd edition](https://r-graphics.org/)