---
title: 'Weekly Exercises #3'
author: "Ana Espeleta"
output: 
  html_document:
    keep_md: TRUE
    toc: TRUE
    toc_float: TRUE
    df_print: paged
    code_download: true
---
```{r setup, include=FALSE}
#knitr::opts_chunk$set(echo = TRUE, error=TRUE, message=FALSE, warning=FALSE)
```
```{r libraries}
library(tidyverse) # for graphing and data cleaning
library(gardenR) # for Lisa's garden data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
theme_set(theme_minimal()) # My favorite ggplot() theme :)
```
```{r data}
# Lisa's garden data
data("garden_harvest")
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday dog breed data
breed_traits <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv')
trait_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/trait_description.csv')
breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv')
# Tidy Tuesday data for challenge problem
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
```
## Setting up on GitHub!
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to R Studio. To do that, you should read the instructions (through the "Cloning a repo" section) and watch the video [here](https://github.com/llendway/github_for_collaboration/blob/master/github_for_collaboration.md). Then, do the following (if you get stuck on a step, don't worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
* Create a repository on GitHub, giving it a nice name so you know it is for the 3rd weekly exercise assignment (follow the instructions in the document/video).
* Copy the repo name so you can clone it to your computer. In R Studio, go to file --> New project --> Version control --> Git and follow the instructions from the document/video.
* Download the code from this document and save it in the repository folder/project on your computer.
* In R Studio, you should then see the .Rmd file in the upper right corner in the Git tab (along with the .Rproj file and probably .gitignore).
* Check all the boxes of the files in the Git tab and choose commit.
* In the commit window, write a commit message, something like "Initial upload" would be appropriate, and commit the files.
* Either click the green up arrow in the commit window or close the commit window and click the green up arrow in the Git tab to push your changes to GitHub.
* Refresh your GitHub page (online) and make sure the new documents have been pushed out.
* Back in R Studio, knit the .Rmd file. When you do that, you should have two (as long as you didn't make any changes to the .Rmd file, in which case you might have three) files show up in the Git tab - an .html file and an .md file. The .md file is something we haven't seen before and is here because I included `keep_md: TRUE` in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step in creating the html file. Markdown files are displayed fairly nicely on GitHub, so we want to keep the .md file and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).
* As you work through your homework, save and commit often, push changes occasionally (maybe after you feel finished with an exercise?), and go check to see what the .md file looks like on GitHub.
* If you have issues, let me know! This is new to many of you and may not be intuitive at first. But, I promise, you'll get the hang of it!
## Instructions
* Put your name at the top of the document.
* **For ALL graphs, you should include appropriate labels.**
* Feel free to change the default theme, which I currently have set to `theme_minimal()`.
* Use good coding practice. Read the short sections on good code with [pipes](https://style.tidyverse.org/pipes.html) and [ggplot2](https://style.tidyverse.org/ggplot2.html). **This is part of your grade!**
* When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don't do it before then, or else you might miss some important warnings and messages.
## Warm-up exercises with garden data
These exercises will reiterate what you learned in the "Expanding the data wrangling toolkit" tutorial. If you haven't gone through the tutorial yet, you should do that first.
1. Summarize the `garden_harvest` data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the `wday()` function from `lubridate`). Display the results so that the vegetables are rows but the days of the week are columns.
```{r}
garden_harvest %>% 
  mutate(day_of_week = wday(date, label = TRUE)) %>% 
  group_by(vegetable, day_of_week) %>% 
  mutate(weight_lbs = weight * 0.00220462) %>% 
  summarise(tot_harvest_wtlbs = sum(weight_lbs)) %>% 
  pivot_wider(names_from = day_of_week,
              values_from = tot_harvest_wtlbs)
```
2. Summarize the `garden_harvest` data to find the total harvest in pounds for each vegetable variety and then try adding the plot from the `garden_planting` table. This will not turn out perfectly. What is the problem? How might you fix it?
> Certain varieties were planted in more than one plot, but there is only one harvest total per variety, so the join repeats that total once for each plot and would overestimate the harvest if we summed across the joined rows. We could fix this by collapsing `garden_planting` to one row per variety, with a single variable holding the list of plots each variety was planted in, so that each harvest total maps to exactly one row (a sketch of this fix follows the chunk below).
```{r}
garden_harvest %>% 
  group_by(vegetable, variety) %>% 
  mutate(weight_lbs = weight * 0.00220462) %>% 
  summarise(tot_hvs_lbs = sum(weight_lbs)) %>% 
  left_join(garden_planting,
            by = c("vegetable", "variety"))
```
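A minimal sketch of the fix described above (one workable approach, not necessarily the only one): collapse `garden_planting` to one row per vegetable variety before joining, with the plots stored as a single comma-separated string.
```{r}
garden_planting %>% 
  group_by(vegetable, variety) %>% 
  # collapse all plots for a variety into one string so the join
  # can no longer duplicate harvest totals
  summarise(plots = paste(unique(plot), collapse = ", ")) %>% 
  right_join(garden_harvest %>% 
               group_by(vegetable, variety) %>% 
               summarise(tot_hvs_lbs = sum(weight * 0.00220462)),
             by = c("vegetable", "variety"))
```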
3. I would like to understand how much money I "saved" by gardening, for each vegetable type. Describe how I could use the `garden_harvest` and `garden_spending` datasets, along with data from somewhere like [this](https://products.wholefoodsmarket.com/search?sort=relevance&store=10542) to answer this question. You can answer this in words, referencing various join functions. You don't need R code but could provide some if it's helpful.
> (1) Group each dataset by vegetable. (2) In the `garden_spending` dataset, summarize the total cost of seeds/supplies for each vegetable. (3) Left join the store price data and the spending summary by vegetable. (4) Multiply the in-store price per pound by the number of pounds harvested (summarized from `garden_harvest`) to get the total price of buying each vegetable type at the store. (5) Compute the difference between the total cost of buying the produce and the total cost of the seeds for each vegetable to get the money saved by gardening.
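If it helps, here is a hedged sketch of that approach in code. The `store_prices` table and its `price_per_lb` column are hypothetical -- they would have to be typed in by hand from the Whole Foods listings -- and the seed cost is assumed to come from the `price_with_tax` column of `garden_spending`, so the chunk is set to `eval=FALSE`.
```{r, eval=FALSE}
# Hypothetical store price table, built by hand from the website, e.g.:
# store_prices <- tibble(vegetable = c("tomatoes", "zucchini"),
#                        price_per_lb = c(3.49, 2.99))
garden_harvest %>% 
  group_by(vegetable) %>% 
  summarise(tot_lbs = sum(weight * 0.00220462)) %>% 
  left_join(store_prices, by = "vegetable") %>%           # hypothetical price data
  mutate(store_value = tot_lbs * price_per_lb) %>%        # cost to buy the harvest
  left_join(garden_spending %>% 
              group_by(vegetable) %>% 
              summarise(tot_spent = sum(price_with_tax)), # seed/plant spending
            by = "vegetable") %>% 
  mutate(savings = store_value - tot_spent)
```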
4. Subset the data to tomatoes. Reorder the tomato varieties from smallest to largest first harvest date. Create a barplot of total harvest in pounds for each variety, in the new order. CHALLENGE: add the date near the end of the bar. (This is probably not a super useful graph because it's difficult to read. This is more an exercise in using some of the functions you just learned.)
```{r}
garden_harvest %>% 
  filter(vegetable == "tomatoes") %>% 
  mutate(weight_lbs = weight * 0.00220462) %>% 
  group_by(variety) %>% 
  summarise(first_harvest_date = min(date),
            tot_harvest = sum(weight_lbs)) %>% 
  ggplot(aes(x = tot_harvest, y = fct_reorder(variety, first_harvest_date))) +
  geom_col() +
  labs(title = "Total Tomato Harvest by First Harvest Date", 
       y = "Tomato Variety", 
       x = "Total Harvest (lbs)")
```
5. In the `garden_harvest` data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use `str_to_lower()`, `str_length()`, and `distinct()`.
```{r}
garden_harvest %>% 
  mutate(lower_name = str_to_lower(variety),
         name_length = str_length(variety)) %>% 
  distinct(vegetable, variety, .keep_all = TRUE) %>% 
  arrange(vegetable, name_length)
```
6. In the `garden_harvest` data, find all distinct vegetable varieties that have "er" or "ar" in their name. HINT: `str_detect()` with an "or" statement (use the | for "or") and `distinct()`.
```{r}
garden_harvest %>% 
  distinct(variety, .keep_all = TRUE) %>% 
  mutate(has_string = str_detect(variety, "er|ar")) %>% 
  filter(has_string == TRUE)
```
## Bicycle-Use Patterns
In this activity, you'll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
<center>
![A typical Capital Bikeshare station. This one is at Florida and California, next to Pleasant Pops.](https://www.macalester.edu/~dshuman1/data/112/bike_station.jpg){width="30%"}
![One of the vans used to redistribute bicycles to different stations.](https://www.macalester.edu/~dshuman1/data/112/bike_van.jpg){width="30%"}
</center>
Two data tables are available:
- `Trips` contains records of individual rentals
- `Stations` gives the locations of the bike rental stations
Here is the code to read in the data. We do this a little differently than usual, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with `{r cache = TRUE}` rather than the usual `{r}`.
```{r cache=TRUE}
data_site <- 
  "https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations <- read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
```
**NOTE:** The `Trips` data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. **When you have this working well, you should access the full data set of more than 600,000 events by removing `-Small` from the name of the `data_site`.**
### Temporal patterns
It's natural to expect that bikes are rented more at some times of day, on some days of the week, and in some months of the year than others. The variable `sdate` gives the time (including the date) that the rental started. Make the following plots and interpret them:
7. A density plot, which is a smoothed out histogram, of the events versus `sdate`. Use `geom_density()`.
```{r}
Trips %>% 
  ggplot(aes(x = sdate)) +
  geom_density() +
  labs(title = "Density of Bike Rentals by Rental Start Date", x = "", y = "")
```
8. A density plot of the events versus time of day. You can use `mutate()` with `lubridate`'s `hour()` and `minute()` functions to extract the hour of the day and minute within the hour from `sdate`. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.
```{r}
Trips %>% 
  mutate(rent_hour = hour(sdate),
         rent_min = minute(sdate),
         rent_time = rent_hour + (rent_min / 60)) %>% 
  ggplot(aes(x = rent_time)) +
  geom_density() +
  labs(title = "Density of Bike Rentals by Rental Hour", x = "Time (hour)", y = "")
```
9. A bar graph of the events versus day of the week. Put day on the y-axis.
```{r}
Trips %>% 
  mutate(week_day = wday(sdate, label = TRUE)) %>% 
  ggplot(aes(y = week_day)) +
  geom_bar() +
  labs(title = "Total Bike Rentals by Day of the Week", y = "", x = "")
```
10. Facet your graph from exercise 8 by day of the week. Is there a pattern?
> On weekdays, the density of rentals is much higher during commuting hours, when people are going to and coming home from work. On weekends, the density is higher in the afternoon, when people tend to go on leisure rides.
```{r}
Trips %>% 
  mutate(rent_hour = hour(sdate),
         rent_min = minute(sdate),
         rent_time = rent_hour + (rent_min / 60),
         week_day = wday(sdate, label = TRUE)) %>% 
  ggplot(aes(x = rent_time)) +
  geom_density() +
  facet_wrap(vars(week_day)) +
  labs(title = "Density of Bike Rentals by Time of Day and Day of the Week", 
       y = "", x = "Time of Day")
```
The variable `client` describes whether the renter is a regular user (level `Registered`) or has not joined the bike-rental organization (level `Casual`). The next set of exercises investigates whether these two categories of users show different rental behavior and how `client` interacts with the patterns you found in the previous exercises.
11. Change the graph from exercise 10 to set the `fill` aesthetic for `geom_density()` to the `client` variable. You should also set `alpha = .5` for transparency and `color=NA` to suppress the outline of the density function.
```{r}
Trips %>% 
  mutate(rent_hour = hour(sdate),
         rent_min = minute(sdate),
         rent_time = rent_hour + (rent_min / 60),
         week_day = wday(sdate, label = TRUE)) %>% 
  ggplot(aes(x = rent_time)) +
  geom_density(aes(fill = client), color = NA, alpha = .5) +
  facet_wrap(vars(week_day)) +
  labs(title = "Density of Bike Rentals by Time of Day and Day of the Week", 
       y = "", x = "")
```
12. Change the previous graph by adding the argument `position = position_stack()` to `geom_density()`. In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
```{r}
Trips %>% 
  mutate(rent_hour = hour(sdate),
         rent_min = minute(sdate),
         rent_time = rent_hour + (rent_min / 60),
         week_day = wday(sdate, label = TRUE)) %>% 
  ggplot(aes(x = rent_time)) +
  geom_density(aes(fill = client), color = NA, alpha = .5, 
               position = position_stack()) +
  facet_wrap(vars(week_day)) +
  labs(title = "Density of Bike Rentals by Time of Day and Client Status", 
       y = "", x = "")
```
> I think the stacked version can be problematic because the two client groups are piled on top of each other, so we cannot clearly read the density for each group individually. On the other hand, it does show the overall total more clearly, since the two groups are combined, which can be an advantage in some situations.
13. In this graph, go back to using the regular density plot (without `position = position_stack()`). Add a new variable to the dataset called `weekend` which will be "weekend" if the day is Saturday or Sunday and "weekday" otherwise (HINT: use the `ifelse()` function and the `wday()` function from `lubridate`). Then, update the graph from the previous problem by faceting on the new `weekend` variable.
```{r}
Trips %>% 
  mutate(rent_hour = hour(sdate),
         rent_min = minute(sdate),
         rent_time = rent_hour + (rent_min / 60),
         week_day = wday(sdate, label = TRUE),
         weekend = ifelse(week_day == "Sat" | week_day == "Sun", "weekend", "weekday")) %>% 
  ggplot(aes(x = rent_time)) +
  geom_density(aes(fill = client), color = NA, alpha = .5) +
  facet_wrap(vars(weekend)) +
  labs(title = "Density of Bike Rentals by Time of Day and Client Status,\nFaceted by Weekend and Weekday", 
       y = "", x = "", fill = "Client Status")
```
14. Change the graph from the previous problem to facet on `client` and fill with `weekday`. What information does this graph tell you that the previous didn't? Is one graph better than the other?
> This graph shows, in separate panels for registered and casual users, how rentals fluctuate between weekdays and weekends. It conveys essentially the same information as the previous graph; only the facet and fill variables are switched, so neither is clearly better than the other.
```{r}
Trips %>% 
  mutate(rent_hour = hour(sdate),
         rent_min = minute(sdate),
         rent_time = rent_hour + (rent_min / 60),
         week_day = wday(sdate, label = TRUE),
         weekend = ifelse(week_day == "Sat" | week_day == "Sun", "weekend", "weekday")) %>% 
  ggplot(aes(x = rent_time)) +
  geom_density(aes(fill = weekend), color = NA, alpha = .5) +
  facet_wrap(vars(client)) +
  labs(title = "Density of Bike Rentals by Time of Day and Day of Week", y = "", x = "")
```
### Spatial patterns
15. Use the latitude and longitude variables in `Stations` to make a visualization of the total number of departures from each station in the `Trips` data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!
```{r}
Trips %>% 
  group_by(sstation) %>% 
  summarize(count_station = n()) %>% 
  left_join(Stations,
            by = c("sstation" = "name")) %>% 
  ggplot(aes(x = long, y = lat, color = count_station)) +
  geom_point() +
  labs(title = "Total Number of Bike Departures by Station", 
       x = "Longitude", y = "Latitude", color = "Number of Departures")
```
16. Only 14.4% of the trips in our data are carried out by casual users. Create a plot that shows which area(s) have stations with a much higher percentage of departures by casual users. What patterns do you notice? (Again, we'll improve this next week when we learn about maps).
> The stations with the highest percentage of casual departures are either in areas with a lot of rentals in general (central locations that are convenient for renting a bike) or in more isolated areas, which may be near parks or scenic routes where people go on leisure rides.
```{r}
Trips %>% 
  group_by(sstation) %>% 
  mutate(binary = ifelse(client == "Casual", 1, 0)) %>% 
  summarise(prop = mean(binary)) %>% 
  left_join(Stations,
            by = c("sstation" = "name")) %>% 
  ggplot(aes(x = long, y = lat, color = prop)) +
  geom_point() +
  labs(title = "Proportion of Casual Bike Departures by Station", 
       x = "Longitude", y = "Latitude", color = "Proportion of Casual Rentals")
```
**DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.**
## Dogs!
In this section, we'll use the data from 2022-02-01 Tidy Tuesday. If you didn't use that data or need a little refresher on it, see the [website](https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-02-01/readme.md).
17. The final product of this exercise will be a graph that has breed on the y-axis and the sum of the numeric ratings in the `breed_traits` dataset on the x-axis, with a dot for each rating. First, create a new dataset called `breed_traits_total` that has two variables -- `Breed` and `total_rating`. The `total_rating` variable is the sum of the numeric ratings in the `breed_traits` dataset (we'll use this dataset again in the next problem). Then, create the graph just described. Omit Breeds with a `total_rating` of 0 and order the Breeds from highest to lowest ranked. You may want to adjust the `fig.height` and `fig.width` arguments inside the code chunk options (e.g. `{r, fig.height=8, fig.width=4}`) so you can see things more clearly - check this after you knit the file to ensure it looks like what you expected.
```{r, fig.height=12, fig.width=4}
breed_traits_total <- breed_traits %>% 
  mutate(rank_total = rowSums(.[2:7]) + rowSums(.[10:17])) %>% 
  select(Breed, rank_total) %>% 
  filter(rank_total != 0)

ggplot(breed_traits_total, 
       aes(y = fct_reorder(Breed, rank_total, .desc = TRUE), x = rank_total)) +
  geom_point() +
  labs(title = "Total Ranking by Dog Breed", x = "", y = "")
```
18. The final product of this exercise will be a graph with the top-20 dogs in total ratings (from the previous problem) on the y-axis, year on the x-axis, and points colored by each breed's ranking for that year (from the `breed_rank_all` dataset). The points within each breed will be connected by a line, and the breeds should be arranged from the highest median rank to lowest median rank ("highest" is actually the smallest number, e.g. 1 = best). After you're finished, think of AT LEAST one thing you could do to make this graph better. HINTS: 1. Start with the `breed_rank_all` dataset and pivot it so year is a variable. 2. Use the `separate()` function to get year alone, and there's an extra argument in that function that can make it numeric. 3. For both datasets used, you'll need to `str_squish()` Breed before joining.
```{r}
breed_rank_all2 <- breed_rank_all %>% 
  pivot_longer(cols = starts_with("20"),
               names_to = "year",
               values_to = "value") %>% 
  separate(year,
           into = c("year"),
           convert = TRUE) %>% 
  mutate(breed = str_squish(Breed))

breed_top_20 <- breed_traits_total %>% 
  filter(rank_total >= 55) %>% 
  mutate(breed = str_squish(Breed))

new_data <- breed_top_20 %>% 
  left_join(breed_rank_all2,
            by = "breed") %>% 
  group_by(breed) %>% 
  mutate(median_rank = median(value))

ggplot(new_data, 
       aes(x = year, y = fct_reorder(breed, desc(median_rank)), color = value)) +
  geom_point() +
  geom_line() +
  labs(title = "Top 20 Dog Breeds by Total Ranking and Year", 
       color = "Ranking per Year", x = "", y = "")
```
> One improvement: breeds with only one data point (a single year of ranking) should not be included in the graph, because one point is not enough information to show a trend or support a meaningful ordering.
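A minimal sketch of that improvement, reusing the `new_data` object from the chunk above and keeping only breeds with more than one non-missing yearly rank:
```{r, fig.height=6, fig.width=6}
new_data %>% 
  group_by(breed) %>% 
  # keep only breeds with at least two years of non-missing rankings
  filter(sum(!is.na(value)) > 1) %>% 
  ggplot(aes(x = year, y = fct_reorder(breed, desc(median_rank)), color = value)) +
  geom_point() +
  geom_line() +
  labs(title = "Top Breeds with More Than One Year of Rankings", 
       color = "Ranking per Year", x = "", y = "")
```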
19. Create your own! Requirements: use a `join` or `pivot` function (or both, if you'd like), a `str_XXX()` function, and a `fct_XXX()` function to create a graph using any of the dog datasets. One suggestion is to try to improve the graph you created for the Tidy Tuesday assignment. If you want an extra challenge, find a way to use the dog images in the `breed_rank_all` file - check out the `ggimage` library and [this resource](https://wilkelab.org/ggtext/) for putting images as labels.
```{r, fig.height=2, fig.width=5}
breed_traits_modeling <- breed_traits %>% 
  filter(`Coat Length` %in% c("Medium", "Long"), 
         `Coat Type` %in% c("Smooth", "Silky")) %>% 
  mutate(desirable_traits = rowSums(.[c(10, 13, 14)])) %>% 
  select(Breed, desirable_traits)

breed_traits_modeling

breed_traits_total

breed_traits_total %>% 
  right_join(breed_traits_modeling) %>% 
  ggplot(aes(x = rank_total, y = desirable_traits)) +
  geom_point(aes(color = fct_reorder(Breed, rank_total))) +
  labs(title = "Most Well-Rounded Dog Breeds for Modeling", 
       x = str_to_upper("Total Ranking"), 
       y = str_to_upper("Modeling Desirable Traits"), 
       color = "Dog Breed")
```
## GitHub link
20. Below, provide a link to your GitHub page with this set of Weekly Exercises. Specifically, if the name of the file is 03_exercises.Rmd, provide a link to the 03_exercises.md file, which is the one that will be most readable on GitHub.
## Challenge problem!
This problem uses the data from the Tidy Tuesday competition this week, `kids`. If you need to refresh your memory on the data, read about it [here](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-15/readme.md).
21. In this exercise, you are going to try to replicate the graph below, created by Georgios Karamanis. I'm sure you can find the exact code on GitHub somewhere, but **DON'T DO THAT!** You will only be graded on putting effort into this problem. So, give it a try and see how far you can get without doing too much googling. HINT: use `facet_geo()`. The graphic won't load below since it came from a location on my computer, so you'll have to reference the original html on the moodle page to see it.
![](kids_data_karamanis.jpeg)
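Since the graphic may not display here, below is a rough, hedged sketch of the general approach (not a replica of the Karamanis graphic): plot inflation-adjusted spending per child over time for one spending category, laid out on a US-map grid with `facet_geo()`. The choice of the `"lib"` (library spending) category and the axis units are assumptions -- swap in whichever variable the original graphic uses.
```{r, fig.height=10, fig.width=12}
kids %>% 
  filter(variable == "lib") %>%                 # assumed spending category
  ggplot(aes(x = year, y = inf_adj_perchild)) +
  geom_line() +
  facet_geo(~ state) +                          # arranges facets in a US-map layout
  labs(title = "Public Spending on Libraries per Child, by State",
       x = "", y = "Inflation-adjusted spending per child ($1000s)")
```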
**DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?**