---
title: "Practicing on Real-World Data"
subtitle: "Project Management, Day 2"
authors: "Lorrie He, Matthew Sutcliffe"
format:
  html:
    toc: true
output:
  html_document:
editor: visual
---
```{r setup, echo=FALSE}
# read in data for later
ebola0 <- read.table("data/project-day-2-files/datasets/ebola.csv", header = TRUE, sep = ",")
temps0 <- read.table("data/project-day-2-files/datasets/temperatures.csv", header = TRUE, sep = ",")
avida0 <- read.table("data/project-day-2-files/datasets/avida/avida_wildtype.csv", header = TRUE, sep = ",")
knitr::opts_chunk$set(echo = FALSE)
options(digits = 3) # digits is an R option rather than a knitr chunk option
```
## Final class!
In the previous class, you completed the following steps:
> 1. Created a GitHub account.
>
> 2. Linked your GitHub account with your Longleaf account.
>
> 3. Created and pushed your first repository to GitHub.
Today, now that you've been through the entire course, we want to reinforce the skills you've learned by having you apply them to several existing datasets, working through the wrangling, analysis, and visualization process as independently as you can. By the end of this lesson, you should have a GitHub repo containing wrangling and analysis scripts for at least one of these datasets that you can show to others (or at least keep as a reference for your future self).
For each of these datasets (or as many as time allows), we'd like you to do some basic analysis, including the following steps:
> 1. Create a directory for the analysis of this dataset, using best practices discussed in the previous class.
>
> 2. Load in the data.
>
> 3. Perform some initial data exploration (What are the rows? What are the columns? How many samples does this dataset have? etc.)
>
> 4. Identify at least 1 research question that you could try to answer with this dataset.
>
> 5. Format the data in a way that allows for this analysis.
>
> 6. Visualize the data in helpful ways to answer your question.
>
> 7. Make your code reproducible by using GitHub.
To help with your analyses, we have included a **project template** document outlining these steps that you are welcome to use in `data/project-day-2-files/example-analyses`; the locations of additional files are specified below.
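If you'd like a sense of what the first few steps of such a script might look like before you open the template, here is a minimal sketch in base R (the path is just a placeholder; point it at whichever csv you choose):

```{r, eval=FALSE}
# Step 2: load in the data (placeholder path; swap in the dataset you chose)
my_data <- read.csv("data/project-day-2-files/datasets/ebola.csv")

# Step 3: initial exploration
dim(my_data)     # how many rows and columns?
str(my_data)     # what are the column names and types?
head(my_data)    # what do the first few rows look like?
summary(my_data) # quick numeric summaries, including NA counts
```

Steps 4 through 7 depend on the dataset and question you choose; the "Analysis ideas and hints" callouts below include sketches for some of the trickier wrangling.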
## Datasets
The **csvs** are available in `data/project-day-2-files/datasets`. We have provided links to data that is publicly available for download, as well as some related publications that may give you more information about the data.
These datasets were chosen because they represent a good variety of the kinds of wrangling and visualization challenges you might encounter with your own data in the future; we have highlighted a few specific R skills that some of these datasets are meant to challenge you on.
::: callout-note
The datasets are presented roughly in order of increasing wrangling complexity. Though you should have most of the basic skills you need to wrangle and analyze these datasets, we have specifically provided some code to aid you in certain wrangling tasks for some of the later datasets: see the "Analysis ideas and hints" callout for each dataset.
:::
For each dataset, we have provided a description of the research question/data collection methods, metadata, and the first few lines of each dataset. Read through these descriptions and work on the dataset you find most interesting. Feel free to work with a partner!
::: {.callout-tip title="But I thought this was HTLTCode, not HTLTScience!"}
To get the most out of today's activities, we highly encourage you to practice going through the whole data analysis workflow as independently as you can, using your notes and practicing your Googling skills to develop your own research questions and analysis ideas for these datasets.
That said, we have prepared some example research questions and figures for each dataset to get everyone started, so you can decide whether you'd rather focus more on coding or on science-ing for each dataset. These research questions are split into "easier" and "harder" coding challenges, with "harder" challenges generally involving slightly more wrangling. Every dataset has a **pre-filled project template**, in the same location as the blank template document, that explains the code used to produce the example figures.
Thus, based on your comfort level, you can decide how much of the data analysis workflow you want to try independently today. You can:
1. Just recreate our figures,
2. Develop your own analysis and figures to answer our research questions, or
3. Develop your own research questions, analysis, and figures.
:::
### Avida digital evolution dataset
[Related publications](https://avida-ed.msu.edu/digital-evolution/)[^1]
[^1]: In particular, see [Lenski et al. (2003)](https://www.nature.com/articles/nature01568) and [Smith et al. (2016)](https://link.springer.com/article/10.1186/s12052-016-0060-0).
**Skills to practice:** working with multiple files, pivoting
**Background:** The following data was generated using [Avida-ED](https://avida-ed.msu.edu/avida-ed-application/), an online educational application that allows one to study the dynamics of evolutionary processes. Digital, asexually-reproducing organisms known as "Avidians" can be placed into something akin to a virtual Petri dish to evolve in, and one can manipulate parameters such as mutation rate, resource availability, and dish size to study how those factors affect the evolution of the population.
Your friend needs your help to analyze their Avida-Ed data. They designed a series of experiments around a mutant Avidian that gets an energy bonus when the sugar "nanose" is present and a wildtype Avidian that does not. They wanted to see how competition and resource availability affect the population dynamics of these Avidians.
Your friend grew either the wildtype only, mutant only, or both populations together (competition) in the following 3 environments:
- Minimal (no additional sugars present)
- Selective (only nanose present)
- Rich (nanose and additional sugars present)
![Avida experiment setup](data/project-day-2-files/figs/avida-expt.png){fig-align="center" width="65%"}
Help your friend get started with some of the exploratory data analysis. *What relationships can you find between genotype, competition, and resource availability?*
::: {.callout-note collapse="true" title="Data overview"}
::: panel-tabset
#### Data preview
Data for `avida_wildtype.csv` shown only:
```{r}
knitr::kable(head(avida0))
```
#### Column metadata
| Column | Type | Description | Values |
|------------------|------------------|-------------------|------------------|
| `update` | integer | Time elapsed | Ranges 0-300 |
| `condition` | character | Testing conditions (single population or competition) | Based on source csv: `wildtype-only`, `mutant-only`, `competition` |
| `media_avg.fitness` | numeric | Average individual reproductive success in specified media | Ranges 0-1 |
| `media_avg.offspring.cost` | numeric | Average individual reproductive cost in specified media | |
| `media_avg.energy.acq.rate` | numeric | Average individual rate of energy acquisition from environment in specified media | |
| `media_pop.size` | integer | Population size in specified media | Ranges 0-900 |
All population measurements were determined for `minimal`, `selective`, and `rich` media.
:::
:::
::: {.callout-tip collapse="true" title="Analysis ideas and hints"}
::: panel-tabset
#### Analysis hints
- Which files have the pieces of data you need? Do you need to mix-and-match anything?
- The data could be "tidier"... (see the sketch below)
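The sketch below acts on both hints. It assumes that the mutant and competition csvs sit alongside `avida_wildtype.csv` in the `avida/` folder, and that the measurement columns follow the `media_measurement` naming pattern described in the metadata; adjust the paths and names if yours differ.

```{r, eval=FALSE}
library(tidyverse)

# Stack every csv in the avida folder into one data frame; the `condition`
# column already records which experiment each row came from
avida_files <- list.files("data/project-day-2-files/datasets/avida",
                          pattern = "\\.csv$", full.names = TRUE)
avida <- avida_files |>
  map(read.csv) |>
  bind_rows()

# Tidy the media-specific columns: names like minimal_avg.fitness are split
# at the underscore into a `media` column plus one column per measurement
avida_tidy <- avida |>
  pivot_longer(cols = -c(update, condition),
               names_to = c("media", ".value"),
               names_sep = "_")
```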
#### Example questions
Use these questions to develop your own or answer them as-is. There is no "single" or "correct" way to answer these questions, and you can refine or broaden their scope as needed.
**Bolded** questions have an accompanying figure and code.
------------------------------------------------------------------------
Easier
- **How does population size change over time for the wildtype (or mutant) in different media types?**
- What is the relationship between fitness and offspring cost in competition conditions in different media types?
Harder
- **How does the change in fitness over time compare between different treatment conditions and media types?**
- Which media type allows for the best performance of the wildtype? The mutant? Competition conditions?
#### Example figures
**Easier: How does population size change over time for the wildtype (or mutant) in different media types?**
![](data/project-day-2-files/figs/avida-easy.png){width="75%"}
**Harder: How does the change in fitness over time compare between different treatment conditions and media types?**
![](data/project-day-2-files/figs/avida-hard.png){width="75%"}
:::
:::
### Western Africa Ebola public health dataset
[Data source](https://www.kaggle.com/datasets/imdevskp/ebola-outbreak-20142016-complete-dataset)
**Skills to practice:** working with dates, pivoting
**Background:** The Western African Ebola virus epidemic of 2013-2016 was the most severe outbreak of Ebola virus disease (EVD) in history. It caused major disruptions and loss of life, mainly in the republics of Guinea, Liberia, and Sierra Leone.
*How might you depict the dynamics of this outbreak?*
::: {.callout-note collapse="true" title="Data overview"}
::: panel-tabset
#### Data preview
```{r}
knitr::kable(head(ebola0))
```
#### Column metadata
| Column | Type | Description | Values |
|-------------------|------------------|------------------|------------------|
| `Country` | character | Country of report | |
| `Date` | character | Date of report | YYYY-MM-DD |
| `Cumulative.no..of.confirmed..probable.and.suspected.cases` | numeric | Cumulative number till this day | |
| `Cumulative.no..of.confirmed..probable.and.suspected.deaths` | numeric | Cumulative number till this day | |
:::
:::
::: {.callout-tip collapse="true" title="Analysis ideas and hints"}
::: panel-tabset
#### Analysis hints
- This dataset contains data for other countries besides the three named above. For simplicity, you may want to focus on only those three regions.
- The data could be "tidier"...
- You can use `format()` to extract specific parts of a date object: e.g. if `x` is a date object with the format `%Y-%m-%d`, you can get the year with `format(x, "%Y")` (see the sketch below).
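Putting these hints together, one possible starting point is sketched below. The column names come from the metadata above; check that the country labels match how they actually appear in the `Country` column.

```{r, eval=FALSE}
library(tidyverse)

ebola <- read.csv("data/project-day-2-files/datasets/ebola.csv")

ebola_tidy <- ebola |>
  # keep only the three countries at the center of the outbreak
  filter(Country %in% c("Guinea", "Liberia", "Sierra Leone")) |>
  # convert the report date to a Date object, then pull out the year
  mutate(Date = as.Date(Date, format = "%Y-%m-%d"),
         Year = format(Date, "%Y")) |>
  # stack the two cumulative columns into a tidier measure/count pair
  pivot_longer(cols = starts_with("Cumulative"),
               names_to = "measure",
               values_to = "count")
```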
#### Example questions
Use these questions to develop your own or answer them as-is. There is no "single" or "correct" way to answer these questions, and you can refine or broaden their scope as needed.
**Bolded** questions have an accompanying figure and code.
------------------------------------------------------------------------
Easier
- **How many cases and deaths in total were recorded by each country from 2014-2016?**
- For a specific year, how did the number of cases and deaths change over time for each country?
Harder
- **By country, how did the average number of cases and deaths change each year?**
- Are there any seasonal patterns in the average cases and deaths?
#### Example figures
**Easier: How many cases and deaths in total were recorded by each country from 2014-2016?**
![](data/project-day-2-files/figs/ebola-easy.png){width="75%"}
**Harder: By country, how did the average number of cases and deaths change each year?**
![](data/project-day-2-files/figs/ebola-hard.png){width="75%"}
:::
:::
### Heat exposure in Phoenix, Arizona ecological dataset
[Data source](https://data.sustainability-innovation.asu.edu/cap-portal/metadataviewer?packageid=knb-lter-cap.647.2) \| [Related publication](https://doi.org/10.1016/j.envint.2020.106271)
**Skills to practice:** parsing strings, dealing with `NA` values
**Background:** Exposure to extreme heat is of growing concern with the rise of urbanization and ongoing climate change. Though most current knowledge about heat-health risks is gathered and applied at the neighborhood level, less is known about individual experiences of heat, which can vary with access to cooling resources and activity patterns.
To further investigate, the Central Arizona-Phoenix Long-Term Ecological Research Program [(CAP-LTER)](https://sustainability-innovation.asu.edu/caplter/) recruited participants from 5 Phoenix-area neighborhoods to wear air temperature sensors that recorded their individually-experienced temperatures (IETs) as they went about their daily activities.
*What relationships can you find between individual activity, temperature, and neighborhood?*
::: {.callout-note collapse="true" title="Data overview"}
::: panel-tabset
#### Data preview
```{r}
knitr::kable(head(temps0))
```
#### Column metadata
| Column | Type | Description | Values |
|------------------|------------------|------------------|-------------------|
| `Subject.ID` | character | Subject identifier where number (1-5) corresponds to neighborhood | 1=Coffelt, 2=Encanto-Palmcroft, 3=Garfield, 4=Thunderhill, 5=Power Ranch |
| `period` | character | 4 hour measurement period | weekday, period |
| `temperature` | numeric | 4 hour average of IET during specified period | Degrees Celsius |
:::
:::
::: {.callout-tip collapse="true" title="Analysis ideas and hints"}
::: panel-tabset
#### Analysis hints
- We recommend using functions in the `stringr` package from the tidyverse to parse the strings, particularly `str_sub()` and `str_split_fixed()`.
- e.g. If you had a vector `x` containing a bunch of 2-character codes (`XX`), you could extract the first character with `str_sub(x, 1, 1)`.
- e.g. If you had a vector `x` containing information in the format `first-second-third`, you could extract the first chunk with `str_split_fixed(x, "-", n = 3)[,1]`.
- R has some "counting" functions such as `length()` and `n()`: how could you get unique counts? Is there a single function that does that? (One option appears in the sketch below.)
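Here is a sketch that puts these hints together. It assumes the neighborhood number is the first character of `Subject.ID`; if your IDs are laid out differently, `str_split_fixed()` may be the better tool.

```{r, eval=FALSE}
library(tidyverse)

temps <- read.csv("data/project-day-2-files/datasets/temperatures.csv")

temps_clean <- temps |>
  # drop measurement periods where no temperature was recorded
  filter(!is.na(temperature)) |>
  # assumes the neighborhood code (1-5) is the first character of Subject.ID
  mutate(neighborhood = str_sub(Subject.ID, 1, 1))

# count the unique participants in each neighborhood with n_distinct()
temps_clean |>
  group_by(neighborhood) |>
  summarize(n_participants = n_distinct(Subject.ID))
```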
#### Example questions
Use these questions to develop your own or answer them as-is. There is no "single" or "correct" way to answer these questions, and you can refine or broaden their scope as needed.
**Bolded** questions have an accompanying figure and code.
------------------------------------------------------------------------
Easier
- **How many participants were there for each neighborhood in the study?**
- On average, which 4-hour measurement period is the warmest? The coolest?
Harder
- **What was the daily average temperature for each neighborhood during the study period?**
- For a specific neighborhood on a specific day, what were all of the individual temperatures experienced for each 4-hour measurement period?
#### Example figures
**Easier: How many participants were there for each neighborhood in the study?**
![](data/project-day-2-files/figs/temps-easy.png){width="75%"}
**Harder: What was the daily average temperature for each neighborhood during the study period?**
![](data/project-day-2-files/figs/temps-hard.png){width="75%"}
:::
:::