---
title: "dplyr"
author: "Camila Vargas"
date: "3/6/2018"
output: html_document
---
Chapter 6: Data wrangling - dplyr
Data wrangling is how you organize your data: keep the raw data raw, but rearrange columns and rows to build the tables you need to answer your questions.
First make the data tidy, then wrangle it into the shape you want.
The tidyverse is a group of packages that contains the tools needed to organize, transform, and visualize your data.
We are going to wrangle the gapminder data using `dplyr`.
```{r}
library(tidyverse) # install.packages("tidyverse")
#read in data from url
gapminder <- read.csv("https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/gapminder.csv")
```
Explore
```{r}
head(gapminder) #explore the first rows and columns of the data set
```
```{r}
tail(gapminder, 10) #10 last entries of the data source
```
Explore the structure of the data
```{r}
str(gapminder)
```
A data frame lets you store numbers and words (character data) in one data set.
A tibble is the tidyverse's newer generation of data frame; its columns can hold many kinds of values, and it works smoothly with ggplot2.
Other information about the data frame
```{r}
names(gapminder) #colnames
dim(gapminder) #dimension of the data frame how many rows and how many col
ncol(gapminder) #number of col
nrow(gapminder) #number of rows of gapminder
#reproduce the output of dim() from nrow() and ncol()
c(nrow(gapminder), ncol(gapminder)) #c() combines values into a vector, here the number of rows and the number of columns of gapminder
```
Summary statistics
`summary()` gives a quick look at basic statistics for each column and also tells you whether there are any NA values.
```{r}
summary(gapminder)
```
skimr package (the `skim()` function gives an organized summary of a data frame)
```{r}
library(skimr) # install.packages("skimr")
skim(gapminder)
```
Explore inside the data frame
```{r}
gapminder$lifeExp
```
```{r}
head(gapminder$lifeExp) #just the head of the life expectancy variable
```
`filter()` picks observations by row.
`select()` picks variables by column.
`mutate()` adds new variables.
`summarise()` collapses values into a summary.
`arrange()` reorders rows.
# Using dplyr
Filter by rows
```{r}
library(tidyverse)
filter(gapminder, lifeExp <29)
names(gapminder)
filter(gapminder, country=="Mexico")
#to filter two countries use the %in% operator
x <- filter(gapminder, country %in% c("Mexico", "Chile"))
#filter one country in a specific year
filter(gapminder, country=="Mexico", year==2002)
```
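A minimal sketch of why `%in%` is used instead of `==` when filtering on several values. The data frame `toy` is a made-up example for illustration, not part of gapminder.

```{r}
library(dplyr)

# hypothetical toy data, used only for illustration
toy <- data.frame(
  country = c("Mexico", "Chile", "Peru", "Mexico", "Chile", "Peru"),
  year    = c(2002, 2002, 2002, 2007, 2007, 2007),
  stringsAsFactors = FALSE
)

# %in% keeps every row whose country is in the set
both <- filter(toy, country %in% c("Mexico", "Chile"))
nrow(both) # 4

# == recycles the vector and compares element by element,
# so matching rows are silently dropped
recycled <- filter(toy, country == c("Mexico", "Chile"))
nrow(recycled) # 2
```

With `==` the comparison runs position by position against a recycled vector, so only rows that happen to line up are kept. That is why `%in%` is the safer choice for matching against a set.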
Select by columns
```{r}
select(gapminder, year, lifeExp)
```
Use `-` to deselect columns
```{r}
select(gapminder, -continent, -lifeExp)
```
Use filter and select together
```{r}
gap_cambodia <- filter(gapminder, country=="Cambodia")
gap_cambodia2 <- select(gap_cambodia, -continent, -lifeExp)
```
The life-changing pipe operator `%>%`: read it as "and then"
```{r}
gapminder %>% head() #equivalent to head(gapminder)
```
```{r}
gapminder %>% head(3) #equivalent to head(gapminder, 3)
```
Let's redo the two-step Cambodia example from above (`filter()` on `country == "Cambodia"`, then `select()` to drop `continent` and `lifeExp`) using the pipe:
```{r}
gap_cambodia <- gapminder %>% filter(country=="Cambodia")
gap_cambodia2 <- gap_cambodia %>% select(-continent, -lifeExp)
#even better and easier: chain both steps in one pipeline
gap_cambodia <- gapminder %>%
  filter(country == "Cambodia") %>%
  select(-continent, -lifeExp)
```
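To check that the pipe only changes how the code reads, not what it computes, here is a small sketch comparing a nested call with its piped form. The data frame `toy` and its values are made up for illustration.

```{r}
library(dplyr)

# hypothetical toy data with the same column names as gapminder
toy <- data.frame(
  country   = c("Cambodia", "Cambodia", "Chile"),
  continent = c("Asia", "Asia", "Americas"),
  year      = c(2002, 2007, 2007),
  lifeExp   = c(56.8, 59.7, 78.6),
  stringsAsFactors = FALSE
)

# nested form: read from the inside out
nested <- select(filter(toy, country == "Cambodia"), -continent, -lifeExp)

# piped form: read top to bottom, "and then"
piped <- toy %>%
  filter(country == "Cambodia") %>%
  select(-continent, -lifeExp)

identical(nested, piped) # TRUE
```

`x %>% f(y)` is evaluated as `f(x, y)`, so both versions return the same data frame.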
More dplyr!!
How to add variables using `mutate`
```{r}
gapminder %>%
  mutate(index = 1:nrow(gapminder)) #add a column called index that runs from 1 to the number of rows in the data frame
```
```{r}
gapminder %>%
  mutate(gdp = pop * gdpPercap) #create a new column that multiplies pop by gdpPercap. Once the data frame is named at the start of the pipe, its variables can be referred to by name
```
Exercise
Find the maximum GDP per capita (`gdpPercap`) for Egypt and Vietnam and save it in a new column
```{r}
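# without grouping, max() is taken over both countries together
gapminder %>%
  filter(country %in% c("Egypt", "Vietnam")) %>%
  mutate(max_gdp = max(gdpPercap))
# with group_by(), max() is computed separately for each country
gapminder %>%
  group_by(country) %>%
  filter(country %in% c("Egypt", "Vietnam")) %>%
  mutate(max_gdp = max(gdpPercap)) %>%
  ungroup()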
```
After `group_by()`, the data frame stays grouped, so later operations keep running per group; this can cause errors or unexpected results when you keep working with the grouped frame.
To avoid this, call `ungroup()` at the end of the pipeline.
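A small sketch of the difference grouping makes, using a made-up data frame `toy` (the numbers are not real gapminder values).

```{r}
library(dplyr)

# hypothetical toy data, used only for illustration
toy <- data.frame(
  country   = c("Egypt", "Egypt", "Vietnam", "Vietnam"),
  gdpPercap = c(10, 30, 20, 5),
  stringsAsFactors = FALSE
)

# no grouping: one maximum for the whole table
overall <- toy %>%
  mutate(max_gdp = max(gdpPercap))
overall$max_gdp # 30 30 30 30

# grouped: one maximum per country, then drop the grouping
per_country <- toy %>%
  group_by(country) %>%
  mutate(max_gdp = max(gdpPercap)) %>%
  ungroup()
per_country$max_gdp # 30 30 20 20

# confirm the result no longer carries the grouping
is_grouped_df(per_country) # FALSE
```

Without the final `ungroup()`, any later `mutate()` or `filter()` on `per_country` would still be evaluated country by country.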
Using `summarize()` and `group_by()` together!
Finding the max gdp for all countries!
```{r}
gapminder %>%
  group_by(country) %>%
  mutate(gdp = pop * gdpPercap) %>%
  summarize(max_gdp = max(gdp)) %>%
  ungroup()
```
Arranging the information in a certain order
By default, `arrange()` sorts from min to max
```{r}
gapminder %>%
  group_by(country) %>%
  mutate(gdp = pop * gdpPercap) %>%
  summarize(max_gdp = max(gdp)) %>%
  ungroup() %>%
  arrange(max_gdp)
```
To arrange from max to min, wrap the column in `desc()`
```{r}
gapminder %>%
  group_by(country) %>%
  mutate(gdp = pop * gdpPercap) %>%
  summarize(max_gdp = max(gdp)) %>%
  ungroup() %>%
  arrange(desc(max_gdp))
```
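The same group, summarize, and arrange steps can be checked by hand on a tiny table. This is only a sketch: `toy` and its country names "A", "B", "C" are invented for illustration.

```{r}
library(dplyr)

# hypothetical toy data: two years per country
toy <- data.frame(
  country   = c("A", "A", "B", "B", "C", "C"),
  pop       = c(10, 20, 5, 5, 100, 50),
  gdpPercap = c(2, 3, 10, 30, 1, 1),
  stringsAsFactors = FALSE
)

# one row per country holding its largest gdp
max_gdp_tbl <- toy %>%
  group_by(country) %>%
  mutate(gdp = pop * gdpPercap) %>%
  summarize(max_gdp = max(gdp)) %>%
  ungroup()

# default order is min to max; desc() reverses it
ascending  <- arrange(max_gdp_tbl, max_gdp)
descending <- arrange(max_gdp_tbl, desc(max_gdp))

ascending$country  # "A" "C" "B"
descending$country # "B" "C" "A"
```

Here the per-country maxima are 60 for A, 150 for B, and 100 for C, so the two orderings are easy to verify by eye.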
Let's find the max gdp for the countries in Asia
```{r}
gapminder %>%
  filter(continent == "Asia") %>%
  group_by(country) %>%
  mutate(gdp = pop * gdpPercap) %>%
  summarize(max_gdp = max(gdp)) %>%
  ungroup() %>%
  arrange(desc(max_gdp))
```
Joining Data Sets
Decide what is important and what you want to keep, for example which table's rows and columns must survive the merge. This determines whether you do a left join or a right join: a left join keeps every row of the first table, and a right join keeps every row of the second.
```{r}
#read Co2 data
co2 <- read.csv("https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/co2.csv")
#explore the data
head(co2)
#gapminder 2007
gap_2007 <- gapminder %>%
  filter(year == 2007)
#left join gap_2007 to co2
lj <- left_join(gap_2007, co2, by="country")
```
Explore lj
```{r}
lj %>% dim()
lj %>% summary()
```
Right join
```{r}
rj <- right_join(gap_2007, co2, by="country")
```
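A minimal sketch of how the two joins differ, using two made-up tables (`gap_toy` and `co2_toy` are hypothetical and their values are not real data). Each table has one country the other lacks.

```{r}
library(dplyr)

# hypothetical toy tables, used only for illustration
gap_toy <- data.frame(
  country = c("Chile", "Mexico", "Peru"),
  pop     = c(16, 108, 28),
  stringsAsFactors = FALSE
)
co2_toy <- data.frame(
  country = c("Chile", "Mexico", "Kenya"),
  co2     = c(1.5, 3.0, 0.5),
  stringsAsFactors = FALSE
)

# left join: keep every row of gap_toy; Peru has no co2 value, so it gets NA
lj_toy <- left_join(gap_toy, co2_toy, by = "country")

# right join: keep every row of co2_toy; Kenya has no pop value, so it gets NA
rj_toy <- right_join(gap_toy, co2_toy, by = "country")

lj_toy
rj_toy
```

Peru appears only in the left join and Kenya only in the right join, which is the practical meaning of choosing which table to keep whole.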