dplyr and tidyr notes.Rmd

---
title: "dplyr"
author: "Camila Vargas"
date: "3/6/2018"
output: html_document
---

Chapter 6: Data wrangling - dplyr

How do you organize your data - Data wrangling - keeping the raw data raw but moving around columnes and rows to have the necesary tables to answer the questions you want to answer.

First make it tydy and the wrangle it how you want it.

tidyverse - group of packeges that contains all the necesary packeges to organize, transform adn visualize your data.

We are going to wrangle gapminder data using `diplyr`

```{r}
library(tidyverse) #intall.packages (tidyverse)

#read in data from url
gapminder <- read.csv("https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/gapminder.csv")

```

Explore
```{r}
head(gapminder) #explore the first rows and columnes of the data set

```

```{r}
tail(gapminder, 10) #10 last entries of the data source
```

Explore the structure of the data

```{r}
str(gapminder)
```


Data frame allows you to have number and words in one data set.
Tibble is the new generation of data frame where you can combine all sort of things, and easily use ggplot


Othe information about the dataframe

```{r}
names(gapminder) #colnames
dim(gapminder) #dimension of the data frame how many rows and how many col
ncol(gapminder) #number of col
nrow(gapminder) #number of rows of gapminder


#create a dim from ncol and nrow
c(nrow(gapminder), ncol(gapminder)) #c() is a function that combines. In this case number of rows and number of col in the data frame gapminder


```


Summary statistics
Looks at statistics quickly and also tells you if there are any NA

```{r}
summary(gapminder)
```

Skimmer packege (skim function for organized summary of data frames)

```{r}
library(skimr) #instal.package (skimr)
skim(gapminder)

```


Explore inside the data frame

```{r}
gapminder$lifeExp
```

```{r}
head(gapminder$lifeExp) #just the head of the variable lifexpectancy
```

Filtering observation across rows we use the filter function

Selecting is how you filter by columnes

mutate

summarise

arrange

#Using dplyr

Filter by rows
```{r}
library(tidyverse)
filter(gapminder, lifeExp <29)
names(gapminder)

filter(gapminder, country=="Mexico")

#to filter two countries use the %in% operator

x <- filter(gapminder, country %in% c("Mexico", "Chile"))


#filter one country in a specific year

filter(gapminder, country=="Mexico", year==2002)

```


Select by columnes

```{r}
select(gapminder, year, lifeExp)
```

Use the `-` to desselect

```{r}
select(gapminder, -continent, -lifeExp)
```


Use filter and select together

```{r}
gap_Cambidia <- filter(gapminder, country=="Cambodia")
gap_Cambodia2 <- select(gap_Cambidia, -continent, -lifeExp)


```


Life changing pipe operator %>% ... "And then"

```{r}
gapminder %>% head() #equivalent to head(gapminder)
```


```{r}
gapminder %>% head(3) #equivalent to head(gapminder, 3)
```


Lets do this
gap_Cambidia <- filter(gapminder, country=="Cambodia")
gap_Cambodia2 <- select(gap_Cambidia, -continent, -lifeExp)

using the pipe

```{r}
gap_cambodia <- gapminder %>% filter(country=="Cambodia")
gap_cambodia2 <- gap_cambodia %>% select(-continent, -lifeExp)

#even better and easier

gap_cambodia <- gapminder %>% 
  filter(country=="Cambodia") %>% 
  select(-continent, lifeExp)


```


More tidyr!!

How to add variables using `mutate`

```{r}
gapminder %>% 
  mutate(index=1:nrow(gapminder)) #adding a column call index that goes from 1 to as many rows as the data frame has
```


```{r}
gapminder %>% 
  mutate(gdp= pop*gdpPercap) #create a new columne that multiplies the pop variable from gapminder data set times gdpPercap. Once I anounce the data frame name i can just call the variables
```

Excersise
Find the max gdp Egypt and Vietnam and safe it ain a new column

```{r}
gapminder %>% 
  filter(country %>% c("Egypt", "Vietnam")) %>% 
  mutate(max_gdp=max(gdpPercap))


gapminder %>% 
  group_by(country) %>% 
  filter(country %>% c("Egypt", "Vietnam")) %>% 
  mutate(max_gpd=max(gdpPercap)) %>% 
  ungroup() 


```

When you use the function group by, you have to make sure to ungroup the data frame if not you can encounter errors when working witht he grouped frame.
For solving this you have to ungroup at the ende onf the code.


Sdummerise and groupby finctions together!

Finding tme max gdp for all countries!

```{r}
gapminder %>% 
  group_by(country) %>% 
  mutate(gdp= pop*gdpPercap) %>%
  summarize(max_gdp= max(gdp)) %>% 
  ungroup()
  

```


Arraging the information in a certain order
Default form arrange funtion is from min to max

```{r}
gapminder %>% 
  group_by(country) %>% 
  mutate(gdp= pop*gdpPercap) %>%
  summarize(max_gdp= max(gdp)) %>% 
  ungroup() %>% 
  arrange(max_gdp)
```

For arranging in the max to min way

```{r}
gapminder %>% 
  group_by(country) %>% 
  mutate(gdp= pop*gdpPercap) %>%
  summarize(max_gdp= max(gdp)) %>% 
  ungroup() %>% 
  arrange(desc(max_gdp))

```

Lets try to gdp in countries in Asia

```{r}
gapminder %>% 
  filter(continent=="Asia") %>% 
  group_by(country) %>%
  mutate(gdp= pop*gdpPercap) %>%
  summarize(max_gdp= max(gdp)) %>% 
  ungroup() %>% 
  arrange(desc(max_gdp))
  
  
```


Joining Data Sets

Make desicion on what is important and whatyou want to keep. For example which informationof what columnes you want to merge. This will determine if you are doing a left join or a right join.

```{r}
#read Co2 data
co2 <- read.csv("https://raw.githubusercontent.com/OHI-Science/data-science-training/master/data/co2.csv")


#explore the data
head(co2)

#gapminder 2007

gap_2007 <- gapminder %>% 
  filter(year==2007)

#left join gap_2007 to co2

lj <- left_join(gap_2007, co2, by="country")


```


Explore lj

```{r}
lj %>% dim()
lj %>% summary()
```

Right join

```{r}
rj <- right_join(gap_2007, co2, by="country")
```