# Reading and Writing Data
So far we have only used datasets from `R` packages or created toy tibbles. In this section, we will learn how to read in data from a variety of sources.
## Reading `csv` files with `readr`
We will start with the `readr` package, which is useful for a) reading in `csv` files and b) correctly parsing data columns. Before we start, let's load some packages.
```{r, warning = FALSE, message = FALSE}
library(readr)
library(tibble)
library(dplyr)
```
<!---
### Working directory
First steps first: To read in data from a specific location on our hard drive we should specify a working directory. You can get the current working directory with `getwd()` and set a new one with `setwd()`. In this case, I set the working directory to the `r_public` folder that I created before.
```{r}
cat("Current working directory:")
getwd()
my_wd = "/Users/jlanger/Dropbox/uzh_programming/r_public"
cat("New working directory: ", my_wd)
#setwd(my_wd) # delete comment here when running code
```
--->
### Reading in *-delimited data
The `readr` package provides several functions to read in delimited data:
* `read_csv()`: comma delimited
* `read_csv2()`: semicolon delimited
* `read_tsv()`: tab delimited
* `read_delim()`: any delimiter
To see how they work, let's create some data and read them in. (I know, I know, still no real data. Be patient!)
```{r}
my_csv = "a, b, c, d
1, 2, 3, 4
5, 6, 7, 8"
read_csv(my_csv)
```
As you can see, the function correctly interpreted the first line of our string as variable names and the remaining elements as comma-separated numbers. The other functions work in a similar way. There is one exception though: `read_delim`. It allows for more general specifications.
```{r}
my_csv = "a_ b_ c_ d
1_ 2_ Maria_ female
5_ 6_ Teresa_ female"
read_delim(my_csv, delim = "_", trim_ws = TRUE)
```
In this case, we specified an underscore as the delimiter and told the function to trim leading and trailing whitespace. Sometimes the `csv` file includes lines which we want `readr` to ignore. To do this we use the `skip` argument.
```{r}
my_csv = "Sometimes you can read some rubbish here
We don't want to import this
name, age
Julian, 29"
read_csv(my_csv, skip = 2)
```
At other times, the `csv` file does not provide variable names. We can supply them by passing a character vector as the `col_names` argument.
```{r}
my_csv = "Julian, 29\nTeresa, 25"
read_csv(my_csv, col_names = c("Name", "Age"))
```
There are more options and you can explore them by looking them up in the help file. For now, I will only show you one more useful option: You can use the `na` argument to specify the characters used in the `csv` file to indicate missing values.
```{r}
my_csv = "Julian, 29
Teresa, 25
Jonas, .
., 64"
read_csv(my_csv, col_names = c("Name", "Age"), na = ".")
```
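The other delimiter-specific functions listed above can be tried out in the same way. The following is a small sketch with made-up strings (`my_csv2` and `my_tsv` are illustrative names, not part of the original notes). Note that `read_csv2()` also treats the comma as the decimal mark, which is the usual convention in files that use semicolons as delimiters.

```{r}
library(readr)

# semicolon-delimited data with a comma as the decimal mark
my_csv2 = "name;height
Julian;1,85
Teresa;1,70"
read_csv2(my_csv2)

# tab-delimited data
my_tsv = "name\tage
Julian\t29
Teresa\t25"
read_tsv(my_tsv)
```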
## Parsing data correctly with the `parse` functions
Sometimes columns are not interpreted correctly when they are read in. We can re-parse such columns using the `parse` functions. Each of these functions takes a character vector and returns a more specialised vector. For example, assume in the following that I want the `age` column not as a numeric vector but as a character vector. Because `read_csv` has already parsed `age` as numeric, I first convert it back with `as.character` and then pass it to `parse_character`.
```{r, error=TRUE}
my_csv = "Julian, 29
Teresa, 25
Jonas, .
., 64"
my_tibble = read_csv(my_csv, col_names = c("name", "age"), na = ".")
head(my_tibble)
# the parse functions expect character input, so convert the numeric column first
str(parse_character(as.character(my_tibble$age)))
```
There are a bunch of functions, each for a different kind of data type:
* `parse_logical`
* `parse_number` (`parse_double`, `parse_integer`)
* `parse_character`
* `parse_datetime` (`parse_date`, `parse_time`)
Let's look at some of them in the following.
### Parsing numbers
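The `parse_number` function is pretty amazing. It extracts the first number it finds in each string and drops any non-numeric characters around it, so it can recover number vectors from almost anything! Just look at the following example.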
```{r}
my_column = c("100$", "20%", "Something with a 100")
parse_number(my_column)
```
You can use the `locale` function to specify country-specific marks for the decimal point and for grouping digits.
```{r}
# comma instead of decimal point
my_column = c("1,23", "1,23", "1,245")
parse_number(my_column, locale = locale(decimal_mark = ","))
```
```{r}
# ' to group numbers
my_column = c("123'456'789")
parse_number(my_column, locale = locale(grouping_mark = "'"))
```
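Both marks can also be set at the same time. The following is a minimal sketch with made-up strings, assuming numbers written with a point as the grouping mark and a comma as the decimal mark (as is common in Germany, for example), and a Swiss-style amount with a currency prefix.

```{r}
library(readr)

# "." groups digits, "," marks the decimal point
parse_number("1.234,56", locale = locale(decimal_mark = ",", grouping_mark = "."))

# "'" groups digits; the currency prefix is dropped
parse_number("CHF 1'234.50", locale = locale(grouping_mark = "'"))
```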
### Parsing characters
You would think parsing characters is pretty easy. There can be difficulties though because not everybody uses the same character encoding. To learn more about this topic, take a look at this website: http://www.w3.org/International/articles/definitions-characters/. We only need to know that different encodings exist and that they can lead to problems. `readr` assumes `UTF-8` encoding by default (and you should use it too!). See what happens if we read in characters with `Latin-1` encoding:
```{r}
x1 = "El Ni\xf1o was particularly bad this year"
parse_character(x1)
```
Well, that does not look nice. But luckily enough, we can use the `locale` function to tell `readr` that the string is encoded with `Latin-1`.
```{r}
parse_character(x1, locale = locale(encoding = "Latin1"))
```
Now, this time the parsing is correct! You can also use `readr` to try to guess the encoding with the `guess_encoding` function. Look up its help file if you want to know more.
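As a rough sketch of how `guess_encoding` can be used on the string from above: the function expects a file path or a raw vector, so the string is converted with `charToRaw` first. This assumes that the `stringi` package, which `guess_encoding` relies on, is installed. The result is a tibble of candidate encodings with a confidence score each. With so little text, the guess should be treated with caution.

```{r}
library(readr)

x1 = "El Ni\xf1o was particularly bad this year"
# convert the string to raw bytes and let readr suggest encodings
guess_encoding(charToRaw(x1))
```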
### Parsing dates
```{r}
# ISO 8601 date-times; without a time zone, UTC is assumed
parse_datetime("2016-09-08T0708")
parse_datetime("20160908T0708")
```
```{r}
parse_date("2016-09-08")
parse_date("2016/09/08")
```
```{r}
library(hms)
parse_time("01:10 am")
```
```{r}
parse_date("27/05/1987", "%d/%m/%Y")
```
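The format string in the last chunk is built from pieces such as `%d` (day), `%m` (month number) and `%Y` (four-digit year). A few more variations, as a sketch with made-up dates: `%b` and `%B` match abbreviated and full month names, `%y` matches a two-digit year, and `%H` and `%M` match hours and minutes. For month names in other languages, you can pass a `locale`.

```{r}
library(readr)

# abbreviated English month name and two-digit year
parse_date("08-Sep-16", "%d-%b-%y")

# full month name in French
parse_date("8 septembre 2016", "%d %B %Y", locale = locale("fr"))

# date-time with an explicit format
parse_datetime("08.09.2016 07:08", "%d.%m.%Y %H:%M")
```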
## Parsing and reading at the same time
Each `parse` function has a corresponding `col` function. This allows you to use the `parse` function to find out how to correctly parse a column and then specify the correct parsing right at the beginning of the data processing using the corresponding `col` function. I usually read in data in three steps.
1. First, I read in all columns as character vectors. This allows me to browse the data and determine the correct parsing. To read in every column as a character vector, you can set the `.default` argument of the `cols` function, which is passed to `read_csv` as the `col_types` argument.
```{r}
challenge1 = read_csv(
  # there's an example dataset in the readr package called challenge.csv
  readr_example("challenge.csv"),
  col_types = cols(
    .default = col_character()
  )
)
```
2. I can try out different parsers using the `parse` functions. (Note that you can use the `parse` functions from the `readr` package together with other packages such as `readxl`.) In this case, browsing and parsing will lead you to conclude that the correct parsers are `parse_double` and `parse_date`, respectively.
3. Finally, we specify the correct parsers directly at the beginning of the data processing stage using the `col` functions that correspond to the `parse` functions.
```{r}
challenge2 = read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
head(challenge2)
```
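The second step can be sketched as follows. The chunk reads the example file as character columns again so that it runs on its own (`challenge_chr`, `x_parsed` and `y_parsed` are illustrative names). The `problems` function lists the values a parser could not handle, so an empty result suggests that the parser fits the column.

```{r}
library(readr)

# read everything as character first
challenge_chr = read_csv(
  readr_example("challenge.csv"),
  col_types = cols(.default = col_character())
)

# try out candidate parsers on the individual columns
x_parsed = parse_double(challenge_chr$x)
y_parsed = parse_date(challenge_chr$y)

# check whether any values failed to parse
problems(x_parsed)
problems(y_parsed)
```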
## Writing `csv` files
You can use the `readr` package to write `csv` files as well. In this case, we want to save our cleaned up dataframe as a `csv` file in a `dataframes` sub-folder. For this, we first check whether the sub-folder already exists. If it does not, we create it.
```{r}
# dir.exists() checks specifically for a directory
if (!dir.exists("dataframes")) {
  dir.create("dataframes")
}
```
Then, we write the `csv` file using the `write_csv` function.
```{r}
write_csv(challenge2, "./dataframes/challenge2.csv")
```
You can check now in your working directory whether this worked. Note that a `csv` file does not store the information about the correct parsing of the data columns.
```{r}
read_csv("./dataframes/challenge2.csv")
```
We have to specify the correct parsing again! If you only work with `R` and the dataframe is not too big, you can store the dataframe as an `RDS` file instead.
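If you do stay with `csv`, one way to restore the column types is to pass `col_types` again when reading the file back. The following sketch uses a small made-up tibble and a temporary file (`df`, `path` and `df_back` are illustrative names) so that it does not depend on the files created above.

```{r}
library(readr)
library(tibble)

# a toy tibble with a double and a date column
df = tibble(x = c(1.5, 2), y = as.Date(c("2016-09-08", NA)))

# write it to a temporary csv file
path = tempfile(fileext = ".csv")
write_csv(df, path)

# read it back and state the column types explicitly
df_back = read_csv(path, col_types = cols(x = col_double(), y = col_date()))
str(df_back$y)
```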
## Write and read `RDS` files with `readr`
There is not much to say here apart from the fact that the `RDS` file 'remembers' the correct parsing.
```{r}
write_rds(challenge2, "./dataframes/challenge2.rds")
challenge_rds = read_rds("./dataframes/challenge2.rds")
head(challenge_rds)
```
## Reading in Excel sheets and Stata data with `readxl` and `haven`
Many datasets are stored in Excel sheets. You can read them in using the `readxl` package. A similar package, `haven`, exists for Stata files. I will not use them here, but you can look at the help files for the commands `read_excel` and `read_dta`.
```{r}
library(readxl)
help(read_excel)
```
```{r, message = FALSE}
library(haven)
help(read_dta)
```
If you want to try them out, you can use the data files from the 'Baby-Wooldridge' here: http://www.cengage.com/aise/economics/wooldridge_3e_datasets/.
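For a quick first try without downloading anything, both packages ship small example files. The following is a minimal sketch that assumes the example files `datasets.xlsx` (in `readxl`) and `iris.dta` (in `haven`) are present in your installed versions of the packages.

```{r}
library(readxl)
library(haven)

# example workbook shipped with readxl
xlsx_path = readxl_example("datasets.xlsx")
excel_sheets(xlsx_path)                  # list the sheets in the workbook
read_excel(xlsx_path, sheet = "mtcars")  # read one sheet by name

# example Stata file shipped with haven
dta_path = system.file("examples", "iris.dta", package = "haven")
read_dta(dta_path)
```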
## Sources {-}
The exposition here is inspired by the notes for a new book on R data science by Garrett Grolemund and Hadley Wickham. You can find detailed outlines here: http://r4ds.had.co.nz.