-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathfiles-many.qmd
106 lines (78 loc) · 4.89 KB
/
files-many.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
---
title: "Many files"
execute:
cache: false
---
It can be a challenge to face many files to work with. Often they are scattered across a number of directories, and it can be tempting for a novice to manually read each one in a script, and then try to bind them together. Even worse, sometimes files can less than cooperative for new coders by having complicated or non-standard layouts.
And, of course, invariably we want do handle the contents of the many files as *one* object, rather than many objects with one object per file.
In this tutorial we walk through some of the techniques available to the coder to simplify the steps to importing multi-file datasets into R (or python!)
## Mocked data
We have prepared a mock up of a dataset; it consists of studies of guppies from 5 sites. At each site we generate two files: a ["gup" data file](data/guppy/site_01/site_01.gup) and a ["YAML" configuration file](data/guppy/site_01/site_01.yaml). Each is a text file, but the `.gup` file consists of two parts: a header followed by the a CSV style table of data.
## Finding files
R-language provides convenient tools for locating files. If you find these don't suit your needs, then consider using the [fs](https://CRAN.R-project.org/package=fs) R package which has more features. For this project, we'll stick to using base R functionality as much as possible.
Here we use the [`list_guppy()` function](functions/guppy.R#L8). Here's what it finds...
```{r}
source("setup.R")
gup_files = list_guppy() |>
print()
```
### A closer look at thr function
Let's take a closer look at the function. The built-in `list.files()` function is the workhorse here.
```{r}
list_guppy = function(path = here::here("data", "guppy"),
pattern = glob2rx("*.gup"),
recursive = TRUE){
list.files(path, pattern = pattern, recursive= TRUE, full.names = TRUE)
}
```
The function accepts three arguments:
- `path` this is a string providing the pathway to the data files
- `pattern` this is [regular expression](https://en.wikipedia.org/wiki/Regular_expression). We converted a "glob" (wildcard notation) to a "regex" (regular expression) using the handy `glob2rx()` function where we requested the pattern of "any characters followed by '.R' at the very end".
- `recursive` tells `list.files()` to search deeply into subdirectories of `path`.
- `full.names` results in fully-formed filenames including the path.
## Reading the files
We have already written a convenience function, [`read_guppy()`](functions/guppy.R#36), which reads both the [configuration file](files-configurations.qmd) and the ["gup" data file](files-header-table.qmd) files, and then merges them into one file.
### Basic use
We pass in the listing of 5 filenames, and the function returns a single table
```{r}
x = read_guppy(gup_files) |>
glimpse()
```
Let's double check that the number of sites matches the number of files.
```{r}
dplyr::count(x, site_id)
```
We can also read the same data in, but this time request a spatial `sf` objects, where each row's `x` and `y` coordinates are transformed into spatial [POINT](https://r-spatial.github.io/sf/articles/sf1.html#sfg-simple-feature-geometry).
```{r}
x = read_guppy(gup_files, form = "sf") |>
print()
```
:::{.callout-note}
Whether you request a tibble (aka table or data.frame) or an sf object, you can use all of the [tidyverse](https://www.tidyverse.org/) tools for subsequent analyses. If you do your work with spatial analyses in mind, then [sf](https://r-spatial.github.io/sf/index.html) is the way to go.
:::
### A closer look
But how does the function work? How does it merge the metadata file with the tabular data?
The process is the same if you provide one filename or many filenames. We have written the [read_guppy() function](functions/guppy.R#L36) with the comments you see below in the pseudo-code,
where each commetn is flagged with a double hash `##`. You can study the code alongside the
comments to follow along.
```default
for each filename
check that the file at filename exists
use string processing to "make" the companion medatdata filename
read the metadata
check that the metadata file exists
read the metadata as a YAML file
return the metadata
scan all of the text lines in the filename
parse the header extracting the bits of info we want
read the table directly from the text
mutate the the table to include columns of data from the metadata and the header
add the crs info as an attribute
return the table
bind all of the tables by row
convert to sf-object if the user so requests
return the table/sf-object
```
:::{.callout-tip}
Reading and binding tables is such a common task it is well worth learning how to write your own functions rather than waiting for a solution to appear online. Once you have done one or two of these it gets to be **very** easy to do more as you combine little bites into a whole sandwich.
:::