---
title: "How to identify specimen records with suspicious coordinate data, using R and iDigBio"
output:
  html_document:
    code_folding: show
    df_print: kable
---
Code written by [Erica Krimmel](https://orcid.org/0000-0003-3192-0080). Please see **Use Case: [Identify specimen records with suspicious coordinate data](https://biodiversity-specimen-data.github.io/specimen-data-use-case/use-case/identify-suspicious-coordinate-data)** for context. The code is modified from the [original](https://github.com/ekrimmel/idigbio-api-dq-geo) presented at the 2019 ADBC Summit in Gainesville, FL.
```{r message=FALSE}
# Load core libraries; install these packages if you have not already
library(ridigbio)
library(tidyverse)
# Load library for making nice HTML output
library(kableExtra)
# Load libraries for visualizing geographic data
library(leaflet)
```
In this use case for the [iDigBio API](https://github.com/iDigBio/idigbio-search-api/wiki) we explore a situation where **geographic coordinate data** from the provider was modified by iDigBio during its data quality assurance process. See the [iDigBio data quality flags documentation](https://github.com/iDigBio/idigbio-search-api/wiki/Data-Quality-Flags) for more information about these flags.
## Write a query to search for specimen records
First, let's find all the specimen records carrying the data quality flag we are interested in, using the `idig_search_records` function from the `ridigbio` package. You can learn more about this function from the [iDigBio API documentation](https://github.com/iDigBio/idigbio-search-api/wiki) and [ridigbio documentation](https://cran.r-project.org/web/packages/ridigbio/ridigbio.pdf). In this example, we start by searching for specimens flagged with "rev_geocode_corrected" and restrict the search to a single institution with `institutioncode = "lacm"`.
```{r}
# Edit the fields (e.g. `flags`) and values (e.g. "rev_geocode_corrected") in
# `list()` to adjust your query and the fields (e.g. `uuid`) in `fields` to
# adjust the columns returned in your results
df_flagCoord <- idig_search_records(
  rq = list(flags = "rev_geocode_corrected",
            institutioncode = "lacm"),
  fields = c("uuid",
             "institutioncode",
             "collectioncode",
             "country",
             "data.dwc:country",
             "stateprovince",
             "county",
             "locality",
             "geopoint",
             "data.dwc:decimalLongitude",
             "data.dwc:decimalLatitude",
             "flags"),
  limit = 100000) %>%
  # Rename fields to more easily reflect their provenance (either from the
  # data provider directly or modified by the data aggregator)
  rename(provider_lon = `data.dwc:decimalLongitude`,
         provider_lat = `data.dwc:decimalLatitude`,
         provider_country = `data.dwc:country`,
         aggregator_lon = `geopoint.lon`,
         aggregator_lat = `geopoint.lat`,
         aggregator_country = country,
         aggregator_stateprovince = stateprovince,
         aggregator_county = county,
         aggregator_locality = locality) %>%
  # Reorder columns for easier viewing
  select(uuid, institutioncode, collectioncode, provider_lat, aggregator_lat,
         provider_lon, aggregator_lon, provider_country, aggregator_country,
         aggregator_stateprovince, aggregator_county, aggregator_locality,
         flags)
```
Here is what our query result data looks like:
```{r echo = FALSE}
# Show the first 50 rows of `df_flagCoord` as an example; `head()` avoids
# padding the table with empty rows if fewer than 50 records are returned
head(df_flagCoord, 50) %>%
  select(-flags) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                font_size = 12,
                fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "400px")
```
## Visualize suspicious coordinates
One example of a geographic coordinate data quality issue is a latitude or longitude with a reversed sign, e.g. the data provider gave the value *latitude* = "7.1789" but meant *latitude* = "-7.1789". In the map below we can see a few examples of specimen records published to iDigBio where this is the case. These data have been adjusted by iDigBio, and this action is recorded with the data quality flag "rev_geocode_lat_sign", which is the flag the code below subsets on.
```{r}
# Create function to allow subsetting the `df_flagCoord` dataset by other flags
# found on these same records
df_flagSubset <- function(subsetFlag) {
  df_flagCoord %>%
    filter(grepl(subsetFlag, flags)) %>%
    select(uuid, matches("_lat|_lon")) %>%
    unite(provider_coords, c("provider_lat", "provider_lon"), sep = ",") %>%
    unite(aggregator_coords, c("aggregator_lat", "aggregator_lon"), sep = ",") %>%
    gather(key = type, value = coordinates, -uuid) %>%
    separate(coordinates, c("lat", "lon"), sep = ",") %>%
    mutate(lat = as.numeric(lat)) %>%
    mutate(lon = as.numeric(lon)) %>%
    arrange(uuid, type)
}
# Subset `df_flagCoord` by records flagged for having had their latitude negated
# to place point in stated country by reverse geocoding process
df_rev_geocode_lat_sign <- df_flagSubset("rev_geocode_lat_sign")
# Create map displaying a few examples of records with the
# rev_geocode_lat_sign flag; each record contributes two rows (provider and
# aggregator coordinates), so ten rows cover five records
pal <- colorFactor(palette = c("#d7191c", "#fdae61", "#ffffbf", "#abdda4", "#2b83ba"),
                   domain = df_rev_geocode_lat_sign$uuid[1:10])
map <- df_rev_geocode_lat_sign[1:10,] %>%
  mutate(popup = str_c(type, " = ", lat, ", ", lon, sep = "")) %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lon,
    lat = ~lat,
    radius = 10,
    weight = 1,
    color = ~pal(uuid),
    stroke = FALSE,
    # Opacity values in leaflet range from 0 to 1
    fillOpacity = 1,
    popup = ~popup) %>%
  addLegend("bottomright", pal = pal, values = ~uuid,
            title = "Specimen Records",
            opacity = 1)
```
We can visualize these data on a map to better understand what the data quality flag is telling us. The map below plots both the provider and the aggregator coordinates for a few example georeferenced specimen records, so you can see the effect of accidentally reversing the sign of the latitude.
```{r echo = FALSE}
map
```
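The sign reversal described above can also be checked numerically rather than by eye. The following is a minimal sketch on invented toy data: the column names mirror `df_flagCoord`, but the values, the `toy_coords` and `toy_checked` objects, and the tolerance `tol` are assumptions made for illustration. With real query results, the provider coordinates may need converting with `as.numeric()` first.

```{r}
library(dplyr)

# Hypothetical toy records; values are invented for illustration
toy_coords <- tibble(
  uuid = c("a", "b", "c"),
  provider_lat = c(7.1789, -33.9, 12.5),
  aggregator_lat = c(-7.1789, -33.9, 12.5),
  provider_lon = c(-78.5, 18.4, 100.2),
  aggregator_lon = c(-78.5, 18.4, -100.2)
)

# Tolerance for comparing floating point values (an assumed value)
tol <- 1e-6

# A coordinate counts as sign-flipped when it is nonzero and the provider and
# aggregator values sum to (approximately) zero
toy_checked <- toy_coords %>%
  mutate(
    lat_sign_flipped = provider_lat != 0 &
      abs(provider_lat + aggregator_lat) < tol,
    lon_sign_flipped = provider_lon != 0 &
      abs(provider_lon + aggregator_lon) < tol
  )

toy_checked %>%
  select(uuid, lat_sign_flipped, lon_sign_flipped)
```

A check like this only compares the two sets of coordinates; it does not tell you which set is correct.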
iDigBio uses the value provided for "country" to identify issues and apply the flag "rev_geocode_flip_lat_sign". This is frequently helpful but not always correct. For example, here is a specimen record where the *country* was recorded as "Antarctica" and georeferenced accordingly, and iDigBio then changed the coordinates in error (probably because the data provider's coordinates are farther offshore than the "country" of Antarctica extends). It is important to recognize what kinds of data quality adjustments (good and bad) aggregators are making to your data, because researchers may not know which set of coordinates to use.
```{r warning = FALSE, message = FALSE}
# Create map displaying example of record possibly assigned the
# rev_geocode_flip_lat_sign flag incorrectly
df_flagCoord %>%
  filter(uuid == "004fa3d0-7d99-4af4-98b8-dd6c64e68906") %>%
  select(uuid, matches("_lat|_lon")) %>%
  unite(provider_coords, c("provider_lat", "provider_lon"), sep = ",") %>%
  unite(aggregator_coords, c("aggregator_lat", "aggregator_lon"), sep = ",") %>%
  gather(key = type, value = coordinates, -uuid) %>%
  separate(coordinates, c("lat", "lon"), sep = ",") %>%
  mutate(lat = as.numeric(lat)) %>%
  mutate(lon = as.numeric(lon)) %>%
  arrange(uuid, type) %>%
  mutate(popup = str_c(type, " = ", lat, ", ", lon, sep = "")) %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lon,
    lat = ~lat,
    radius = 10,
    weight = 1,
    color = ~pal(uuid),
    stroke = FALSE,
    # Opacity values in leaflet range from 0 to 1
    fillOpacity = 1,
    popup = ~popup) %>%
  addLegend("bottomright", pal = pal, values = ~uuid,
            title = "Specimen Record",
            opacity = 1)
```
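When deciding which set of coordinates to trust, it can help to know how far an adjustment moved a point. The sketch below computes the great-circle distance between two coordinate pairs with the haversine formula. The function name `haversine_km`, the spherical Earth radius of 6371 km, and the example coordinates are assumptions for illustration and are not part of `ridigbio` or the original analysis.

```{r}
# Great-circle distance in kilometres between two points given in decimal
# degrees, using the haversine formula on a spherical Earth (assumed radius)
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical example: the same longitude with the latitude sign reversed
haversine_km(7.1789, -78.5, -7.1789, -78.5)
```

Because the function is vectorized, it could be applied to the `provider_*` and `aggregator_*` columns inside `mutate()` once those columns are numeric.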
## Summarize and explore data
The iDigBio API provides a means for an institution to examine data quality issues across collections, which sometimes is not possible internally when data in different collections are managed in different databases.
```{r}
# Summarize flagged records by collection type
spmByColl <- df_flagCoord %>%
  group_by(collectioncode) %>%
  tally()
# Generate graph to display counts of flagged records by collection within the
# institution
graph_spmByColl <- ggplot(spmByColl,
                          aes(x = reorder(collectioncode, -n),
                              y = n,
                              fill = collectioncode)) +
  geom_col() +
  theme(panel.background = element_blank(),
        legend.title = element_blank(),
        axis.title.x = element_text(face = "bold"),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(size = 12, face = "bold")) +
  labs(x = "collection",
       y = "# of specimen records",
       title = "LACM records flagged with geo-coordinate data quality issues by iDigBio") +
  geom_text(aes(label = n, vjust = -0.5))
# Get count of total records published by the institution using function
# `idig_count_records`
totalInstSpm <- idig_count_records(rq = list(institutioncode = "lacm"))
# Calculate flagged records as percent of total records
percentFlagged <- sum(spmByColl$n) / totalInstSpm * 100
```
For example, we can ask how many specimen records from each collection at the Natural History Museum of Los Angeles County (LACM) have been flagged as "rev_geocode_corrected" by iDigBio. *As an aside, although this graph highlights the number of specimen records with data quality issues, these represent only `r round(percentFlagged, 2)`% of the total specimen records published by LACM.*
```{r echo = FALSE}
graph_spmByColl
```
We can also explore what *other* data quality flags these specimen records have been flagged with.
```{r}
# Collate `df_flagAssoc` to describe other data quality flags that are associated
# with rev_geocode_corrected in `df_flagCoord`
df_flagAssoc <- df_flagCoord %>%
  select(uuid, flags) %>%
  unnest(flags) %>%
  group_by(flags) %>%
  tally() %>%
  mutate("category" = case_when(
    str_detect(flags, "geo|country|state")
    ~ "geography",
    str_detect(flags, "dwc_datasetid_added|dwc_multimedia_added|datecollected_bounds")
    ~ "other",
    str_detect(flags, "gbif|dwc|tax")
    ~ "taxonomy")) %>%
  mutate("percent" = n / (nrow(df_flagCoord)) * 100) %>%
  arrange(category, desc(n))
# Visualize associated data quality flags
ggplot(df_flagAssoc, aes(x = reorder(flags, -percent), y = percent, fill = category)) +
  geom_col() +
  theme(axis.title.x = element_text(face = "bold"),
        axis.text.x = element_text(angle = 75, hjust = 1),
        axis.ticks.y = element_blank(),
        axis.title.y = element_text(face = "bold"),
        plot.title = element_text(size = 12, face = "bold")
  ) +
  labs(x = "additional iDigBio data quality flag",
       y = "% specimen records",
       title = "LACM records flagged for geo-coordinate issues are also flagged for...",
       fill = "flag category")
```
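The summary above relies on `flags` being a list column, with one character vector of flags per record. As a rough illustration of what `unnest()` does to such a column, here is a small sketch on invented data. The `toy_flags` tibble and its uuids are hypothetical; the flag names are the ones already mentioned in this document. The last step shows one possible way to match a flag exactly, using `purrr::map_lgl()` with `%in%` instead of pattern matching.

```{r}
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: one character vector of flags per record
toy_flags <- tibble(
  uuid = c("a", "b", "c"),
  flags = list(
    c("rev_geocode_corrected", "rev_geocode_lat_sign"),
    c("rev_geocode_corrected", "rev_geocode_flip_lat_sign"),
    c("rev_geocode_corrected")
  )
)

# Expand to one row per record-flag pair, then count records per flag
toy_flag_counts <- toy_flags %>%
  unnest(flags) %>%
  count(flags, name = "n") %>%
  arrange(desc(n), flags)

# Keep only records that carry exactly this flag
toy_exact <- toy_flags %>%
  filter(map_lgl(flags, ~ "rev_geocode_lat_sign" %in% .x))

toy_flag_counts
toy_exact
```

Exact matching matters when one flag name is contained in another: a pattern such as "rev_geocode_flip" would also match "rev_geocode_flip_lat_sign".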