---
title: "A Model Workflow for Locating Pliocene and Pleistocene Ice-Rafted Debris using GeoDeepDive"
author:
- Simon J. Goring^[Department of Geography, University of Wisconsin-Madison, USA]
- Jeremiah Marsicek^[Department of Geoscience, University of Wisconsin-Madison, USA]
- Shan Ye^[Department of Geoscience, University of Wisconsin-Madison, USA]
- John W. Williams^[Department of Geography, University of Wisconsin-Madison, USA]
- Stephen E. Meyers^[Department of Geoscience, University of Wisconsin-Madison, USA]
- Shanan E. Peters^[Department of Geoscience, University of Wisconsin-Madison, USA]
- Daven Quinn^[Department of Geoscience, University of Wisconsin-Madison, USA]
- Allen Schaen^[Department of Geoscience, University of Wisconsin-Madison, USA]
- Brad S. Singer^[Department of Geoscience, University of Wisconsin-Madison, USA]
- Shaun A. Marcott^[Department of Geoscience, University of Wisconsin-Madison, USA]
output:
  html_document:
    code_folding: show
    highlight: pygment
    keep_md: yes
    number_sections: no
    css: style/common.css
  word_document:
    toc: no
always_allow_html: yes
dev: svg
highlight: tango
bibliography: style/references.bib
csl: style/elsevier-harvard.csl
abstract: Machine learning technology promises a more efficient and scalable approach to locating and aggregating data and information from the burgeoning scientific literature. Realizing this promise requires provision of applications, data resources, and the documentation of analytic workflows. GeoDeepDive provides a digital library comprising over 11 million peer-reviewed documents and the computing infrastructure upon which to build and deploy search and text-extraction capabilities using regular expressions and natural language processing. Here we present a model GeoDeepDive workflow and accompanying R package to extract spatiotemporal information about site-level records in the geoscientific literature. We apply these capabilities in a case study to generate a preliminary distribution of ice-rafted debris (IRD) records in both space and time. We use regular expressions and natural language-processing utilities to extract and plot reliable latitude-longitude pairs from publications containing IRD, and also extract age estimates from those publications. This workflow and R package provides researchers from the geosciences and allied disciplines a general set of tools for querying spatiotemporal information from GeoDeepDive for their own science questions.
keywords: Text and data mining, Pliocene, Pleistocene, ice-rafted debris
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning = FALSE)
suppressMessages(library(dplyr, quiet = TRUE))
# Get document type!
docType <- knitr::opts_knit$get('rmarkdown.pandoc.to')
```
## Introduction
Peer-reviewed papers communicate knowledge to their audiences through figures, text, and tables. These elements have been refined over generations to communicate scientific insight efficiently to human readers, but the volume and variety of the peer-reviewed literature challenge meta-analysis and the efficient extraction of the underlying primary data. Efforts to improve the practice of data archiving in structured, sustainable data repositories are increasing [@sansone2019fairsharing] as individuals and groups recognize the importance of data sharing and curation [@pagesSteering]. Despite these efforts, however, a large volume of data and information still exists exclusively in published form: as text within manuscripts, embedded in tables, or graphically within figures. In response, automated software tools based on natural language processing (NLP) and other forms of machine learning (ML) are being developed to extract information directly from the scientific literature, although the vast majority of these tools are developed and deployed to extract information from general, freely available content such as Twitter feeds and publication abstracts. The development of these tools has rapidly outpaced their application in the geosciences, where digital libraries and supporting infrastructure could address questions at scales not accessible to traditional meta-analyses.
GeoDeepDive ([http://geodeepdive.org]()) is a digital library and computing system that currently contains over 11 million publications from multiple commercial and open-access content providers. Early versions of GeoDeepDive have been used to extract fossil occurrences from the scientific literature [@peters2014machine], e.g. to better understand the temporal patterns and possible drivers of stromatolite resurgences in the geological past [@peters2017rise]. However, the newness of GeoDeepDive as a platform and the scarcity of available software tools to leverage it have limited its impact [@marsicek2018automated]. Here we provide a sample workflow and accompanying R package that leads the reader through key, public elements of GeoDeepDive. As a case study, we retrieve documents on the distribution of ice-rafted debris (IRD) from the Pliocene to present and extract both geographic coordinates and temporal information. We choose to focus our effort on IRD because of the near uniqueness of the acronym in the geoscience literature and because IRD is almost exclusively restricted to ocean settings, thus simplifying identification of false positives – occurrences of IRD that do not refer to ice-rafted debris – in the training dataset. IRD distribution in marine sediments provides a key constraint on cryosphere development, yielding insight into past climate evolution [*e.g.*, @hemming2004heinrich;@ruddiman_late_1977;@andrews_abrupt_1998;@heinrich1988origin;@bond_iceberg_1995;@clark_mechanisms_2007;@marcott_ice-shelf_2011;@alvarez-solas_links_2010;@barker_icebergs_2015;@bassis_heinrich_2017].
One implementation of GeoDeepDive uses sentences as the atomic unit, managing the sentence-level data within a PostgreSQL database. Each sentence within a paper is identified by 1) a unique document id (GDDID, or `gddid` in the accompanying code; an internally unique identifier to accommodate publications that may or may not have their own formal digital object identifier [DOI]) and 2) a sentence number that is assigned and unique within the paper. A separate table relates GDDIDs to publication metadata (e.g. title, journal, authors, etc.). Hence, GeoDeepDive workflows and their individual steps effectively operate at two distinct levels: the sentence level and the document level. Because GeoDeepDive (GDD) also provides unique IDs for each journal and links these to the sentence IDs, journal-level analytics are possible. GDD makes use of Stanford NLP [@manning2014stanford], so it is also possible to obtain word-level analyses using word indexing within sentences. For this paper, we focus on sentence- and document-level analytics, because our aim is to extract geospatial and temporal information for individual records.
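As a minimal sketch of this two-table structure (the object names `sentences` and `publications` here are illustrative; the columns follow the GDD exports used later in this workflow), sentence-level rows can be rolled up to the document level by joining on the shared identifier:
```{r sentenceTableSketch, eval=FALSE}
library(dplyr)

# Join the sentence-level table to the publication metadata table on the
# shared `_gddid` key, then count sentences per document.
sentences %>%
  inner_join(publications, by = "_gddid") %>%
  group_by(`_gddid`, title) %>%
  summarise(n_sentences = n(), .groups = "drop")
```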
![](images/flowchart.svg)
**Figure 1**: *Workflow used to go from a list of documents that mention ice-rafted debris (IRD; IRD is the actual search string in this case) and (Pliocene or Pleistocene or Holocene) to a vetted set of the documents, and finally a summary of the documents and relevant information.*
This paper presents a sample workflow, intended to provide meaningful but preliminary results, with the main goal of illustrating the potential of sentence-level query capabilities in GDD, and showing potential users how GDD can be used to extract information from text. In this example workflow, we will identify papers with mentions of ice-rafted debris (IRD) in the Pliocene and Pleistocene (Fig. 1), extract space and time coordinates using a new R toolkit called geodiveR ([http://github.com/EarthCubeGeochron/geodiveR]()), and store the data and code in a GitHub repository ([http://github.com/EarthCubeGeoChron]()).
Many publications document the existence of IRD at the level of individual marine drilling sites, but assembling this information across publications into large-scale mapped syntheses is a non-trivial task that has traditionally taken years of painstaking literature compilation [@stern2013north;@ruddiman_late_1977;@heinrich1988origin;@hemming2004heinrich]. A comprehensive, accurate database of IRD deposits and their spatial distribution -- extracted through a combination of advanced software and human expertise from the published scientific literature -- can help the scientific community better understand and characterize ice sheet dynamics over the last 5.3 million years, ideally leading to a better understanding of how glaciers respond to changes in climate and ocean circulation.
The goal is not to remove the expert (*e.g*., sedimentologist, paleoceanographer) from decision-making processes, but, rather, to show how the DeepDive infrastructure can be employed to perform research tasks more efficiently. Various subtleties and complexities persist that are not yet tractable to machine learning. For example, ice-rafted debris is part of a complex of sedimentary deposits within ocean sediments, including iceberg, ice shelf, and sea ice-rafted debris [@powell_glacimarine_1984]. Differences in the processes of sediment entrainment and deposition among these types of deposits may challenge interpretation [@andrews_abrupt_1998]. Sedimentological features evaluated by the expert can be used to differentiate particle sizes and shapes, offering a better understanding of the sources of sediments identified as ice-sourced [@st2015microfeatures], and thus providing a more complete picture of the processes that led to deposition. Furthermore, the geospatial and temporal information retrieved by GDD requires vetting.
Nonetheless, by providing a comprehensive corpus of documents, with identified publications, timings, and locations for identified deposits, it becomes possible to identify potential outliers or misidentified samples, to re-sample existing physical cores where there may be uncertainty in identification, and ultimately, to generate a more complete model of sea-ice patterns during the Pleistocene. As a first step forward, this paper provides a model workflow to be carried out with the DeepDive infrastructure. We begin with a general walkthrough of the analytical steps and various considerations that arise at each stage, then move to a specific walkthrough that focuses on identifying papers with IRD records from the Pliocene and Pleistocene, extracting spatial and temporal coordinates, and mapping the returned results. We show that a relatively simple framework is already able to recover a substantial body of useful information that can inform further data processing, cleaning, and interpretation that would be part of a more formal analysis of paleo-oceanographic patterns and climate evolution during the Pleistocene.
## Workflow Overview
### Initial Returns and RegEx
Processing the entirety of the documents within GDD is time-consuming because the body of accessible papers (the corpus) contains over 11 million peer-reviewed publications. For any given goal, such as finding all instances of IRD, only a small fraction of the total corpus is relevant. Keywords are a common first solution to reducing data volume. Particular keywords within a document can be used to identify the subset of all documents that are potentially relevant. The GDD public API ([https://geodeepdive.org/api]()) supports simple string matching using keywords. Here, we chose terms that would return a sufficient breadth of documents for this trial study. We used the acronym "IRD", as well as constraints on geologic time intervals, "Holocene", "Pleistocene" and "Pliocene". This API query returned hits from 5,315 documents within the greater GeoDeepDive corpus.
Useful information about these terms can be derived from the "snippets" route of GeoDeepDive's API. The snippets route harnesses an ElasticSearch index spanning the full text of all PDFs that have a "native" text layer (*i.e*., PDFs that already contain searchable text, which constitute the vast majority of PDFs distributed by journal publishers):
```
https://geodeepdive.org/api/snippets?term=IRD&full_results=true
```
```{r getSimpleApi, echo=FALSE}
ird_alone <- jsonlite::fromJSON('https://geodeepdive.org/api/snippets?term=IRD&full_results=true')
ird_hits <- ird_alone$success$hits
```
The response to this API call is a JSON object that indicates the total number of "hits" of the term (n = `r ird_hits` as of `r Sys.Date()`), basic bibliographic citation information for each document containing the term, including a link to the original PDF distributed by the publisher, and a "snippet" of text around mentions of the term in the full text of the document:
```{r simpleApiReturn, echo = FALSE, results='as-is'}
jsonlite::toJSON(ird_alone$success$data[1,], pretty=TRUE) %>%
jsonlite::prettify(indent = 2)
```
By default, the matching terms are highlighted with HTML tags (to remove the tags, the parameter `&clean=true` can be added to the URL for the `snippet` API). Because of the large number of matches in the case above, the response includes a link to the next page of documents containing the term, allowing the user to page through large result sets.
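For example, the first query above could be repeated with the highlighting tags stripped:
```
https://geodeepdive.org/api/snippets?term=IRD&full_results=true&clean=true
```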
```{r getCombinedApi, echo=FALSE}
ird_alone <- jsonlite::fromJSON('https://geodeepdive.org/api/snippets?term=IRD,Pleistocene&inclusive=TRUE&full_results=true')
ird_excl_hits <- ird_alone$success$hits
```
The GeoDeepDive `snippet` API also supports searches that combine multiple terms. The following API call, which returns only papers that contain both "IRD" and "Pleistocene", returns `r ird_excl_hits` hits as of `r Sys.Date()`, far fewer than the more expansive search.
```
https://geodeepdive.org/api/snippets?term=IRD,Pleistocene&inclusive=TRUE&full_results=true
```
The GeoDeepDive API, however, is not designed to provide full functionality on its own; rather, it is designed to be deployed in user-constructed applications. Hence, the GDD API is suitable only for limited data extraction tasks. More powerful analyses require working directly with the PostgreSQL representation of the document text data.
Here, text matching and data extraction from the retrieved body of papers in PostgreSQL uses existing PostgreSQL text and string functions, plus regular expression matching in R using the stringr package [@stringr]. The use of the Stanford NLP library also allows GDD workflows to take advantage of parts-of-speech tagging, and more advanced NLP tools, but these capabilities are not employed in this demonstration workflow.
### Subsetting and Cleaning
We begin our analysis with a subset of the data, consisting of 150 papers, that was pared down from the total corpus using the keyword constraints described above. This subset may still include papers that are not appropriate (i.e. papers in which IRD refers to something other than ice-rafted debris). To obtain a training dataset, we execute a second round of the same text matching, using the same keywords and rules applied at the document level, but now enforced at the sentence level. For example, "IRD" must be located in a sentence with another term (e.g., "IRD" and "Holocene"). These additional rules restrict the total list to 81 documents in which at least one sentence contains a match to the keywords.
Searching for IRD as a keyword retrieves articles that use IRD as an acronym for Ice-Rafted Debris, but it also, for instance, retrieves articles mentioning the French **I**nstitute of **R**esearch for **D**evelopment (Institut de Recherche pour le Développement). Throughout this paper we will refer to *rules*; generally these are statements that resolve to a boolean (TRUE/FALSE) output. So, for example, within our subset we could search for sentences that mention IRD but not CNRS:
```{r noEvalStringrExample, eval=FALSE, echo=TRUE}
sentence <- "this,is,a,IRD,and,CNRS,sentence,we,didnt,want,."
stringr::str_detect(sentence, "IRD") & !stringr::str_detect(sentence, "CNRS")
```
This statement will evaluate to `TRUE` if `IRD` appears in a sentence without `CNRS`. If we apply this sentence-level test at the document level (`any(test == TRUE)`) we can estimate which papers are most likely to have the correct mention of `IRD` for our purposes. This then further reduces the number of papers (and sentences) for our training dataset.
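A minimal sketch of this document-level roll-up (assuming a sentence table such as `all_sentences`, with the `_gddid` and `word` columns returned by GDD):
```{r docLevelFilterSketch, eval=FALSE}
library(dplyr)
library(stringr)

# Flag sentences that mention IRD but not CNRS, then keep only documents
# in which at least one sentence passes the test.
candidate_docs <- all_sentences %>%
  mutate(ird_ok = str_detect(word, "IRD") & !str_detect(word, "CNRS")) %>%
  group_by(`_gddid`) %>%
  summarise(keep = any(ird_ok), .groups = "drop") %>%
  filter(keep)
```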
## Extracting Data
After cleaning and subsetting, we develop a series of tests and workflows to iteratively extract information. In many cases this requires further text matching, and R packages such as stringr are useful for accomplishing this task. Additional support can come from the NLP output that can be generated for the data. In all of these cases, we define clear rules to be tested, and then apply them to the documents.
Because understanding both the distribution of IRD in ocean sediments and the timing of IRD deposition through the Pliocene and Pleistocene is critical for interpreting past ice dynamics, spatial coordinates and geochronologic constraints of the IRD deposits need to be identified within the paper. As with the earlier cleaning step, any paper that lacks either spatial coordinates or ages (or both) is here deemed not of interest. Less restrictive searches could be defined.
Extracting the spatial location and age of an IRD occurrence from a paper that contains "IRD", however, is not sufficient. We need to be able to distinguish between an age related to the event we are interested in and an age reported in a paper for some other reason. So, again, we must develop general rules that distinguish ages of interest from all other ages, and spatial locations of interest from all other locations.
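A hedged sketch of how such document-level rules might be combined (the regular expressions here are placeholders, not the patterns developed later in this paper):
```{r coordAgeRuleSketch, eval=FALSE}
library(dplyr)
library(stringr)

# Keep only documents with at least one coordinate-like sentence AND at
# least one age-like sentence; everything else is set aside.
docs_of_interest <- all_sentences %>%
  mutate(has_coord = str_detect(word, "placeholder coordinate regex"),
         has_age   = str_detect(word, "placeholder age regex")) %>%
  group_by(`_gddid`) %>%
  summarise(coord_and_age = any(has_coord) & any(has_age), .groups = "drop") %>%
  filter(coord_and_age)
```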
## Exploratory Iteration
There are a number of reasons to continue to refine the rules used in this workflow to discover data. First, extraction of text from the PDF and optical character recognition (OCR) are not always accurate, so that some sentences and words are parsed incorrectly. This problem is particularly acute for geographic coordinates (see below). Second, many words have multiple meanings, leading to false positives if only string-matching is used, as the IRD example illustrates. Third, semantic terms and concepts often vary subtly within and among disciplines and journals. As a simple example, if we were interested in retrieving paleoecological information we would need to know that *paleoecology* and *palaeoecology* refer to essentially identical concepts. Similarly, ice-rafted debris may also be referred to as sand-sized layers in the marine context [e.g. @ruddiman_north_1977], while a paleoceanographer might want careful separation among different kinds of IRD, e.g. iceberg-rafted debris (IBRD), ice shelf-rafted debris (ISRD), or sea ice-rafted debris (SIRD). Fourth, the context and placement of words matter. For example, temporal information like 'Holocene' and 'Pliocene' may be found in the Methods, where they refer to marine core locations, or in the Discussion, where they might refer to global climate trends.
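For instance, a single regular expression can often absorb spelling or acronym variants; the patterns below are illustrative sketches and would need testing against the corpus:
```{r variantRegexSketch, eval=FALSE}
library(stringr)

# Match both "paleoecology" and "palaeoecology" with one optional character.
str_detect(c("paleoecology", "palaeoecology"), "pala?eoecology")

# Match IRD and common sub-types (IBRD, ISRD, SIRD) as whole words.
str_detect(c("IRD", "IBRD", "ISRD", "SIRD"), "\\b(?:I|IB|IS|SI)RD\\b")
```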
Some potential pitfalls include:
* OCR matching - commonly mistaken letters (O, Q, o)
* Age reporting variety in units or time datum (e.g., kyr vs. ka; see the sketch after this list)
* Age reporting variety (e.g. radiocarbon years vs. calendar years vs. stratigraphic units)
* GDD sentence construction (failure of GDD to identify proper sentence bounds, complex punctuations)
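A small sketch of how two of these pitfalls might be addressed with targeted string handling (these patterns are assumptions for illustration only, not the expressions used later in this workflow):
```{r ageUnitSketch, eval=FALSE}
library(stringr)

# Accept several age-unit spellings (a, ka, kyr, Ma) before "BP" in the
# comma-delimited word strings returned by GDD.
age_regex <- "\\d+(?:\\.\\d+)?,(?:a|ka|kyr|Ma),BP"
str_detect("the,IRD,layer,dates,to,14.5,kyr,BP,.", age_regex)

# Swap characters commonly confused by OCR (O/0, l/1); in practice this
# replacement should be restricted to numeric tokens.
str_replace_all("l4.5,kyr,BP", c("O" = "0", "l" = "1"))
```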
Repeatedly reviewing matches at the sentence level and at the document level (i.e., "Why did this match or why didn't this paper return a match?"), then refining the workflow rule-sets accordingly, is critical to developing a clear workflow and high-value corpus. In many cases, beginning with very broad tests and slowly paring down to more precise tests is an appropriate approach. In this case, tools like `RMarkdown` are very helpful for interactive data exploration, using packages like `DT` [@dt2020] and `leaflet` [@leaflet2019]. We can assess the distribution of age-like elements within a paper and determine if they match with our initial expectations (e.g. "Why does the article 'Debris fields in the Miocene' contain Holocene-aged matches?"; "Why does a paper about Korea report locations in Belize?"). Depending on the success of the algorithm, the tests can be revised and the process repeated until an acceptable match is found.
### Reproducible and Validated Workflows for a Dynamic Literature
As the GDD workflow develops and is refined, we can begin to report on patterns and findings. Some of these may be semi-qualitative (e.g. "The majority of sites are dated to the LGM"), while others may involve statistical analysis (e.g., "The presence of IRD increases after the Mid-Pliocene Transition (*p<0.05*)"). In an analysis where the underlying dataset is static or a version has been frozen, it is reasonable to develop a paper and report these findings.
However, the publication database in GDD is far from static; the GDD infrastructure acquires more than 10,000 papers per day from multiple sources. Given this, some patterns will change over time as more information is brought to bear. For example, a new ocean drilling campaign might reveal new insights into the spatiotemporal distribution of IRD, or the addition of new records may reveal previously undiscovered search artifacts within the publication record. For this reason, the use of assertions, or testable statements that can be evaluated to TRUE or FALSE within the workflow, becomes critically important. In R we can use the `assertthat` package [@assertthat2019] to support test-driven development and assertions within the workflow.
Test-driven development is common in software development. As developers create new features, a good practice is to first develop tests for the features, to ensure that feature behavior matches expectations. The analogy in our scientific workflow is that findings are features, and as we report them, we want to be assured that those findings are valid. In R the assertthat package provides a tool for testing statements, and providing robust feedback through custom error messages [@assertthat2019].
```{r noEvalDateSummary, eval=FALSE}
howmany_dates <- all_sentences %>%
mutate(hasAge = stringr::str_detect(words,
"regular expression for dates")) %>%
group_by(gddid) %>%
summarise(age_sentences = any(hasAge),
n = n())
# We initially find that less than 10% of papers have dates in them, and report that as an important finding in the paper.
percent_ages <- sum(howmany_dates$age_sentences) / nrow(howmany_dates)
assertthat::assert_that(percent_ages < 0.1,
msg = "More than 10% of papers have ages.")
```
**Text Box 1**. *From the sentences returned by GDD (all_sentences), create a new column called 'hasAge', a boolean variable indicating whether an age is found. Once the variable has been created, group the data to test whether any sentence in each paper (each with a unique gddid) reports an age, and count the total number of sentences within the paper. If we expect that most papers won't have reported ages, we can create an assertion, in this case that fewer than 10% of papers have a reported age. If this assertion fails, a custom error message is returned.*
With this workflow overview we now have mapped out an iterative process that is also responsive to the underlying data. We have developed clear tests under which our findings are valid. We can create a document that combines our code and text in an integrated manner, supporting FAIR Principles [@wilkinson2016fair], and supporting the next generation of reproducible research. In the following section we run through this workflow in detail.
## Ice-Rafted Debris Case Study and Discussion
### Finding Spatial Matches
To begin, we load the packages that will be used, and then import the data:
```{r hide_devtools, results='hide', message=FALSE, echo=FALSE, warning=FALSE}
library(devtools)
devtools::install_github('EarthCubeGeochron/geodiveR')
```
```{r load_data, message=FALSE, warning = FALSE}
library(geodiveR)
library(ggplot2)
library(jsonlite)
library(readr)
library(dplyr)
library(stringr)
library(leaflet)
library(purrr)
library(DT)
library(assertthat)
library(mgcv)
library(flextable)
library(ggthemes)
sourcing <- list.files('R', pattern = ".R$", full.names = TRUE) %>%
map(source, echo = FALSE, print = FALSE, verbose = FALSE)
publications <- fromJSON(txt = 'input/bibjson', flatten = TRUE)
if(!file.exists('input/sentences_nlp352')){
data(nlp)
full_nlp <- nlp
rm(nlp)
colnames(full_nlp) <- c('_gddid', 'sentence', 'wordIndex',
'word', 'partofspeech', 'specialclass',
'wordsAgain', 'wordtype', 'wordmodified')
} else {
full_nlp <- readr::read_tsv('input/sentences_nlp352',
trim_ws = TRUE,
col_names = c('_gddid', 'sentence', 'wordIndex',
'word', 'partofspeech', 'specialclass',
'wordsAgain', 'wordtype', 'wordmodified'))
}
nlp_clean <- clean_corpus(x = full_nlp, pubs = publications)
nlp <- nlp_clean$nlp
```
This code produces an output object that includes a key for the publication (`_gddid`, linking to the `publications` variable), the sentence number of the parsed text, and then both the parsed text and some results from natural language processing. We also obtain a list of gddids to keep or drop given the regular expressions we used to find instances of IRD in the affiliations or references sections of the papers. This analysis returns `r length(unique(nlp[,"_gddid"]))` publications:
```{r demo_table, warning=FALSE, echo = FALSE}
short_table <- nlp %>%
filter(1:nrow(nlp) %in% 1) %>%
str_replace('-LRB-', '(') %>%
str_replace('-RRB-', ')') %>%
as.data.frame(stringsAsFactors = FALSE)
rownames(short_table) <- colnames(nlp_clean)
colnames(short_table) <- 'value'
short_table[nchar(short_table[,1])>40,1] <-
paste0('<code>', substr(short_table[nchar(short_table[,1])>40, 1], 1, 30), ' ... }</code>')
rownames(short_table) <- colnames(nlp)
short_table$description <- c("Unique article identifier",
"Unique sentence identifier within article",
"Index of words within sentence",
"Verbatim OCR word",
"Parts of speech, based on <a href='https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html'>Penn State TreeView</a>",
"Special classes (numbers, dates, &cetera)",
"Words again",
"Word types, based on <a href='http://universaldependencies.org/introduction.html'>universal dependencies</a>.",
"The word that the <code>wordtype</code> is modifying.")
if(docType == 'docx') {
flextable::flextable(short_table)
} else {
short_table %>%
datatable(escape = FALSE, rownames = TRUE, options = list(dom='t'))
}
```
**Table 1.** Columns returned by GeoDeepDive for use in text matching and extraction, along with an example of each column entry.
### Extracting Spatial Coordinates
We are interested in using GDD to obtain site coordinates for locations that contain IRD data spanning the Pliocene and Pleistocene. The goal is to provide relevant site information for use in meta-analysis, or for comparing results to existing geographic locations from the relevant geocoded publications and then linking back to the publications using DOIs.
To obtain geographical coordinates from the paper we must consider several potential issues. The first is that not all coordinates will necessarily refer to an actual ocean core. We may also, inadvertently, find numeric objects that appear to be coordinates but are in fact simply numbers. Therefore, we must identify what exactly we think coordinates might look like and build a regular expression (or set of regular expressions) to accurately extract these values. Since we will process degree-minute-second (DMS) coordinates differently than decimal-degree (DD) coordinates, we generate two regular expressions:
```{r regex_degrees}
dms_regex <- "[\\{,]([-]?[1]?[0-9]{1,2}?)(?:(?:,[°◦o],)|(?:[O])|(?:,`{2},))([1]?[0-9]{1,2}(?:.[0-9]*)),[′'`]?[,]?([[0-9]{0,2}]?)[\"]?[,]?([NESWnesw]?),"
dd_regex <- "[\\{,][-]?[1]?[0-9]{1,2}\\.[0-9]{1,}[,]?[NESWnesw],"
```
These regular expressions allow for negative or positive coordinates: an optional minus sign, an optional leading `1`, followed by one or two digits (`{1,2}`). From there the two expressions differ in structure, reflecting the need to capture the degree symbols or, in the case of decimal degrees, the decimal component of the coordinates. The regular expression is more rigorous (i.e., has fewer matching options) for decimal degrees than for DMS coordinates, because DMS coordinates reported in documents may be interpreted in non-standard ways by the OCR.
The regex commands were constructed to work with the `stringr` package [@stringr2019], so that we obtain five elements from any match, including the full match, the degrees, the minutes, the seconds (which may be an empty string), and the quadrant (NESW).
```{r, regex_apply}
degmin <- str_match_all(nlp$word, dms_regex)
decdeg <- str_match_all(nlp$word, dd_regex)
```
We expect that all coordinates are reported as pairs within sentences, and so we are most interested in finding all sentences that contain pairs of coordinates. We start by finding the publications with sentences that have coordinate pairs:
```{r, coord_dtable, echo = FALSE}
coord_pairs <- sapply(degmin, function(x)length(x) %% 2 == 0 & length(x) > 0) |
sapply(decdeg, function(x)length(x) %% 2 == 0 & length(x) > 0)
wide_coords <- nlp %>%
filter(coord_pairs) %>%
inner_join(publications, by = "_gddid") %>%
select('gddid' = `_gddid`, sentence, word, year, title) %>%
mutate(word = gsub(',', ' ', word)) %>%
mutate(word = str_replace_all(word, '-LRB-', '(')) %>%
mutate(word = str_replace_all(word, '-RRB-', ')')) %>%
mutate(word = str_replace_all(word, '" "', ','))
if(docType == "docx") {
wide_coords %>%
select(-c('gddid', 'sentence')) %>%
head() %>%
flextable::flextable()
} else {
wide_coords %>%
select(-c('gddid', 'sentence')) %>%
datatable(escape = FALSE, rownames = TRUE, options = list(dom='t'))
}
```
**Table 2**. Sample sentences from the IRD corpus that contain matches to the coordinate regular expression rule.
Even here, we can see that many of these matches work, but that some of the matches are incomplete. There appears to be a much lower proportion of sites returned than we might otherwise expect. Given that there are `r length(unique(nlp$"_gddid"))` articles in the NLP dataset with matches to IRD related terms, it is surprising that only `r length(unique(wide_coords$gddid))` appear to support regex matches to coordinate pairs. We would expect that the description of sites or locations using coordinate pairs should be common practice. The observed outcome is likely to be, in part, an issue with the OCR/regex processing. A next iterative step would be to review potential matches more thoroughly to find additional methods of detecting the coordinate systems.
### Converting Coordinates
Given the geographic coordinate strings, we need to transform them into reliable latitude and longitude pairs with sufficient confidence to actually map the records. These two functions convert the GeoDeepDive (GDD) word elements pulled out by the regular expression searches into decimal degrees that can be placed on a map, allowing us to examine the described and extracted locations.
```{r, extract_coords}
convert_dec <- function(x, i) {
drop_comma <- gsub(',', '', x) %>%
substr(., c(1,1), nchar(.) - 1) %>%
as.numeric %>%
unlist
domain <- (str_detect(x, 'N') * 1 +
str_detect(x, 'E') * 1 +
str_detect(x, 'W') * -1 +
str_detect(x, 'S') * -1) *
drop_comma
publ <- match(nlp$`_gddid`[i], publications$`_gddid`)
point_pairs <- data.frame(sentence = nlp$sentence[i],
string = nlp$word[i],
lat = domain[str_detect(x, 'N') | str_detect(x, 'S')],
lng = domain[str_detect(x, 'E') | str_detect(x, 'W')],
publications[publ,],
stringsAsFactors = FALSE) %>%
rename(gddid = X_gddid)
return(point_pairs)
}
convert_dm <- function(x, i) {
# We use the `i` index so that we can keep the coordinate outputs from the
# regex in a smaller list.
dms <- data.frame(deg = as.numeric(x[,2]),
min = as.numeric(x[,3]) / 60,
sec = as.numeric(x[,4]) / 60 ^ 2,
stringsAsFactors = FALSE)
dms <- rowSums(dms, na.rm = TRUE)
domain <- (str_detect(x[,5], 'N') * 1 +
str_detect(x[,5], 'E') * 1 +
str_detect(x[,5], 'W') * -1 +
str_detect(x[,5], 'S') * -1) *
dms
publ <- match(nlp$`_gddid`[i], publications$`_gddid`)
point_pairs <- data.frame(sentence = nlp$sentence[i],
string = nlp$word[i],
lat = domain[x[,5] %in% c('N', 'S')],
lng = domain[x[,5] %in% c('E', 'W')],
publications[publ,],
stringsAsFactors = FALSE) %>%
rename(gddid = X_gddid)
return(point_pairs)
}
coordinates <- list()
coord_idx <- 1
for (i in 1:length(decdeg)) {
if ((length(decdeg[[i]]) %% 2 == 0 |
length(degmin[[i]]) %% 2 == 0) & length(degmin[[i]]) > 0) {
if (any(str_detect(decdeg[[i]], '[NS]')) &
sum(str_detect(decdeg[[i]], '[EW]')) == sum(str_detect(decdeg[[i]], '[NS]'))) {
coordinates[[coord_idx]] <- convert_dec(decdeg[[i]], i)
coord_idx <- coord_idx + 1
}
if (any(str_detect(degmin[[i]], '[NS]')) &
sum(str_detect(degmin[[i]], '[EW]')) == sum(str_detect(degmin[[i]], '[NS]'))) {
coordinates[[coord_idx]] <- convert_dm(degmin[[i]], i)
coord_idx <- coord_idx + 1
}
}
}
coordinates_df <- coordinates %>% bind_rows %>%
mutate(sentence = gsub(',', ' ', sentence)) %>%
mutate(sentence = str_replace_all(sentence, '-LRB-', '(')) %>%
mutate(sentence = str_replace_all(sentence, '-RRB-', ')')) %>%
mutate(sentence = str_replace_all(sentence, '" "', ','))
coordinates_df$doi <- coordinates_df$identifier %>% map(function(x) x$id) %>% unlist
```
```{r leaflet_plot, echo=FALSE}
if (docType == 'docx') {
library("rnaturalearth")
library("rnaturalearthdata")
world <- ne_countries(scale = "medium", returnclass = "sf")
outplot = ggplot(data = world) +
geom_sf(fill="#a9a9a9", color="#2b2d2faa") +
geom_point(data = coordinates_df, aes(x = lng, y = lat)) +
theme_tufte() +
xlab('') +
ylab('') +
coord_sf(xlim=c(-100,150), ylim=c(-80,77)) +
theme(
panel.border = element_rect(colour = "black", fill=NA, size=1)
)
ggsave(outplot, file="images/mapplot.svg")
outplot
} else {
leaflet(coordinates_df) %>%
addProviderTiles(providers$Esri.WorldImagery) %>%
addCircleMarkers(popup = paste0('<b>', coordinates_df$title, '</b><br>',
'<a href=https://doi.org/',
coordinates_df$doi,'>Publication Link</a><br>',
'<b>Sentence:</b><br>',
'<small>',
gsub(',', ' ', coordinates_df$string),
'</small>'))
}
```
**Figure 3**: *Map depicting sites that mention IRD and contain coordinate information and an IRD event. Each dot can be clicked to pull up the title of the paper, the link to the publication, and the sentence(s) containing spatial coordinates and other relevant IRD information. Out of the 150 papers in the test dataset, we found 11 papers with extractable spatial coordinates, comprising a total of 30 coordinate pairs (references included in the leaflet map: [@baeten2014origin;@cofaigh2001late;@stoner1994high;@simon2013millennial;@seki2005decreased;@rosell1997biomarker;@rea2016magnetic;@prokopenko2002stability;@kaboth2016new;@ikehara2007millennial;@incarbona2008holocene]).*
After cleaning and subsetting the corpus, we find `r length(unique(coordinates_df$doi))` papers with `r nrow(coordinates_df)` coordinate pairs out of `r nrow(publications)` documents in the IRDDive test dataset (Figure 3). This test suggests further improvements that can be made to the current methods. First, in some cases, we find papers where IRD is simply referenced, and the paper does not report primary data. Second, in some cases we find IRD but no coordinates or other core metadata; some papers simply do not contain coordinate information. Third, some papers mention IRD in the core data for continental cores (see Fig. 3 Central Asia location). These might possibly be valid instances of IRD in lacustrine deposits [@smith2000diamictic], or might represent papers that mention IRD without representing primary data. One solution would be to cross-reference the returned IRD coordinates with the location of continents (past or present) and remove coordinate pairs that fall within the continental boundaries. A fourth and last step would be to further refine the regex to obtain additional mentions of IRD, e.g. as 'IRD-rich layers', 'IRD-rich deposits', 'Heinrich layers', etc.
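A sketch of the land-masking idea, using the `sf` package and the `rnaturalearth` country polygons already used for the static version of Figure 3 (note this checks only present-day coastlines, and the column names follow `coordinates_df` above):
```{r landMaskSketch, eval=FALSE}
library(sf)
library(rnaturalearth)

world <- ne_countries(scale = "medium", returnclass = "sf")

# Convert the extracted coordinate pairs to spatial points and flag any
# that intersect present-day land polygons.
points <- st_as_sf(coordinates_df, coords = c("lng", "lat"), crs = 4326)
on_land <- lengths(st_intersects(points, world)) > 0

marine_only <- coordinates_df[!on_land, ]
```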
### Extracting ages and age ranges
The next step is to extract ages and age ranges that may be associated with IRD events. This requires building regular expressions that capture dates reported with many different naming conventions (e.g., years BP, kyr BP, ka BP, a BP, Ma BP, *etc*.).
```{r, getDates, eval=TRUE}
is_date <- str_detect(nlp$word, ",BP,")
is_range <- str_detect(nlp$word, "(\\d+(?:[.]\\d+)*),((?:-{1,2})|(?:to)),(\\d+(?:[.]\\d+)*),([a-zA-Z]+,BP),")
date_range <- str_extract_all(nlp$word,
"(\\d+(?:[.]\\d+)*),((?:-{1,2})|(?:to)),(\\d+(?:[.]\\d+)*),([a-zA-Z]+,BP),") %>%
map(function(x) {
data.frame(range = paste(x, collapse = ";"),
stringsAsFactors = FALSE)
}) %>%
bind_rows() %>%
mutate(gddid = nlp$`_gddid`,
sentence = nlp$sentence,
words = nlp$word)
# Extract the ranges within sentences. If there are multiple ranges then
# collapse them into a single string using a semi-colon.
number <- str_extract_all(nlp$word,
",(\\d+(?:[\\.\\s]\\d+){0,1}),.*?,yr.*?,") %>%
map(function(x) {
data.frame(age = paste(x, collapse = ";"),
stringsAsFactors = FALSE)
}) %>%
bind_rows() %>%
mutate(gddid = nlp$`_gddid`,
sentence = nlp$sentence) %>%
left_join(x = ., y = date_range, by = c('gddid', 'sentence'))
```
```{r getTotalLines, echo=FALSE}
total_count <- number %>%
group_by(gddid) %>%
summarise(n = n()) %>%
full_join(x = ., y = number, by = c('gddid')) %>%
mutate(paper_loc = sentence / n) %>%
select(gddid, sentence, age, range, words, paper_loc)
```
When we apply the code to the documents in the IRD corpus we begin to see patterns of dates (Table 3). A limitation of the regular expression presented here is that the expression for the variable `is_range` only matches `BP` ages preceded by a unit (a, ka, Ma). As currently defined, `is_range` allows only one term between the matched numbers and the string `BP`, so more detailed methods would be needed to capture all age descriptors and types; ultimately, multiple matching terms are likely required to find the breadth of age terms. In this simple example, the extracted date ranges capture only late-Pleistocene dates (Figure 4; most < 40 ka BP). An examination of the sentences containing the described ranges shows that many of them refer to specific periods within the Pleistocene or Holocene ("Early Holocene wet period", "HS1", "Mid-Holocene"), which explains, in part, the high number of ranges that describe bounding ages; for example, the time around the beginning of the Holocene and the end of the Younger Dryas may be double counted, resulting in the spike in references for 11.0-11.5 ka BP in Figure 4. The lack of Pleistocene IRD references in the corpus is likely a result of under-representation in our smaller sample of the dataset, rather than an issue with the method as described.
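One direction for refinement is to relax the unit requirement in the range expression, for example by allowing an optional `cal` token before the unit (a sketch; this pattern is an assumption and would need validation against the corpus):
```{r broaderRangeSketch, eval=FALSE}
library(stringr)

# As in `is_range`, but with an optional "cal," token before the unit.
broader_range <- "(\\d+(?:[.]\\d+)*),((?:-{1,2})|(?:to)),(\\d+(?:[.]\\d+)*),((?:cal,)?[a-zA-Z]+,BP),"

str_detect("between,11.5,-,12.9,cal,yr,BP,", broader_range)
str_detect("between,11.5,-,12.9,ka,BP,", broader_range)
```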
```{r dateRange, echo=FALSE}
if(docType == 'docx') {
date_range %>%
arrange(sample(1:nrow(date_range))) %>%
filter(!range == "") %>%
distinct(range, .keep_all=TRUE) %>%
head() %>%
flextable::flextable()
} else {
date_range %>%
arrange(sample(1:nrow(date_range))) %>%
filter(!range == "") %>%
distinct(range, .keep_all=TRUE) %>%
datatable(escape = FALSE, rownames = TRUE, options = list(dom='t'))
}
```
**Table 3**. *A sample of recovered age ranges reported in the NLP corpus that are associated with IRD records, based on regular expression captures. The sentence number within a paper provides some indication of whether the age range identifies broader global patterns of change (age ranges defined in the paper's Introduction) or elements specific to the current study.*
At this stage of the workflow, we have successfully identified instances of dates in the papers where there are references to IRD (Table 3), but we do not know whether these ages are associated with specific IRD events. Nonetheless, when we map the data (Figure 3), we observe clusters that are consistent with IRD distributions found in the primary literature, but with some notable exceptions (e.g., over land). A next step would be to begin to match the dates to specific units.
```{r, rangePlot, echo = FALSE, fig.width=6, dev='svg'}
rangeseq <- total_count$range %>%
map(function(x){
strsplit(x, ';')}) %>%
unlist() %>%
map(function(x){
strsplit(x,',') %>%
unlist() %>%
as.numeric %>% na.omit()
}) %>%
map(function(x){
seq(floor(sort(x)[1] * 2)/2, floor(sort(x)[2]*2)/2, by = .5)}) %>%
unlist()
ggplot(data = data.frame(year = rangeseq),
aes(x = year)) +
geom_histogram(binwidth=0.5, color="gray30", fill = "gray70") +
coord_cartesian(xlim=c(0, 41), ylim=c(0, 17), expand = FALSE) +
scale_x_continuous(breaks = seq(0,40, by = 5)) +
scale_y_continuous(breaks = seq(0,16, by = 2)) +
ggthemes::theme_tufte() +
theme(axis.title.x=element_text(size=18, face="bold.italic"),
axis.title.y=element_text(size=18, face="bold.italic"),
axis.text.x=element_text(size=14, face="bold"),
axis.text.y=element_text(size=14, face="bold")) +
xlab("Thousands of years before present (ka BP)") +
ylab("Described ranges including date (n = 81)")
```
**Figure 4**. *Incidence of time periods covered by ranges described in papers (e.g., "5 - 7 ka BP"). Individual papers may report multiple time ranges, and not all time ranges reported may be directly related to Ice-Rafted Debris events.*
One avenue for further exploration is to use the sentence position within the document to provide context for a retrieved geographic coordinate or age. For example, the distribution of coordinates within papers shows a well-defined pattern (Figure 5). Coordinates are generally presented in the Abstract, Introduction, or Methods section, but rarely elsewhere. This differs from the distribution of ages and age ranges, which appear throughout papers, although age ranges tend to be most frequent in the Results, Discussion, and Conclusions. Age information also often appears in the titles of papers listed in the References section. This is a promising avenue for further study.
Of course, while the methods presented here do extract ages and geographic locations, this alone is not sufficient to fully understand the underlying processes. Rather, this demonstration shows how these new capabilities can enable unprecedentedly detailed and powerful searches of the scientific literature, accelerating the pace and scale of scientific synthesis and insight.
```{r ggplotAges, echo = FALSE, fig.width=6, dev='svg'}
position <- total_count %>%
mutate(abstract = str_detect(words, "Abstract,([A-Z]|\\.\\})"),
methods = str_detect(words, "Methods,([A-Z]|\\.\\})"),
refs = str_detect(words, "References,([A-Z]|\\.\\})"),
disc = str_detect(words, "Discussion,([A-Z]|\\.\\})"),
concl = str_detect(words, "Conclusion,([A-Z]|\\.\\})"),
intro = str_detect(words, "Introduction,([A-Z]|\\.\\})"),
results = str_detect(words, "Results,([A-Z]|\\.\\})")) %>%
select(-sentence, -age, -range) %>%
tidyr::pivot_longer(-c("paper_loc", "words", "gddid")) %>%
filter(value == TRUE) %>%
mutate(ranker = match(name, c('abstract', 'intro', 'methods', 'results', 'disc', 'concl', 'refs'))) %>%
group_by(gddid) %>%
mutate(order = sort(ranker)) %>%
filter(order <= ranker) %>%
select(c(-order, ranker)) %>%
group_by(name) %>%
summarise(mean = mean(paper_loc),
sd = sd(paper_loc)) %>%
arrange(mean) %>%
mutate(xmin = ifelse((mean - sd) < 0, 0, mean - sd),
xmax = mean + sd,
ymin = seq(14.8, 12.2, length.out = 7),
ymax = seq(14.8, 12.2, length.out = 7) + 0.2) %>%
mutate(name=factor(name, levels=c('abstract', 'intro', 'methods', 'results', 'disc', 'concl', 'refs')))
locations <- coordinates_df %>%
mutate(sentence = as.numeric(sentence)) %>%
select(gddid, sentence, year) %>%
full_join(x = total_count,
y = ., by=c("gddid", "sentence")) %>%
mutate(age = ifelse(age == "", NA, age),
range = ifelse(range == "", NA, range),
interval = round(paper_loc * 25, 0) / 25) %>%
group_by(interval) %>%
summarise(age = sum(!is.na(age)),
age_range = sum(!is.na(range)),
coordinate = sum(!is.na(year))) %>%
tidyr::gather(match, counts, -interval)
plotter <- ggplot() +
geom_bar(data = locations,
aes(x = interval, y = counts, fill = match),
stat='identity', color = 'gray50',
orientation='x', alpha = 0.3,
position='dodge') +
geom_smooth(data = locations,
aes(x = interval, y = counts, color = match),
size=2, se=FALSE,
method='gam', method.args = list(family = "poisson"), formula=y~s(x, k = 5)) +
viridis::scale_color_viridis('Regex Match', discrete = TRUE) +
viridis::scale_fill_viridis('Regex Match', discrete = TRUE) +
ggnewscale::new_scale("fill") +
geom_rect(data = position,
aes(fill = name, xmin = xmin, ymin = ymin, xmax = xmax, ymax = ymax),
color = 'gray50') +
scale_fill_brewer('Paper Element', type = 'qual') +
coord_cartesian(xlim=c(0,1), ylim=c(0, 15), expand = FALSE) +
ggthemes::theme_tufte() +
xlab('Position in Paper (start: 0; end: 1)') +
ylab('Count in Corpus') +
theme(axis.title.x=element_text(size=18, face="bold.italic"),
axis.title.y=element_text(size=18, face="bold.italic"),
axis.text.x=element_text(size=14, face="bold"),
axis.text.y=element_text(size=14, face="bold"))
```
<img src="images/fig5_plot.svg">
**Figure 5**: *Relative location of ages and spatial coordinates reported in GDD documents (i.e., are ages or spatial coordinates generally located at the beginning, middle, or end of a paper?). The line segments above represent the approximate position of the term "Abstract", "Methods", etc. in papers, not the extent of those terms.*
## Conclusions
Here we have provided a model workflow to obtain coordinate and age information associated with IRD deposits, a GitHub repository for open code development and sharing, and an R toolkit ([http://github.com/EarthCubeGeochron/geodiveR]()). The specific case study showcased here is designed to help researchers study the dynamics of ice sheets over the last 5 million years, while GeoDeepDive and the post-processing analytical workflows shown here will be helpful to anyone wanting to query for spatial and temporal information. Researchers can apply their own search logic to import their own data, and output text and coordinates relevant to their questions. The GitHub repository and R package can act as building blocks that serve researchers not only in the geosciences, but in allied disciplines as well.
## Data Availability
All data are publicly available at [http://github.com/EarthCubeGeoChron/IRDDive]() and [http://github.com/EarthCubeGeochron/geodiveR]().
## Acknowledgements
Funding for this research was provided by the National Science Foundation (#EAR-1740694 to Goring, Marcott, Meyers, Peters, Singer, and Williams). The authors also acknowledge I. Ross, J. Czaplewski, A. Zaffos, E. Ito, and J. Husson for helpful discussions.
## Author Contributions
All authors contributed to study design. S.J.G., J.M., and S.Y. performed analysis and developed the GitHub repository and R package. J.M., S.J.G., and S.A.M. wrote the paper with input from the rest of the authors.
## References