index.Rmd

---
title: "Arctic Data Center Semantic Annotation Review, 2020"
knit: (function(input_file, encoding) {
  out_dir <- 'docs';
  rmarkdown::render(input_file,
 encoding=encoding,
 output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
author: "Sam Csik"
date: "10/12/2020"
output: 
  html_document:
    toc: true
    toc_float: true
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r, echo = FALSE, warning = FALSE, message = FALSE, results = 'hide'}
##############################
# load packages
##############################

source(here::here("code", "00_libraries.R"))

##############################
# load previous analyses
##############################

source(here::here("code", "02a_initial_exploration.R"))
source(here::here("code", "02b_identifying_ACADIS_data.R"))
source(here::here("code", "02c_exploration_noAttributes.R"))
source(here::here("code", "06_semAnnotation_assessment.R"))
source(here::here("code", "07_nonAnnotated_attribute_assessment.R"))
```

## **Outline:**

  1. ADC semantic annotation overview/goals
  2. Summary of annotations efforts, to date
      * Table 1: annotation counts, listed by ontology/source
      * Table 2^+^: attribute information for non-resolvable valueURIs
  3. Summary of most commonly used semantic annotations
      * Figure 1: most commonly used attribute-level annotations across the ADC corpus
      * Figure 2: most common attribute-level annotations used in data packages uploaded pre vs. post-August 2020
      * Table 3^+^: counts of (i) attribute-level annotations used across the ADC corpus (total, pre-Aug. 2020, post-Aug. 2020), (ii) unique package identifiers each annotation is used in, and (iii) unique authors that used each annotation
  4. Summary of attributes that are not being annotated
      * Figure 3: most common nonannotated attributeNames (unnested, individual tokens)
      * Table 4^+^: most common nonannoated individual attributeName tokens 
  5. Raw data & code

    *^+^downloadable data available below the interactive table*

------

## **1. Overview**

In order to improve data discoverablity within the Arctic Data Center (ADC), the datateam is beginning to incorporate the addition of semantic annotations into the data curation process. Doing so offers a way to standardize the diverse descriptions of data used by researchers across disciplines by attaching terms from controlled vocabularies. The use of semantic annotations provides not only definitions of concepts, but also shows the relationships between different terminology. 

Dr. Steven Chong led the first major effort to implement semantic search within the ADC, beginning in 2017, by building out ontological terms pertaining to carbon cycling. You can read more about Dr. Chong's efforts, the ADC's semantic search product, and its vision moving forward in [this blog post](https://arcticdata.io/blog/2019/12/improving-information-retrieval-the-arctic-data-center-unveils-new-semantic-search-product/). More recently (as of about August 1, 2020), the datateam began making a second push to add semantic annotations to attributes for all incoming data packages to the ADC.
    
The ADC datateam is currently instructed to add annotations from four main ontologies (the following text was borrowed from the [NCEAS Datateam Training](https://nceas.github.io/datateam-training/training/editing-eml.html#semantic-annotations)):

* [The Ecosystem Ontology (ECSO)](http://bioportal.bioontology.org/ontologies/ECSO/?p=summary)
    * this was developed at NCEAS, and has many terms that are relevant to ecosystem processes, especially those involving carbon and nutrient cycling
* [The Environment Ontology (ENVO)](http://bioportal.bioontology.org/ontologies/ENVO/?p=summary)
    * this is an ontology for the concise, controlled description of environments
* [National Center for Biotechnology Information (NCBI) Organismal Classification (NCBITAXON)](http://bioportal.bioontology.org/ontologies/NCBITAXON/?p=summary)
    * The NCBI Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases.
* [Information Artifact Ontology (IAO)](http://bioportal.bioontology.org/ontologies/IAO/?p=summary)
    * this ontology contains terms related to information entities (eg: journals, articles, datasets, identifiers)
    
Here, I explore the current ADC corpus to summarize our progress in implementing semantic search and identify areas for improvement/further consideration.

------

## **2. Summary of annotation efforts**

**How many datapackages have annotations?**

* As of October 12, 2020, the Arctic Data Center contains **`r num_total_datapackages`** data packages (NOTE: a data package consists of a publically-available metadata record, which may be packaged with one or more data files). Of those, **`r num_datapackages_with_attributes`** contain data file types that have associated attributes (i.e. variables). Currently, **`r num_datapackages_with_annotations`** data packages have at least one semantically-annotated attribute. See section 5 for an overview of datasets that do not contain any attribute information.

**How many attributes have been annotated, and when were these added?**

* The majority of attributes in those **`r num_datapackages_with_annotations`** datapackages are annotated (**`r tot_num_annotated_attributes`/`r tot_num_attributes`**), most of which were added during Dr. Chong's tenure at the ADC (**`r tot_num_annotations_preAug2020`/`r tot_num_annotated_attributes`**, as compared to the **`r tot_num_annotations_postAug2020`/`r tot_num_annotated_attributes`** that have been added by the datateam since August 2020).

**Which ontology(ies) are the majority of annotations coming from?**

* The *vast* majority of semantic annotations come from The Ecosystem Ontology ([ECSO](https://bioportal.bioontology.org/ontologies/ECSO)) (**`r tot_ECSO_annotations`/`r tot_num_annotated_attributes`**, or **`r perc_ECSO`%**). The remaining come from [CHEBI](https://bioportal.bioontology.org/ontologies/CHEBI), [ENVO](https://bioportal.bioontology.org/ontologies/ENVO), and [Wikipedia](https://en.wikipedia.org/wiki/Main_Page). See details in table below.

```{r, echo = FALSE, warning = FALSE, message = FALSE}
datatable(Table1_annotations_counts,
          class = 'cell-border stripe',
          colnames = c("Ontology/Source", "Total # of Annotations", "# of Unique Annotations"),
          caption = htmltools::tags$caption(
            style = 'caption-side: top; text-align: left;', 'Table 1: ', 
            htmltools::em('Ontologies/sources of attribute-level annotations present in the ADC corpus. Total annotation counts and counts of unique annotations are listed for each source.')),
          options = list(pageLength = 5, autoWidth = TRUE))
```

**Which annotations are non-resolvable?**

* Notice that 110 annotated attributes in the ADC corpus have non-resolvable valueURIs. Among those are only three unique valueURIs, listed here:
    * [http://www.purl.dataone.org/odo/ECSO_00010077](http://www.purl.dataone.org/odo/ECSO_00010077)
    * [http://www.purl.dataone.org/odo/ECSO_00010114](http://www.purl.dataone.org/odo/ECSO_00010114)
    * [http://www.purl.dataone.org/odo/ECSO_00010076](http://www.purl.dataone.org/odo/ECSO_00010076)

See additional details regarding these three non-resolvable URIs in **Table 2**, below:    
```{r, echo = FALSE, warning = FALSE, message = FALSE}
datatable(nonResolve_annotations,
          class = 'cell-border stripe',
          colnames = c("package identifier", "valueURI", "attributeName", "attributeDefinition"),
          caption = htmltools::tags$caption(
            style = 'caption-side: top; text-align: left;', 
            htmltools::em('Table 2: Attribute information associated with non-resolvable valueURIs')),
          filter = 'top', options = list(pageLength = 5, autoWidth = TRUE))
```
<span style="color: darkred;">**NOTE:**</span> You can download **Table 2** as a .csv file [here](https://drive.google.com/file/d/1pub4pc3JaJwyYD3nHx1nl9XyZ-9Trand/view?usp=sharing)

------

## **3. Which semantic annotations are most commonly used at the attribute level?**

The most common semantic annotations used across all ACD metadata records are visualized in **Figure 1**, below (for sake of space, only terms used more than 20 times are included in Fig.1). These include terms such as *soil temperature* (used a total of 439 times), *relative species abundance* (used 305 times), and *air temperature* (used 283 times). You can explore (and download) the associated data file containing *all* semantic annotations (i.e. not just those used >20 times) currently included in ADC metadata records in **Table 3**. Dropping the valueURI into your web browswer will take you to the semantic annotation, where you can learn more about its description and relationship to other terms.
```{r, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 16, fig.width = 12}
##############################
# FIGURE 1: all semantic annotations added to attributes (>20 n)
##############################

semAnnotation_freq_plot
```

I've broken this down a bit further into annotations added prior to August 1, 2020 (i.e. those that were added as a part of Dr. Chong's efforts; **Fig.2a**) vs. those added on or after August 1, 2020 (i.e. the more recent additions made by the ADC datateam since incorporating annotations into the data curation workflow; **Fig.2b**). For example, *soil temperature* (the most frequently used annotation overall; see **Fig.1**), was primarily assigned to attributes during Dr. Chong's efforts and has been used less frequently in the more recent annotation efforts. You can find these corresponding values in **Table 3**, in the `pre-2020-08-01 counts` and `post-2020-08-01 counts` columns.

```{r, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 16, fig.width = 20}
##############################
# FIGURE: 
##############################

prepostAug2020_freq_plot
```

```{r, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 24, fig.width = 20}
##############################
# TABLE 3: all semantic annotations added to attributes
##############################

datatable(Table3_annotation_counts, 
          class = 'cell-border stripe',
          colnames = c("semantic annotation (preferred name)", "semantic annotation (valueURI)", "total counts", "pre-2020-08-01 counts", "post-2020-08-01 counts", "unique package identifiers", "unique authors"),
          caption = htmltools::tags$caption(
            style = 'caption-side: top; text-align: left;', 'Table 3: ', 
            htmltools::em('Most common semantic terms used to annotate datapackage attributes in the ADC corpus. Included are the total counts of each annotation (as well as the number of times each annotation was used pre- and post-Aug. 2020), the number of unique package identifers that that annotation is found in, and the number of unique authors who used that annotation. These data may be important for determining which annotations are (i) being broadly used many times (e.g. "latitude coordinate"), (ii) being used many times but in just one/few different data packages (e.g. "bird count") or (iii) used many times by only one/few author(s) across many data packages (e.g. "pressure measurement type"); may be indicative of nested data packages')),
          filter = 'top', options = list(pageLength = 5, autoWidth = TRUE))
```
<span style="color: darkred;">**NOTE:**</span> You can download **Table 3** as a .csv file [here](https://drive.google.com/file/d/110laRxGBGC0l3AYTzkVD20D8ptxgrEU0/view?usp=sharing).

-----

## **4. Which attributes are not getting annotated?**

While annotating each and every attribute within a data package is an ultimate goal, it is currently a time-intensive process for the datateam. As such, datateam members target the most "semantically-important" attributes to annotate and leave "less-important" attributes (terms that are less likely to be searched on; e.g. datetime, latitude, longitude) unannotated. 

Recall that there are **`r num_datapackages_with_annotations`** ADC data packages containing annotations, and across those packages, a total of **`r tot_num_attributes`** attributes. While part 3 explores the **`r tot_num_annotated_attributes`** attributes that have been semantically annotated, here I summarize the remaining **`r tot_num_nonannotated_attributes`** attributes that did not receive annotations.

A goal is to assess if these non-annotated attributes are (a) terms that datateam members are intentionally skipping for sake of time (e.g. "less semantically important" attributes), (b) terms that really should have an annotation and got skipped accidentally, or (c) terms that got skipped because there is currently no appropriate semantic annotation to describe them.

The most common non-annotated attributeNames (individual tokens) are visualized in **Figure 3** and explored in **Table 4** below. **IMPORTANT NOTE: the attributeNames on the y-axis are actually unnested individual tokens (singular words), meaning all attributeNames were separated into individual words (e.g. `sep = " "`) during the text mining process. For example, "depth" (the second most common term; counts = 71), may exist in the ADC corpus as attributeName = "depth", "soil depth", "snow depth", etc. Parsing these will require some further analyses.**

I've manually assigned terms to some general categories (see legend) -- <span style="color: #009E73;">location terms</span> (e.g. "position", "latitude", "top", "site", etc.), <span style="color: #CC79A7;">temporal terms</span> (e.g. "date", "phase"), and <span style="color: #F0E442;">QC/Confidence terms</span> are likely being intentionally skipped for sake of time (see instructions in [Data Team Training Part 4.8.2 #1](https://nceas.github.io/datateam-training/training/editing-eml.html#semantic-annotations)), whereas <span style="color: #E69F00;">measurements</span> (e.g. "depth", "height") and <span style="color: #56B4E9;">environmental materials</span> (e.g. "soil", "snow", "ice") likely have an appropriate annotation match (ECSO & ENVO) and could be annotated. 

```{r, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 15, fig.width = 20}
##############################
# FIGURE: 
##############################

attributeName_pretty

##############################
# table
##############################

datatable(attributeNameIndivTokens,
          class = 'cell-border stripe',
          colnames = c("individual attributeName tokens (single, unnested terms)", "counts", "unique IDs", "unique authors"),
          caption = htmltools::tags$caption(
            style = 'caption-side: top; text-align: left;', 'Table 4: ', 
            htmltools::em('Most common individual attributeName tokens left unannotated (NOTE: a token may be a standalone attributeName (e.g. "depth") or may have been a part of a multi-word attributeName (e.g. "snow depth")). In addition to the total counts of each token, the number of unique package identifers that that token is found in & the number of unique authors who used that token in an attributeName are also included. These data may be important for determining which tokens are (i) being broadly used many times (e.g. "date"), (ii) being used many times but in just a few different data packages (e.g. "position") or (iii) used many times by only one author across many data packages (possibly indicative on nested data packages; e.g. "habitat_type")')),
          filter = 'top', options = list(pageLength = 5, autoWidth = TRUE))
```
<span style="color: darkred;">**NOTE:**</span> You can download **Table 4** as a .csv file [here](https://drive.google.com/file/d/1z7ODVyh_mgC7ze1LlpRa6Ly70CJICwvM/view?usp=sharing).

Additionally, I've asked the datateam to add any attributes that they cannot find an appropriate annotation for to this [GoogleDrive Sheet](https://docs.google.com/spreadsheets/d/1ULEwESDbepuPs2xGKrZMyWYW7BHOLmGi4C8luPgbsrQ/edit?usp=sharing). They're accumulating *very slowly* (which is hopefully a good sign that most attributes have an appropriate annotation match).

-----

## **5. Exploration of datapackages that do not contain attribute information**

In Part 2, we see that the ADC corpus consists of **`r num_total_datapackages`** total datapackages, **`r num_datapackages_with_attributes`** of which have attribute information described in their metadata records according to solr query returns on 2020-10-12. Here, we further explore the remaining **`r num_datapackages_without_attributes`** datapackges that the solr query did not return attribute information for.

First, we identified **'r total_num_ACADIS_datapackages'** [ACADIS](https://www.eol.ucar.edu/field_projects/acadis) datapackages, which were ingested by the ADC in March 2016 and therefore were not curated by (i.e. metadata and associated attribute information was not constructed by) the ADC. After accounting for these datapackages, we are left with **`r num_nonACADIS_data_no_attributes`** datapackages for which attribute information is not returned in solr queries. You can explore these datapackages in **Table 5** below:

```{r, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 15, fig.width = 20}
##############################
# table 5
##############################

datatable(table5,
          class = 'cell-border stripe',
          caption = htmltools::tags$caption(
            style = 'caption-side: top; text-align: left;', 'Table 4: ', 
            htmltools::em('ADC datapackages for which attribute infomration is not returned in solr queries. These do not include any ACADIS datapackages (i.e. those uploaded to the ADC prior to 2020-03-21')),
          filter = 'top', options = list(pageLength = 5, autoWidth = TRUE))
```
<span style="color: darkred;">**NOTE:**</span> You can download **Table 5** as a .csv file [here](https://drive.google.com/file/d/1gnDf8VTknPX8st9C9W7VlYNBYJ9jB5o_/view?usp=sharing).

* **The following subsets of data are confirmed to have no attribute informaton:**
    * `r ks` datapackages by author, Kathleen Stafford, which contain acoustic data
    * `r kn` datapackages by author, Kim Nielsen, which contain airglow image data

* **The following subsets of data have file types which contain attributes, but that attribute information is not present in the metadata records:**
    * `r ip` NABOS II datapackages by author, Igor Polyakov, which have .txt and .csv files containing primarily moorning data
    * `r ds` datapackages by author, Dmitry Streletskiy, which have zip files containing a variety of file formats (image files, .pdf, .xls), some of which (e.g. .xls) have associated attribute information pertaining to thaw depth measurements

-----

## **6. Raw data, code, & other resources**

* Raw data:
    * solr query (2020-10-12) is downloadable [here](https://drive.google.com/file/d/1jQCcfYIDTD8FID40x77ga-2F5MbbPUO1/view?usp=sharing)
    * tidied attribute/annotation information extracted from 2020-10-12 solr query is downloadable [here](https://drive.google.com/file/d/1HE8aD1KJll0xJJc8M7y3LHVBHYm4tixG/view?usp=sharing)
* GitHub Repository with associated code, analyses, and data (this also includes analyses and data not explicitly covered in this report): [samanthacsik/NCEAS-DF-semantic-annoations-review](https://github.com/samanthacsik/NCEAS-DF-semantic-annotations-review)
* GitHub Repository for identifying attributes and their corresponding datapackage identifiers for mass annotation efforts [samannthacsik/NCEAS-DF-Semantics-Project](https://github.com/samanthacsik/NCEAS-DF-Semantics-Project)
*