Commit

Update supplement-header.md
Hope last of typos
Tasilee authored Nov 1, 2024
1 parent 34bf89c commit 94195a9
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions tg2/_review/build/templates/supplement/supplement-header.md
@@ -48,7 +48,7 @@ This document is relevant to curators, aggregators, data publishers, data analys

### 1.3 Associated Documents

-This document provides practical information that goes beyond the normative guidance in the [BDQ Core Users Guide](../guide/users/index.md) and [BDQ Core Implemeters Guide](../guide/implementers/index.md) for those who wish to use or implement BDQ Core.
+This document provides practical information that goes beyond the normative guidance in the [BDQ Core User's Guide](../guide/users/index.md) and [BDQ Core Implementer's Guide](../guide/implementers/index.md) for those who wish to use or implement BDQ Core.

### 1.4 Status of the content of this document

@@ -324,17 +324,17 @@ Given an a resource (an occurrence record) list all assertions produced by valid

## 3 Developing the Tests

-Originally, Biodiversity Information Standards (TDWG) Data Quality Task Group 2: Data Quality Tests and Assertions was tasked with finding a fundamental suite of tests and identifying any relevant software asociated with testing for 'Data Quality'/'Fitness for Use'. It was quickly realized however, that any software was likely to be far less stable than defining a CORE suite of tests and an associated framework, so the software component was quickly dropped. We also limitied the scope of BDQ Core Tests to apply only to data encoded using the Darwin Core standard (Weiczorek et al. 2012). This gave us a specific target, but also associated problems noted below.
+Originally, Biodiversity Information Standards (TDWG) Data Quality Task Group 2: Data Quality Tests and Assertions was tasked with finding a fundamental suite of tests and identifying any relevant software asociated with testing for 'Data Quality'/'Fitness for Use'. It was quickly realized however, that any software was likely to be far less stable than defining a CORE suite of tests and an associated framework, so the software component was quickly dropped. We also limitied the scope of BDQ Core Tests to apply only to data encoded using the Darwin Core standard (Wieczorek et al. 2012). This gave us a specific target, but also associated problems noted below.

-Finding out what tests were being used by a range of biodiversity data aggregators was our first step in identifying likely candidates. We identified, aggregated and described 152 unique tests from GBIF, ALA, iDigBio, CRIA and BISON into a consisent structure for comparison and evaluation. Descriptors at this time included Information Elements (the [Darwin Core Terms](https://dwc.tdwg.org/list/) (Darwin Core Maintenance Group 2021) used by the Tests), Specification (a technical description aimed at implementation), Darwin Core Class, Source of the test and References. We peformed a full evaluation of each candidate test using the seven citeria noted in Section 2.1. Tests had to be informative in being able to evaluate or enhance the quality of a data record. Tests had to be relatively simple/straight forward to implement with existing tools. Tests were mandatory for any potential enhancements to record values in that a Validation Test was required before any Amendment Test. Tests required power in that they will not likely result in 0% or 100% of all record passing or failing the test.
+Finding out what tests were being used by a range of biodiversity data aggregators was our first step in identifying likely candidates. We identified, aggregated and described 152 unique tests from GBIF, ALA, iDigBio, CRIA and BISON into a consisent structure for comparison and evaluation. Descriptors at this time included Information Elements (the [Darwin Core Terms](https://dwc.tdwg.org/list/) (Darwin Core Maintenance Group 2021) used by the Tests), Specification (a technical description aimed at implementation), Darwin Core Class, Source of the test and References. We performed a full evaluation of each candidate test using the seven citeria noted in Section 2.1. Tests had to be informative in being able to evaluate or enhance the quality of a data record. Tests had to be relatively simple/straight forward to implement with existing tools. Tests were mandatory for any potential enhancements to record values in that a Validation Test was required before any Amendment Test. Tests required power in that they will not likely result in 0% or 100% of all record passing or failing the test.

BDQ Core Tests were designed to provide an adequate coverage of basic information dimensions of Darwin Core: dwc:Taxon (GitHub tag NAME); dwc:Event (GitHub tag TIME) and dcterms:Location (GitHub tag SPACE), and a category that we called "Other" (GitHub tag OTHER) to cover Tests on dwc terms such as dc:license (see Section 3.1). Tests also had to be widely applicable across a range of use cases. Tests that were identified as useful in a limited context were documented and were (GitHub) tagged as "Supplementary" in that they could be implemented by a community of usage.

We originally rendered the Tests in the form that flagged a **FAIL**, for example a dwc:eventDate that did not conform to ISO 8601-1 date. Our reasoning was this strategy aligned with all of the sources of the Tests in that we all sought to identify **issues** with values in the record that would reduce its quality. However, the Data Quality Framework (Veiga 2016, Veiga et al. 2017) worked in the opposite direction: Identifying values in a record that **PASSED** a Test; increased 'quality'. To align with the Framework, we renamed all BDQ Core Tests from FAIL to PASS type, for example, COUNTRYCODE_NOTSTANDARD became COUNTRYCODE_STANDARD. This reversal of 'fail' to 'pass' was also reflected in the comparison of the Framework's 'Data Quality Dimension' versus our early concept of 'Warning Type' (see Section 3.2).

Second and subsequent evaluations of the candidate BDQ Core Tests reduced the number to about 100 that seemed to fulfil the criteria above. Tests came and went as we provided more consistent and comprehensive documentation against what we called the Test Descriptors. The Tests also changed as we began to implement them. We modified a Test Specification to then find that we would not be able to implement it due to potential ambiguous responses from the Test or that a Test response may be misleading. By far the greatest changes to the candidate tests came about when we implemented them and ran them against the Test Validation Data (see the [Implementer's Guide](../guide/implementers/index.md#81-introduction-non-normative)).

-At one point, we aligned the documetation for over sixty tests that were tagged in GitHub as Supplementary, Immature/Incomplete and DO NOT IMPLEMENT. In doing so, we realized that the consistent documentation now provided a more nuanced evaluation and subsequently moved a number of these tests back into BDQ Core. The opposite was also true: The implementation of the Tests and running against the validation test data clearly demonstrated that some Tests were removed from BDQ Core. Where there were recognized nuances with the Tests that may not be obvious from the Specification, we documented the issues in the Test Notes.
+At one point, we aligned the documentation for over sixty tests that were tagged in GitHub as Supplementary, Immature/Incomplete and DO NOT IMPLEMENT. In doing so, we realized that the consistent documentation now provided a more nuanced evaluation and subsequently moved a number of these tests back into BDQ Core. The opposite was also true: The implementation of the Tests and running against the validation test data clearly demonstrated that some Tests were removed from BDQ Core. Where there were recognized nuances with the Tests that may not be obvious from the Specification, we documented the issues in the Test Notes.

The team identified a fundamental problem early in the development of the Tests: Darwin Core lacked a comprehensive suite of controlled vocabularies. Testing for 'quality' or 'fitness for use' was made difficult at best and impossible at worst, when controlled vocabularies were unavailable. We recognized the key issue of openness of the Darwin Core standard, yet the need for controlled vocabularies to evaluate and improve the 'quality' of Darwin Core encoded data was also important. This conclusion effectively initiated Data Quality Task Group 4: Best Practices for Development of Vocabularies of Value (https://www.tdwg.org/community/bdq/tg-4/) to provide a framework for how these vocabularies could be developed for a priority set of Darwin Core Terms. This in turn has resulted in GBIF initiating https://github.com/gbif/vocabulary, and see also https://docs.google.com/viewer?url=https%3A%2F%2Fdev.gbif.org%2Fissues%2Fbrowse%2FGBIF-121%2Fgbif-vocabularies-review_v01.docx.

