Commit

Merge pull request nationalparkservice#123 from RobLBaker/main
add test_missing_data
RobLBaker authored Jan 11, 2024
2 parents a32fee3 + 053b3a3 commit b096e06
Showing 22 changed files with 305 additions and 46 deletions.
1 change: 0 additions & 1 deletion DESCRIPTION
@@ -32,7 +32,6 @@ Imports:
crayon,
httr,
jsonlite,
-stats,
QCkit,
lifecycle
Remotes:
1 change: 1 addition & 0 deletions NAMESPACE
@@ -30,6 +30,7 @@ export(test_keywords)
export(test_license)
export(test_metadata_version)
export(test_methods)
export(test_missing_data)
export(test_notes)
export(test_numeric_fields)
export(test_orcid_exists)
5 changes: 5 additions & 0 deletions NEWS.md
@@ -1,5 +1,10 @@
# DPchecker development version

* Bugfix attempt for `test_fields_match()` reportedly needs more testing
* Add function `test_missing_data()` which scans data for NAs not documented in metadata

# DPchecker 0.3.3

* Bug fixes for `test_date_range()` and `test_dates_parse()`.
* Adjusted `test_datatable_urls()` and `test_datatable_urls_doi()` so that they work properly when there are no data table urls present in metadata.
* Move convert_datetime_format to QCkit; add QCkit as re-export to DPchecker
9 changes: 9 additions & 0 deletions R/run_checks.R
@@ -433,6 +433,15 @@ run_congruence_checks <- function(directory = here::here(),
warn_count <<- warn_count + 1
cli::cli_bullets(c(w$message, w$body))
})
tryCatch(test_missing_data(directory, metadata),
error = function(e) {
err_count <<- err_count + 1
cli::cli_bullets(c(e$message, e$body))
},
warning = function(w) {
warn_count <<- warn_count + 1
cli::cli_bullets(c(w$message, w$body))
})
tryCatch(test_numeric_fields(directory, metadata),
error = function(e) {
err_count <<- err_count + 1
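The hunk above wraps each test in `tryCatch()` and increments a shared counter with `<<-` from inside the handler. A minimal, self-contained sketch of that pattern in base R is shown here. `run_checks_sketch()` and the toy checks are hypothetical names used only for illustration, and `message()` stands in for `cli::cli_bullets()`.

```r
# Hypothetical helper: runs a list of zero-argument check functions and
# counts how many signalled an error or a warning.
run_checks_sketch <- function(checks) {
  err_count <- 0
  warn_count <- 0
  for (check in checks) {
    tryCatch(check(),
             error = function(e) {
               # `<<-` updates err_count in the enclosing function, not a local copy
               err_count <<- err_count + 1
               message("error: ", conditionMessage(e))
             },
             warning = function(w) {
               warn_count <<- warn_count + 1
               message("warning: ", conditionMessage(w))
             })
  }
  c(errors = err_count, warnings = warn_count)
}

# Toy checks (assumptions, not DPchecker functions)
res <- run_checks_sketch(list(
  function() stop("undocumented missing data"),
  function() warning("no DOI in metadata"),
  function() invisible(TRUE)
))
```

Note that a `warning` handler in `tryCatch()` stops the check at the first warning, so a check that would warn and then fail is counted once, as a warning.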
75 changes: 73 additions & 2 deletions R/tabular_data_congruence.R
@@ -387,7 +387,7 @@ test_datatable_urls <- function (metadata = load_metadata(directory)) {
return(invisible(metadata))
}

-#' Tests for data table URL formatting & correspondance with DOI
+#' Tests for data table URL formatting & correspondence with DOI
#'
#' @description `test_datatable_urls_doi()` passes if all data tables have URLs that are properly formatted (i.e. "https://irma.nps.gov/DataStore/Reference/Profile/xxxxxxx") where "xxxxxx" is identical to the DOI specified in the metadata. Fails with a warning if there is no DOI specified in metadata. If a DOI is specified in metadata, but the data table URL does not properly coincide with the url for the landing page that the doi points to for any one table, the test fails with a warning (and indicates which table failed). If data table urls do not exist, fails with an error and indicates how to add them.
#'
@@ -584,6 +584,78 @@ test_fields_match <- function(directory = here::here(), metadata = load_metadata
return(invisible(metadata))
}

#' Looks for undocumented missing data (NAs)
#'
#' @description `test_missing_data()` scans the data files in the data package for missing values that R reads as NA. If there are no missing data (NAs), or if all NAs are documented as missing data in the metadata, the test passes. If missing data are found but not documented in the metadata, the test fails with an error.
#'
#' Commonly, R will interpret blank cells as missing and fill in NA. To pass this test, you will need to either delete columns with missing data (if they are completely blank) or add NA as a missing data code during metadata creation.
#'
#' This is a fairly simple test and ONLY checks for NA. Although there are many common missing data codes (-99999, "Missing", "NaN", etc.), we cannot anticipate all of them.
#'
#' Why is it important to document missing data? If a user wants to use your data and some of it is missing without an explanation or acknowledgement, the user cannot trust any of the data in your data package to be complete.
#'
#' @inheritParams load_data
#' @inheritParams test_metadata_version
#'
#' @return Invisibly returns `metadata`.
#' @export
#' @examples
#' \dontrun{
#' test_missing_data(directory = here::here(),
metadata = load_metadata(directory))
#' }
test_missing_data <- function(directory = here::here(),
metadata = load_metadata(directory)) {
is_eml(metadata) # Throw an error if metadata isn't an emld object

# get dataTable and all children elements
data_tbl <- EML::eml_get(metadata, "dataTable")
data_tbl$`@context` <- NULL
# If there's only one csv, data_tbl ends up with one less level of nesting. Re-nest it so that the rest of the code works consistently
if ("attributeList" %in% names(data_tbl)) {
data_tbl <- list(data_tbl)
}
# get a list of the data files; `pattern` is a regular expression, so escape the dot and anchor the extension
data_files <- list.files(path = directory, pattern = "\\.csv$")

#load files and test for NAs
error_log <- NULL
for (i in seq_along(data_files)) {
#load each file
dat <- suppressMessages(readr::read_csv(file.path(directory, data_files[i]),
show_col_types = FALSE))
#look in each column in the given file
for (j in seq_len(ncol(dat))) {
#look for NAs; if NAs found, look for correct missing data codes
if (anyNA(dat[[j]])) {
# avoid shadowing base::missing(); %in% also copes with more than one documented code
missing_code <- data_tbl[[i]][["attributeList"]][["attribute"]][[j]][["missingValueCode"]][["code"]]
if (is.null(missing_code) || !("NA" %in% missing_code)) {
error_log <- append(error_log,
paste0(" ",
"---> {.file ",
data_files[i],
"} {.field ",
names(dat)[j],
"} contains missing data without a corresponding missing data code in metadata." ))
}
}
}
}
if(is.null(error_log)){
cli::cli_inform(c("v" = "Missing data listed as NA is accounted for in metadata"))
}
else{
# really only need to say it once per file/column combo
msg <- error_log
names(msg) <- rep(" ", length(msg))
err <- "Undocumented missing data detected. Please document all missing data in metadata:\n"
cli::cli_abort(c("x" = err, msg))
}
return(invisible(metadata))
}

#' Test Numeric Fields
#'
#' @description `test_numeric_fields()` verifies that all columns listed as numeric in the metadata are free of non-numeric data. If non-numeric data are encountered, the test fails with an error.
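The core comparison in `test_missing_data()` above (does a column contain NA, and if so, is "NA" listed as a missing data code for it?) can be tried without an EML file. The sketch below uses base R only. `find_undocumented_na()`, the example data frame, and the `codes` list are hypothetical stand-ins for the data files and the `missingValueCode` entries in metadata, not part of DPchecker.

```r
# Hypothetical helper: returns the names of columns that contain NA but do
# not list "NA" among their documented missing data codes.
find_undocumented_na <- function(dat, documented_codes) {
  # TRUE for every column with at least one NA
  has_na <- vapply(dat, anyNA, logical(1))
  # TRUE for every column whose documented codes include "NA";
  # a column with no entry (NULL) is treated as undocumented
  documented <- vapply(names(dat),
                       function(col) "NA" %in% documented_codes[[col]],
                       logical(1))
  names(dat)[has_na & !documented]
}

# Toy data: `site` and `count` contain NA, `depth` does not
dat <- data.frame(site  = c("A", NA, "C"),
                  count = c(1, NA, 3),
                  depth = c(0.5, 1.2, 2.0))

# Toy "metadata": only `site` documents NA as a missing data code
codes <- list(site = "NA", count = "-99999", depth = NULL)

undocumented <- find_undocumented_na(dat, codes)
```

Here only `count` is flagged: `site` has NAs but documents them, and `depth` has no NAs at all. Unlike the function in the diff, this sketch matches columns to codes by name rather than by position, which is a design choice of the sketch and not a claim about how the package works.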
@@ -682,7 +754,6 @@ test_numeric_fields <- function(directory = here::here(), metadata = load_metada
return(invisible(metadata))
}


#' Test data and metadata data formats match
#'
#' @description `test_dates_parse()` will examine all data columns that are described as containing dates and times. Although it can handle multiple different formats, the ISO-8601 format for dates and times is HIGHLY recommended (ISO is YYYY-MM-DDThh:mm:ss or just YYYY-MM-DD). The function will compare the format provided in the data files to the format indicated in metadata. If there are no dates indicated in the metadata, the test fails with a warning. If there are dates and the formats match, the test passes. If the formats do not match, the test fails with an error. The specific files and columns that failed are indicated in the results.
1 change: 1 addition & 0 deletions docs/articles/DPchecker.html

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions docs/authors.html

3 changes: 2 additions & 1 deletion docs/index.html

10 changes: 9 additions & 1 deletion docs/news/index.html

2 changes: 1 addition & 1 deletion docs/pkgdown.yml
@@ -3,5 +3,5 @@ pkgdown: 2.0.7
pkgdown_sha: ~
articles:
DPchecker: DPchecker.html
-last_built: 2023-12-20T15:06Z
+last_built: 2024-01-11T22:24Z

4 changes: 2 additions & 2 deletions docs/reference/DPchecker_example.html

7 changes: 6 additions & 1 deletion docs/reference/index.html

8 changes: 4 additions & 4 deletions docs/reference/run_congruence_checks.html
