add data back to fleet #675

Andrea-Havron-NOAA · 2024-10-11T17:11:10Z

What is the feature?

add data back to fleet

How have you implemented the solution?

add data objects to fleet
add data error checks in information
add get and set id functions to fleet interface and expose in rcpp_interface
update demo and tests
add distribution helper functions and tests
add tests to fleet interface

Does the PR impact any other area of the project, maybe another repo?

tests and demo needed to be changed

github-actions · 2024-10-11T17:11:25Z

Instructions for code reviewer

Hello reviewer, thanks for taking the time to review this PR!

Please use this checklist during your review, checking off items that you have verified are complete!
For PRs that don't make changes to code (e.g., changes to README.md or GitHub Actions workflows), feel free to skip over items on the checklist that are not relevant. Remember it is still important to do a thorough review.
Then, comment on the pull request with your review indicating where you have questions or changes need to be made before merging.
Remember to review every line of code you’ve been asked to review, look at the context, make sure you’re improving code health, and compliment developers on good things that they do.
PR reviews are a great way to learn, so feel free to share your tips and tricks. However, for optional changes (i.e., not required for merging), please include nit: (for nitpicking) before making the suggestion. For example, nit: I prefer using a data.frame() instead of a matrix because...
Engage with the developer when they respond to comments and check off additional boxes as they become complete so the PR can be merged in when all the tasks are fulfilled. Make it clear when this has been reached by commenting on the PR with something like This PR is now ready to be merged, no changes needed.

Checklist

The PR is requested to be merged into the appropriate branch (typically main)
The code is well-designed.
The functionality is good for the users of the code.
Any User Interface changes are sensible and look good.
The code isn’t more complex than it needs to be.
Code coverage remains high, indicating the new code is tested.
The developer used clear names for everything.
Comments are clear and useful, and mostly explain why instead of what.
Code is appropriately documented (doxygen and roxygen).

Andrea-Havron-NOAA · 2024-10-11T17:12:13Z

@msupernaw, can you check that the data is set up in fleet correctly?

@Bai-Li-NOAA, can you review the new helper functions for distributions?

R/distribution_formulas.R

kellijohnson-NOAA · 2024-10-15T04:07:45Z

R/distribution_formulas.R

+#' @param sd A list of length two. The first entry, `"value"`, stores the
+#'   initial values for the relevant standard deviations. The second entry,
+#'   `"estimated"` is a vector of booleans indicating whether or not
+#'   standard deviation is estimated.


It is not clear whether estimated should also be a vector when the sd values form a vector with length greater than one.

Maybe check whether the user provides both value and estimated, and specify default values in the Roxygen documentation?

added to documentation

kellijohnson-NOAA · 2024-10-15T04:11:13Z

R/distribution_formulas.R

+    if (family$link == "log") {
+      expected <- "log_expected_index"
+    }
+    if (family$link == "identity") {
+      expected <- "expected_index"
+    }


Should it only say index if it is index data?

"log_expected_index" and "expected_index" are internal names in fleet for the expected values calculated in population

kellijohnson-NOAA · 2024-10-15T04:23:09Z

R/distribution_formulas.R

+    if (family$link == "log") {
+      expected <- "log_expected_index"
+    }
+    if (family$link == "identity") {


I forgot to change all of the family$link to family[["link"]]
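
One reason to prefer family[["link"]] over family$link is that `$` partially matches names on lists, while `[[` matches exactly by default. A minimal sketch of the difference, using a made-up list (the "linkfun"-only list below is an assumption for illustration, not an object from the PR):

```r
# Hypothetical family-like list that has "linkfun" but no "link" entry.
fam <- list(family = "lognormal", linkfun = "log")

# `$` falls back to partial matching on lists, so "link" silently
# resolves to the "linkfun" entry.
partial_result <- fam$link

# `[[` requires an exact name by default, so the missing entry is NULL.
exact_result <- fam[["link"]]

print(partial_result)
print(exact_result)
```

With exact matching, a missing or misspelled entry surfaces as NULL instead of quietly returning a different element.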

kellijohnson-NOAA · 2024-10-15T04:26:57Z

R/distribution_formulas.R

+  families <- c("lognormal", "gaussian")
+  if (family[["family"]] == "normal") {
+    stop("use family = gaussian() instead")
+  }
+  if (!(family[["family"]] %in% families)) {
+    stop("FIMS currently does not offer this distribution for processes.")
+  }


It would be good to create a helper function here and use it for this and the warnings in the previous function instead of having duplicated code.

created helper function to validate user input
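
For readers following along, a shared validation helper of the kind discussed here might look roughly like the sketch below. The name check_family_name() and its arguments are hypothetical and are not taken from the PR; only the two error conditions come from the diff above.

```r
# Hypothetical helper; the name and signature are illustrative only.
check_family_name <- function(family, available) {
  family_name <- family[["family"]]
  # Point users of "normal" to the name used by stats::gaussian().
  if (identical(family_name, "normal")) {
    stop("use family = gaussian() instead", call. = FALSE)
  }
  # Reject any family that the caller does not list as available.
  if (!(family_name %in% available)) {
    stop("FIMS currently does not offer this distribution.", call. = FALSE)
  }
  invisible(family_name)
}

# Each caller passes its own set of supported families.
checked <- check_family_name(gaussian(), available = c("lognormal", "gaussian"))
```

Passing the allowed families as an argument lets the data and process functions reuse one check while keeping different lists.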

kellijohnson-NOAA · 2024-10-15T04:32:05Z

R/distribution_formulas.R

+#' \item{mu.eta}{
+#'   TODO: document mu.eta
+#'   function: derivative \eqn{TODO}.
+#' }


Needs more documentation.

Also, this is a REAL BIG picture comment. {sdmTMB} also has similar functions in families.R. It would be great if we could create a package to store these so that {sdmTMB}, {FIMS}, and any other package that wants to could use them without needing to depend on a complex package.

kellijohnson-NOAA · 2024-10-15T04:32:41Z

R/distribution_formulas.R

+#' Multinomial family and link specification
+#'
+#' @param link link function association with family
+#' @return An object of class "family"


The full list that is returned needs to be documented.

kellijohnson-NOAA · 2024-10-15T04:32:59Z

R/distribution_formulas.R

+
+#' Multinomial family and link specification
+#'
+#' @param link link function association with family


Needs to be a complete sentence or copied from [lognormal()] using @inheritParams.
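
As a rough illustration of the @inheritParams suggestion, the sketch below documents link once and reuses that text in a second function. The *_sketch function names, the parameter wording, and the simplified return values are assumptions; the real family objects in the PR return more entries than shown here.

```r
#' Lognormal family and link specification (sketch)
#'
#' @param link A string specifying the link function for the family.
#' @return An object of class "family".
lognormal_sketch <- function(link = "log") {
  structure(list(family = "lognormal", link = link), class = "family")
}

#' Multinomial family and link specification (sketch)
#'
#' @inheritParams lognormal_sketch
#' @return An object of class "family".
multinomial_sketch <- function(link = "logit") {
  structure(list(family = "multinomial", link = link), class = "family")
}

fam <- multinomial_sketch()
```

With roxygen2, the @inheritParams tag copies the @param text for any argument that is undocumented in the second function, so the description of link lives in one place.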

msupernaw

I'm good with these changes. We'll need to give some attention in M2Q to the derived quantity calculations and only calculate what is needed given the provided data. I have concerns about the parallel test for macOS. I'm assuming that it's unrelated to these changes?

Bai-Li-NOAA · 2024-10-16T13:35:34Z

R/distribution_formulas.R

+) {
+  data_type <- match.arg(data_type)
+  families <- c("lognormal", "gaussian", "multinomial")
+  if (family[["family"]] == "normal") {


Do we want to validate family structure before indexing into family[["family"]] and family[["link"]]? For example:

if (!all(c("family", "link") %in% names(family))) {
  stop("Family must contain both 'family' and 'link' entries.")
}

Each family has a default link function, which is the canonical link function based on the exponential family definition. If no link function is provided, the default argument is used, which is based on the structure of the family class from the stats package.

But we should probably verify that the family argument is an object of class "family".
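
The two checks raised in this thread could be combined along these lines. This is only a sketch: validate_family() is a hypothetical name and the first error message is invented for illustration.

```r
# Hypothetical validator; not a function from the PR.
validate_family <- function(family) {
  # Check the class first so that arbitrary lists are rejected early.
  if (!inherits(family, "family")) {
    stop("`family` must be an object of class 'family'.", call. = FALSE)
  }
  # Then confirm the entries that later code indexes into.
  if (!all(c("family", "link") %in% names(family))) {
    stop("Family must contain both 'family' and 'link' entries.", call. = FALSE)
  }
  invisible(TRUE)
}

# Objects built by the stats family constructors pass both checks.
ok <- validate_family(stats::gaussian(link = "identity"))
```

A plain list with the right names would fail the class check, which is the stricter behavior suggested above.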

Bai-Li-NOAA · 2024-10-16T13:35:43Z

R/distribution_formulas.R

+  families <- c("lognormal", "gaussian", "multinomial")
+  if (family[["family"]] == "normal") {
+    stop("use family = gaussian() instead")
+  }


The checks are clearly laid out and help prevent invalid configurations! You may want to consider using cli styling, as shown in the examples here.

updated to use cli styling, thanks for the suggestion!

Bai-Li-NOAA · 2024-10-16T13:35:51Z

R/distribution_formulas.R

+    new_module <- new(TMBDlnormDistribution)
+    new_module$log_logsd <- new(
+      ParameterVector,
+      log(sd$value),


Do we want to add a check for sd$value and ensure it's a positive number?

added check

But the check does not capture the case where value and estimated both have length greater than one but differ in length.

The check_distribution_validity function throws an error in this case. See line 49 of this file.
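
To make the cases in this thread concrete, here is a sketch of the kinds of checks being discussed: positive values, and matching lengths unless one side is a scalar. The function check_sd() is hypothetical, and the rule that a length-one entry is acceptable is an assumption inferred from the comment above, not confirmed behavior of check_distribution_validity.

```r
# Hypothetical check; not the implementation in the PR.
check_sd <- function(sd) {
  if (!all(c("value", "estimated") %in% names(sd))) {
    stop("`sd` needs both 'value' and 'estimated' entries.", call. = FALSE)
  }
  value <- sd[["value"]]
  estimated <- sd[["estimated"]]
  # log(sd$value) is only defined for positive numbers.
  if (!is.numeric(value) || anyNA(value) || any(value <= 0)) {
    stop("All entries of sd$value must be positive numbers.", call. = FALSE)
  }
  # Assumed rule: lengths may differ only when one entry has length one.
  if (length(value) != length(estimated) &&
      length(value) != 1 && length(estimated) != 1) {
    stop("sd$value and sd$estimated must match in length.", call. = FALSE)
  }
  invisible(TRUE)
}

ok <- check_sd(list(value = c(0.1, 0.2), estimated = FALSE))
```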

Bai-Li-NOAA · 2024-10-16T13:35:58Z

R/distribution_formulas.R

+      length(sd$value)
+    )
+    new_module$log_logsd$set_all_estimable(sd$estimated)
+    if (family[["link"]] == "log") {


The code occurs twice in the function (see Lines 64-69). Maybe make this into a helper function?

created helper function

Bai-Li-NOAA · 2024-10-16T13:36:33Z

R/distribution_formulas.R

+#' fam[["family"]]
+#' fam$link
+lognormal <- function(link = "log") {
+  r <- list(family = "lognormal")


The names r and f could be more descriptive.

Bai-Li-NOAA · 2024-10-16T13:36:40Z

R/distribution_formulas.R

+#' fam[["family"]]
+#' fam$link
+multinomial <- function(link = "logit") {
+  r <- list(family = "multinomial")


The names r and f could be more descriptive.

Bai-Li-NOAA · 2024-10-16T13:47:30Z

tests/testthat/test-distribution-formulas.R

+  fishing_fleet_index_distribution <-
+    new_data_distribution(data_type = "cpue", module = fishing_fleet,
+                        family = lognormal(link = "log"),
+                        sd = list(value = rep(sqrt(log(em_input$cv.L$fleet1^2 + 1)), om_input$nyr),


This code has been repeated in several places. Maybe extract it into a separate variable.
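
A sketch of what that extraction could look like. The numbers below are stand-ins for em_input$cv.L$fleet1 and om_input$nyr, and the estimated entry is an assumption because the diff above is cut off before it.

```r
# Illustrative stand-ins for em_input$cv.L$fleet1 and om_input$nyr.
cv_fleet1 <- 0.05
n_years <- 30

# Convert the CV to a log-scale standard deviation once and reuse it.
fleet1_sd <- list(
  value = rep(sqrt(log(cv_fleet1^2 + 1)), n_years),
  estimated = rep(FALSE, n_years) # assumed; not shown in the diff
)

str(fleet1_sd)
```

Each test can then pass sd = fleet1_sd instead of repeating the expression.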

Bai-Li-NOAA · 2024-10-16T13:52:19Z

I have reviewed R/distribution_formulas.R and the R tests. The functions have good error handling. One thing to consider is refactoring some of the repeated code.

Andrea-Havron-NOAA · 2024-10-18T16:04:55Z

I've addressed all the comments. Thanks for all the great feedback, I think the helper functions are more readable now! @kellijohnson-NOAA and @Bai-Li-NOAA, let me know if you approve the changes or have any follow-up comments.

kellijohnson-NOAA · 2024-10-18T18:15:02Z

R/distribution_formulas.R

+#' Validaity checks for new_data_distribution and new_process_distribution
+#' 
+check_distribution_validity <- function(args){
+  list2env(args, envir = environment())


Just an FYI, this will lead to check warnings on CRAN, not that we are trying for CRAN right now, because there is no way to know what is present in "args".

good catch! I like to try to make the code as CRAN ready as possible so we have less to fix later if we ever decide to submit.

removed use of list2env
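
One common alternative to list2env() is to pull each expected entry out of args by name, so that every variable used later is visibly defined in the function body. The sketch below is hypothetical; describe_args() and its return value exist only for illustration.

```r
# Hypothetical function; shows explicit extraction instead of list2env().
describe_args <- function(args) {
  # Each variable is assigned explicitly, so code checkers can see it.
  family <- args[["family"]]
  sd <- args[["sd"]]
  data_type <- args[["data_type"]] # NULL when the entry is absent
  list(
    family = family[["family"]],
    n_sd = length(sd[["value"]]),
    has_data_type = !is.null(data_type)
  )
}

result <- describe_args(
  list(family = gaussian(), sd = list(value = 1, estimated = FALSE))
)
```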

kellijohnson-NOAA · 2024-10-18T18:16:45Z

R/distribution_formulas.R

+  if(data_type == "index" || data_type == "cpue"){
+    if(family[["family"]] == "lognormal" || family[["family"]] == "gaussian"){
+      if(family[["link"]] == "log"){
+        expected_name <- "log_expected_index"
+      } 
+      if(family[["link"]] == "identity"){
+        expected_name <- "expected_index"
+      } 
+    }
+  }
+  if(data_type == "agecomp" || data_type == "lengthcomp"){
+    expected_name <- "proportion_catch_numbers_at_age"
+  }


Only if statements are present, no else statement. So what happens if you do not fit inside an if statement?

data_type uses match.arg() to set the value from the subset: index, cpue, agecomp, lengthcomp. A different data type will throw an error before this part of the code is run. Also, the default expected_name is NA, so technically that is the returned value if the input does not fit inside the if statements.

I just read that match.arg() uses partial matching, so I added an additional validity check to make sure data_type is one of the four options available. I also realized we don't have checks on link functions. sdmTMB adds these checks to the family functions themselves. I added a check to throw an error if expected_name is still NA while we work out where to put checks on link functions.
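
Both points in this thread can be illustrated with a short sketch: base R's match.arg() accepts unambiguous abbreviations, and an if / else if / else chain makes every path either return a name or raise an error. The function expected_name_for() is hypothetical and, unlike the PR code, ignores the family to keep the example small.

```r
data_types <- c("index", "cpue", "agecomp", "lengthcomp")

# match.arg() partially matches, so an abbreviation resolves to a full choice.
abbreviated <- match.arg("age", data_types)

# Hypothetical mapping in which no input can fall through silently.
expected_name_for <- function(data_type, link) {
  # An exact membership test rejects abbreviations and unknown types.
  if (!(data_type %in% data_types)) {
    stop("Unknown data_type: ", data_type, call. = FALSE)
  }
  if (data_type %in% c("agecomp", "lengthcomp")) {
    "proportion_catch_numbers_at_age"
  } else if (link == "log") {
    "log_expected_index"
  } else if (link == "identity") {
    "expected_index"
  } else {
    stop("No expected value is defined for link: ", link, call. = FALSE)
  }
}

index_name <- expected_name_for("index", "log")
```

The final else branch plays the same role as the NA check described above: an unsupported link produces an error instead of an NA name.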

kellijohnson-NOAA · 2024-10-22T17:45:37Z

@Andrea-Havron-NOAA did you want to rebase this to dev while you are doing the changes to cli::*?

kellijohnson-NOAA · 2024-10-22T18:01:33Z

R/distribution_formulas.R

+      ))
+    }
+
+    if ((data_type == "agecomp" || data_type == "lengthcomp") &&


I think we can change this to if (grepl("comp", data_type)) && to be more generic.
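
For reference, a quick sketch of how grepl() behaves on the four data types named in this PR. The anchored pattern in the second call is an optional variation, not something proposed in the thread.

```r
data_types <- c("index", "cpue", "agecomp", "lengthcomp")

# TRUE for any type whose name contains "comp".
is_comp <- grepl("comp", data_types)

# A stricter variant only matches names that end in "comp".
is_comp_suffix <- grepl("comp$", data_types)

print(setNames(is_comp, data_types))
```

Any composition type added later would be picked up without editing the condition, provided its name follows the same naming pattern.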

kellijohnson-NOAA · 2024-10-22T18:04:59Z

R/distribution_formulas.R

+  }
+  if(!is.null(args$data_type)){
+    data_type <- args$data_type
+    data_type_names <- c("index", "cpue", "agecomp", "lengthcomp")


Why is "cpue" an option here?

I wasn't sure what descriptor to use here... should it be landings, or should index be used to describe both fleet and survey data?

Andrea-Havron-NOAA · 2024-10-22T19:51:57Z

@Andrea-Havron-NOAA did you want to rebase this to dev while you are doing the changes to cli::*?

yes, I can work on this rebase.

Andrea-Havron-NOAA · 2024-10-22T21:50:29Z

@Andrea-Havron-NOAA did you want to rebase this to dev while you are doing the changes to cli::*?

yes, I can work on this rebase.

I have addressed the most recent edits, squashed all commits, and rebased with dev.

kellijohnson-NOAA · 2024-10-23T13:06:51Z

@Andrea-Havron-NOAA sorry to leave this PR open for so long but I have one more major question ... should we fix the distributions per @Bai-Li-NOAA's comments that they should be more similar, e.g., take an sd argument rather than log_logsd, etc. in this PR? Then, I have a minor question, mainly for @Bai-Li-NOAA should the functions that set up the distributions be named setup_process_distribution() and setup_data_distribution() instead of new_*() to match the functions that you are creating?

Bai-Li-NOAA · 2024-10-23T13:30:11Z

@kellijohnson-NOAA, I'm going to chat with Andrea about the distributions today. Along with making the distribution arguments more similar, I also want to figure out which fields of the distribution object should be accessible to users.

For the R function names, please feel free to leave them as they are for now. I’ll handle the refactoring later in the branch I’m working on. I’m thinking about using initialize_*() instead of setup_*() if the function involves using methods::new() to initialize an object.

kellijohnson-NOAA · 2024-10-23T15:21:04Z

Ahh thanks for the insight @Bai-Li-NOAA 😃 I am guessing that @Andrea-Havron-NOAA started their naming with new_* because they use methods::new() but I didn't put it together 😕. I will wait until after your meeting with @Andrea-Havron-NOAA to decide if more refactoring should be done here or if this should be merged.

Andrea-Havron-NOAA · 2024-10-23T19:23:27Z

@Andrea-Havron-NOAA sorry to leave this PR open for so long but I have one more major question ... should we fix the distributions per @Bai-Li-NOAA's comments that they should be more similar, e.g., take an sd argument rather than log_logsd, etc. in this PR? Then, I have a minor question, mainly for @Bai-Li-NOAA should the functions that set up the distributions be named setup_process_distribution() and setup_data_distribution() instead of new_*() to match the functions that you are creating?

@kellijohnson-NOAA, @Bai-Li-NOAA and I think we need the changes from both this PR and her branch to move forward with standardizing distribution arguments. After this branch gets rebased with dev, she can rebase her branch with dev and I can create a new branch off of hers to work on these changes.

* add data error checks in information
* add get and set id functions to fleet interface and expose in rcpp_interface
* update demo and tests
* add distribution helper functions and tests
* add tests to fleet interface
* Fix formatting for tidyverse style.
* Increase some of the documentation to better explain parameters.
* Share argument documentation across functions.
* Use match.arg().
* use && and ||
* add documentation
* add examples
* add check validity function
* update error message to use cli formatting
* add new tests
* helper function for expected names

Andrea-Havron-NOAA requested review from msupernaw and Bai-Li-NOAA October 11, 2024 17:11