add data back to fleet #675

Merged
merged 1 commit into from
Oct 23, 2024

Conversation

Andrea-Havron-NOAA
Collaborator

What is the feature?

  • add data back to fleet

How have you implemented the solution?

  • add data objects to fleet
  • add data error checks in information
  • add get and set id functions to fleet interface and expose in rcpp_interface
  • update demo and tests
  • add distribution helper functions and tests
  • add tests to fleet interface

Does the PR impact any other area of the project, maybe another repo?

  • tests and demo needed to be changed

Contributor

Instructions for code reviewer

Hello reviewer, thanks for taking the time to review this PR!

  • Please use this checklist during your review, checking off items that you have verified are complete!
  • For PRs that don't make changes to code (e.g., changes to README.md or Github actions workflows), feel free to skip over items on the checklist that are not relevant. Remember it is still important to do a thorough review.
  • Then, comment on the pull request with your review indicating where you have questions or changes need to be made before merging.
  • Remember to review every line of code you’ve been asked to review, look at the context, make sure you’re improving code health, and compliment developers on good things that they do.
  • PR reviews are a great way to learn, so feel free to share your tips and tricks. However, for optional changes (i.e., not required for merging), please include nit: (for nitpicking) before making the suggestion. For example, nit: I prefer using a data.frame() instead of a matrix because...
  • Engage with the developer when they respond to comments and check off additional boxes as they become complete so the PR can be merged in when all the tasks are fulfilled. Make it clear when this has been reached by commenting on the PR with something like This PR is now ready to be merged, no changes needed.

Checklist

  • The PR is requested to be merged into the appropriate branch (typically main)
  • The code is well-designed.
  • The functionality is good for the users of the code.
  • Any User Interface changes are sensible and look good.
  • The code isn’t more complex than it needs to be.
  • Code coverage remains high, indicating the new code is tested.
  • The developer used clear names for everything.
  • Comments are clear and useful, and mostly explain why instead of what.
  • Code is appropriately documented (doxygen and roxygen).

@Andrea-Havron-NOAA
Collaborator Author

@msupernaw, can you check that the data is set up in fleet correctly?

@Bai-Li-NOAA, can you review the new helper functions for distributions?

Comment on lines 10 to 13
#' @param sd A list of length two. The first entry, `"value"`, stores the
#' initial values for the relevant standard deviations. The second entry,
#' `"estimated"` is a vector of booleans indicating whether or not
#' standard deviation is estimated.
Contributor

It is not clear if estimated should be a vector if sd is a vector with a length greater than one.

Contributor

Maybe check whether the user provides both value and estimated, and specify default values in the Roxygen documentation?

Collaborator Author

added to documentation

Comment on lines 52 to 184
if (family$link == "log") {
  expected <- "log_expected_index"
}
if (family$link == "identity") {
  expected <- "expected_index"
}
Contributor

Should it only say index if it is index data?

Collaborator Author

"log_expected_index" and "expected_index" are internal names in fleet for the expected values calculated in population

if (family$link == "log") {
expected <- "log_expected_index"
}
if (family$link == "identity") {
Contributor

I forgot to change all of the family$link to family[["link"]]

Collaborator Author

done

Comment on lines 101 to 107
families <- c("lognormal", "gaussian")
if (family[["family"]] == "normal") {
  stop("use family = gaussian() instead")
}
if (!(family[["family"]] %in% families)) {
  stop("FIMS currently does not offer this distribution for processes.")
}
Contributor

It would be good to create a helper function here and use it for this and the warnings in the previous function instead of having duplicated code.

Contributor

+1

Collaborator Author

created helper function to validate user input
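
A minimal sketch of what such a shared validation helper could look like. The name `check_family_supported()` and the exact messages are hypothetical and are not taken from the FIMS source; the point is only that the data and process setup functions can share one check and differ in the set of allowed families.

```r
# Hypothetical helper (illustrative name, not the FIMS implementation).
# It centralizes the family checks that were duplicated in both functions.
check_family_supported <- function(family, families) {
  family_name <- family[["family"]]
  # Point users who pass "normal" to the supported spelling.
  if (identical(family_name, "normal")) {
    stop("use family = gaussian() instead", call. = FALSE)
  }
  # Reject anything that is not in the caller's list of allowed families.
  if (!(family_name %in% families)) {
    stop(
      "FIMS currently does not offer this distribution: ", family_name,
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Each caller passes its own set of allowed families.
check_family_supported(
  family = list(family = "gaussian", link = "identity"),
  families = c("lognormal", "gaussian")
)
```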

Comment on lines 153 to 156
#' \item{mu.eta}{
#' TODO: document mu.eta
#' function: derivative \eqn{TODO}.
#' }
Contributor

Needs more documentation.

Also, this is a REAL BIG picture comment. {sdmTMB} also has similar functions in families.R. It would be great if we could create a package to store these so that {sdmTMB}, {FIMS}, and any other package could use them without depending on a complex package.

#' Multinomial family and link specification
#'
#' @param link link function association with family
#' @return An object of class "family"
Contributor

The full list that is returned needs to be documented.

Collaborator Author

done


#' Multinomial family and link specification
#'
#' @param link link function association with family
Contributor

@kellijohnson-NOAA Oct 15, 2024

Needs to be a complete sentence or copied from [lognormal()] using @inheritParams

Collaborator Author

done
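
For illustration, the `@inheritParams` suggestion could look roughly like the sketch below. The `*_sketch` function names and the simplified return value (only `family` and `link`) are assumptions made for this example; the real family objects in the PR carry more entries, such as `mu.eta`.

```r
#' Lognormal family and link specification (simplified sketch)
#'
#' @param link A string naming the link function to use with the family.
#' @return A list of class `"family"` with the entries `family` and `link`.
lognormal_sketch <- function(link = "log") {
  structure(list(family = "lognormal", link = link), class = "family")
}

#' Multinomial family and link specification (simplified sketch)
#'
#' The description of `link` is copied from lognormal_sketch() by roxygen2,
#' so the wording only has to be maintained in one place.
#' @inheritParams lognormal_sketch
#' @return A list of class `"family"` with the entries `family` and `link`.
multinomial_sketch <- function(link = "logit") {
  structure(list(family = "multinomial", link = link), class = "family")
}

fam <- multinomial_sketch()
fam[["family"]]
fam$link
```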

Contributor

@msupernaw left a comment

I'm good with these changes. We'll need to give some attention in M2Q to the derived quantity calculations and only calculate what is needed given the provided data. I have concerns about the parallel test for MacOS. I'm assuming that it's unrelated to these changes?

) {
data_type <- match.arg(data_type)
families <- c("lognormal", "gaussian", "multinomial")
if (family[["family"]] == "normal") {
Contributor

Do we want to validate family structure before indexing into family[["family"]] and family[["link"]]? For example:

if (!all(c("family", "link") %in% names(family))) {
  stop("Family must contain both 'family' and 'link' entries.")
}

Collaborator Author

Each family has a default link function, which is the canonical link function based on the exponential family definition. If no link function is provided, the default argument is used, which is based on the structure of the family class from the stats package.

Collaborator Author

But we should probably verify that the family argument is an object of class "family".
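
A minimal sketch of that class check, combined with the names check suggested earlier in this thread. `check_is_family()` is a hypothetical name and is not part of the PR; `stats::gaussian()` is used only because it is a readily available object of class "family" with a default link.

```r
# Hypothetical validator (illustrative, not the FIMS implementation).
check_is_family <- function(family) {
  # Family constructors in the stats package return objects of class "family".
  if (!inherits(family, "family")) {
    stop(
      "`family` must be an object of class \"family\", e.g., gaussian().",
      call. = FALSE
    )
  }
  # Guard the later family[["family"]] and family[["link"]] lookups.
  if (!all(c("family", "link") %in% names(family))) {
    stop("`family` must contain both 'family' and 'link' entries.", call. = FALSE)
  }
  invisible(family)
}

fam <- check_is_family(stats::gaussian())
fam[["link"]] # "identity", the default link of stats::gaussian()
```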

families <- c("lognormal", "gaussian", "multinomial")
if (family[["family"]] == "normal") {
  stop("use family = gaussian() instead")
}
Contributor

The checks are clearly laid out and help prevent invalid configurations! May consider using cli styling, as shown in the examples here.

Collaborator Author

updated to use cli styling, thanks for the suggestion!
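
As a rough illustration of cli-styled errors (not the code that was committed), the sketch below uses `cli::cli_abort()` with a header plus "x" and "i" bullets. The function name `abort_unsupported_family()` and the message wording are assumptions, and the base R fallback exists only so the example runs where {cli} is not installed.

```r
# Hypothetical helper showing cli-style error formatting.
abort_unsupported_family <- function(family_name, families) {
  if (requireNamespace("cli", quietly = TRUE)) {
    # {.val ...} styles values; vectors are collapsed into a readable list.
    cli::cli_abort(c(
      "FIMS currently does not offer this distribution.",
      "x" = "The family {.val {family_name}} is not available.",
      "i" = "Available families: {.val {families}}."
    ))
  }
  # Plain fallback when {cli} is unavailable.
  stop(
    "FIMS currently does not offer this distribution: ", family_name,
    call. = FALSE
  )
}

# Capture the error instead of halting the script.
err <- tryCatch(
  abort_unsupported_family("poisson", c("lognormal", "gaussian")),
  error = function(e) e
)
```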

new_module <- new(TMBDlnormDistribution)
new_module$log_logsd <- new(
ParameterVector,
log(sd$value),
Contributor

Do we want to add a check for sd$value and ensure it's a positive number?

Collaborator Author

added check

Contributor

But the check does not capture the case when they are both greater than one but of different lengths.

Collaborator Author

The check_distribution_validity function throws an error in this case. See line 49 of this file.
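
The two checks discussed here (positive values, compatible lengths) can be sketched as below. This is not the body of check_distribution_validity; `check_sd_sketch()` is a hypothetical name, and the rule that `estimated` must have length one or match `value` is an assumption made for the example.

```r
# Hypothetical sketch of sd validation (not the FIMS implementation).
check_sd_sketch <- function(sd) {
  if (!all(c("value", "estimated") %in% names(sd))) {
    stop("`sd` must contain both 'value' and 'estimated'.", call. = FALSE)
  }
  value <- sd[["value"]]
  estimated <- sd[["estimated"]]
  # log(sd$value) is taken later, so every value has to be strictly positive.
  if (!is.numeric(value) || anyNA(value) || any(value <= 0)) {
    stop("All standard deviation values must be positive numbers.", call. = FALSE)
  }
  # Assumed rule: one flag for all values, or one flag per value.
  if (length(estimated) != 1 && length(estimated) != length(value)) {
    stop(
      "`estimated` must have length 1 or the same length as `value`.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

check_sd_sketch(list(value = c(0.1, 0.2, 0.3), estimated = FALSE))
```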

length(sd$value)
)
new_module$log_logsd$set_all_estimable(sd$estimated)
if (family[["link"]] == "log") {
Contributor

The code occurs twice in the function (see Lines 64-69). Maybe make this into a helper function?

Collaborator Author

created helper function

#' fam[["family"]]
#' fam$link
lognormal <- function(link = "log") {
r <- list(family = "lognormal")
Contributor

The names r and f could be more descriptive.

Collaborator Author

done

#' fam[["family"]]
#' fam$link
multinomial <- function(link = "logit") {
r <- list(family = "multinomial")
Contributor

The names r and f could be more descriptive.

Collaborator Author

done

fishing_fleet_index_distribution <-
new_data_distribution(data_type = "cpue", module = fishing_fleet,
family = lognormal(link = "log"),
sd = list(value = rep(sqrt(log(em_input$cv.L$fleet1^2 + 1)), om_input$nyr),
Contributor

This code has been repeated several places. Maybe extract it into a separate variable.

Collaborator Author

done
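
A small sketch of the extraction, using stand-in values because `em_input` and `om_input` are not shown in this thread. The repeated expression converts a coefficient of variation into a standard deviation on the log scale, sd = sqrt(log(cv^2 + 1)). The variable names and the `estimated` entry are assumptions made for this example.

```r
# Hypothetical stand-ins for em_input$cv.L$fleet1 and om_input$nyr.
cv_fleet1 <- 0.1
n_years <- 30

# Compute the log-scale standard deviation once and reuse it wherever the
# demo and tests previously repeated the full expression.
fleet1_sd <- rep(sqrt(log(cv_fleet1^2 + 1)), n_years)

# The vector can then be passed as the `value` entry of the `sd` argument.
sd_input <- list(value = fleet1_sd, estimated = FALSE)
```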

@Bai-Li-NOAA
Contributor

I have reviewed R/distribution_formulas.R and the R tests. The functions have good error handling. One thing to consider is refactoring some of the repeated code.

@Andrea-Havron-NOAA
Collaborator Author

I've addressed all the comments. Thanks for all the great feedback, I think the helper functions are more readable now! @kellijohnson-NOAA and @Bai-Li-NOAA, let me know if you approve the changes or have any follow-up comments.

#' Validity checks for new_data_distribution and new_process_distribution
#'
check_distribution_validity <- function(args){
list2env(args, envir = environment())
Contributor

Just an FYI this will lead to warning checks on CRAN, not that we are trying for CRAN right now, because there is no way to know what is present in "args".

Collaborator Author

good catch! I like to try to make the code as CRAN ready as possible so we have less to fix later if we ever decide to submit.

Collaborator Author

removed use of list2env
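
One way to avoid list2env(), sketched under assumptions: each expected element is pulled out of `args` by name, so static checks can see which variables the function uses. `check_args_sketch()` and its messages are hypothetical; only the `data_type` options come from the code quoted in this PR.

```r
# Hypothetical sketch (not the FIMS implementation).
check_args_sketch <- function(args) {
  # Explicit extraction replaces list2env(args, envir = environment()).
  family <- args[["family"]]
  data_type <- args[["data_type"]] # NULL when the caller did not supply it

  if (is.null(family)) {
    stop("`args` must contain a `family` entry.", call. = FALSE)
  }
  if (!is.null(data_type)) {
    data_type_names <- c("index", "cpue", "agecomp", "lengthcomp")
    if (!(data_type %in% data_type_names)) {
      stop(
        "`data_type` must be one of: ",
        paste(data_type_names, collapse = ", "),
        call. = FALSE
      )
    }
  }
  invisible(TRUE)
}

check_args_sketch(list(
  family = list(family = "lognormal", link = "log"),
  data_type = "index"
))
```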

Comment on lines 72 to 89
if(data_type == "index" || data_type == "cpue"){
  if(family[["family"]] == "lognormal" || family[["family"]] == "gaussian"){
    if(family[["link"]] == "log"){
      expected_name <- "log_expected_index"
    }
    if(family[["link"]] == "identity"){
      expected_name <- "expected_index"
    }
  }
}
if(data_type == "agecomp" || data_type == "lengthcomp"){
  expected_name <- "proportion_catch_numbers_at_age"
}
Contributor

Only if statements are present, no else statement. So what happens if you do not fit inside an if statement?

Collaborator Author

data_type uses match.arg() to set the value from the subset: index, cpue, agecomp, lengthcomp. A different data type will throw an error before this part of the code is run. Also, the default expected_name is NA, so technically that is the returned value if the input does not fit inside the if statements.

Collaborator Author

I just read that match.arg() uses partial matching, so I added an additional validity check to make sure data_type is one of the four options available. I also realized we don't have checks on link functions. sdmTMB adds these checks to the family functions themselves. I added a check to throw an error if expected_name is still NA while we work out where to put checks on link functions.
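
The behavior described above can be sketched as follows. The first function shows how match.arg() accepts an abbreviation; the second is a hypothetical helper (the name `get_expected_name_sketch()` and the messages are assumptions) that requires an exact data type and errors when no expected name was assigned, using the internal names quoted in this PR.

```r
data_types <- c("index", "cpue", "agecomp", "lengthcomp")

# match.arg() accepts unambiguous abbreviations of the allowed values.
pick_data_type <- function(data_type = c("index", "cpue", "agecomp", "lengthcomp")) {
  match.arg(data_type)
}
pick_data_type("age") # returns "agecomp"

# Hypothetical helper: exact matching plus an error if the name stays NA.
get_expected_name_sketch <- function(family, data_type) {
  if (!(data_type %in% data_types)) {
    stop("`data_type` must be one of: ", paste(data_types, collapse = ", "),
      call. = FALSE
    )
  }
  expected_name <- NA_character_
  if (data_type %in% c("index", "cpue") &&
    family[["family"]] %in% c("lognormal", "gaussian")) {
    # Unsupported links fall through to the NA default.
    expected_name <- switch(family[["link"]],
      log = "log_expected_index",
      identity = "expected_index",
      NA_character_
    )
  }
  if (data_type %in% c("agecomp", "lengthcomp")) {
    expected_name <- "proportion_catch_numbers_at_age"
  }
  if (is.na(expected_name)) {
    stop("No expected value is available for this family and link.", call. = FALSE)
  }
  expected_name
}

get_expected_name_sketch(list(family = "lognormal", link = "log"), "index")
```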

@kellijohnson-NOAA
Contributor

@Andrea-Havron-NOAA did you want to rebase this to dev while you are doing the changes to cli::*?

))
}

if ((data_type == "agecomp" || data_type == "lengthcomp") &&
Contributor

I think we can change this to if (grepl("comp", data_type) && to be more generic.
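
A quick sketch of how the grepl() test behaves on the four data types used in this PR. The anchored pattern is an optional variation added here for illustration; it only matches names that end in "comp".

```r
data_types <- c("index", "cpue", "agecomp", "lengthcomp")

# TRUE for any data type whose name contains "comp".
is_comp <- grepl("comp", data_types)
is_comp # FALSE FALSE TRUE TRUE

# Stricter variation: require "comp" at the end of the name.
is_comp_suffix <- grepl("comp$", data_types)
```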

}
if(!is.null(args$data_type)){
data_type <- args$data_type
data_type_names <- c("index", "cpue", "agecomp", "lengthcomp")
Contributor

Why is "cpue" an option here?

Collaborator Author

I wasn't sure what descriptor to use here... should it be landings, or should index be used to describe both fleet and survey data?

Contributor

"index"

@Andrea-Havron-NOAA
Collaborator Author

@Andrea-Havron-NOAA did you want to rebase this to dev while you are doing the changes to cli::*?

yes, I can work on this rebase.

@Andrea-Havron-NOAA force-pushed the dev-data-to-fleet branch 2 times, most recently from 7f75666 to 7800a8f, October 22, 2024 21:20
@Andrea-Havron-NOAA
Collaborator Author

@Andrea-Havron-NOAA did you want to rebase this to dev while you are doing the changes to cli::*?

yes, I can work on this rebase.

I have addressed the most recent edits, squashed all commits, and rebased with dev.

@kellijohnson-NOAA
Contributor

@Andrea-Havron-NOAA sorry to leave this PR open for so long but I have one more major question ... should we fix the distributions per @Bai-Li-NOAA's comments that they should be more similar, e.g., take an sd argument rather than log_logsd, etc. in this PR? Then, I have a minor question, mainly for @Bai-Li-NOAA should the functions that set up the distributions be named setup_process_distribution() and setup_data_distribution() instead of new_*() to match the functions that you are creating?

@Bai-Li-NOAA
Contributor

@kellijohnson-NOAA, I'm going to chat with Andrea about the distributions today. Along with making the distribution arguments more similar, I also want to figure out which fields of the distribution object should be accessible to users.

For the R function names, please feel free to leave them as they are for now. I’ll handle the refactoring later in the branch I’m working on. I’m thinking about using initialize_*() instead of setup_*() if the function involves using methods::new() to initialize an object.

@kellijohnson-NOAA
Contributor

Ahh thanks for the insight @Bai-Li-NOAA 😃 I am guessing that @Andrea-Havron-NOAA started their naming with new_* because they use methods::new() but I didn't put it together 😕. I will wait until after your meeting with @Andrea-Havron-NOAA to decide if more refactoring should be done here or if this should be merged.

@Andrea-Havron-NOAA
Collaborator Author

@Andrea-Havron-NOAA sorry to leave this PR open for so long but I have one more major question ... should we fix the distributions per @Bai-Li-NOAA's comments that they should be more similar, e.g., take an sd argument rather than log_logsd, etc. in this PR? Then, I have a minor question, mainly for @Bai-Li-NOAA should the functions that set up the distributions be named setup_process_distribution() and setup_data_distribution() instead of new_*() to match the functions that you are creating?

@kellijohnson-NOAA, @Bai-Li-NOAA and I think we need the changes from both this PR and her branch to move forward with standardizing distribution arguments. After this branch gets rebased with dev, she can rebase her branch with dev and I can create a new branch off of hers to work on these changes.

* add data error checks in information
* add get and set id functions to fleet interface and expose in rcpp_interface
* update demo and tests
* add distribution helper functions and tests
* add tests to fleet interface
* Fix formatting for tidyverse style.
* Increase some of the documentation to better explain parameters.
* Share argument documentation across functions.
* Use match.arg().
* use && and ||
* add documentation
* add examples
* add check validity function
* update error message to use cli formatting
* add new tests
* helper function for expected names
@kellijohnson-NOAA merged commit 5c2c721 into dev Oct 23, 2024
9 checks passed
@kellijohnson-NOAA deleted the dev-data-to-fleet branch October 23, 2024 21:46