-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add data back to fleet #675
Conversation
Instructions for code reviewerHello reviewer, thanks for taking the time to review this PR!
Checklist
|
@msupernaw, can you check that the data is set up in fleet correctly? @Bai-Li-NOAA, can you review the new helper functions for distributions? |
R/distribution_formulas.R
Outdated
#' @param sd A list of length two. The first entry, `"value"`, stores the | ||
#' initial values for the relevant standard deviations. The second entry, | ||
#' `"estimated"` is a vector of booleans indicating whether or not | ||
#' standard deviation is estimated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear if estimated should be a vector if sd is a vector with a length greater than one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe check if the user provide both value
and estimated
, and specify default values in the Roxygen documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added to documentation
R/distribution_formulas.R
Outdated
if (family$link == "log") { | ||
expected <- "log_expected_index" | ||
} | ||
if (family$link == "identity") { | ||
expected <- "expected_index" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it only say index if it is index data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"log_expected_index" and "expected_index" are internal names in fleet for the expected values calculated in population
3cfbbdd
to
43b331d
Compare
R/distribution_formulas.R
Outdated
if (family$link == "log") { | ||
expected <- "log_expected_index" | ||
} | ||
if (family$link == "identity") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to change all of the family$link
to family[["link"]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
R/distribution_formulas.R
Outdated
families <- c("lognormal", "gaussian") | ||
if (family[["family"]] == "normal") { | ||
stop("use family = gaussian() instead") | ||
} | ||
if (!(family[["family"]] %in% families)) { | ||
stop("FIMS currently does not offer this distribution for processes.") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to create a helper function here and use it for this and the warnings in the previous function instead of having duplicated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created helper function to validate user input
R/distribution_formulas.R
Outdated
#' \item{mu.eta}{ | ||
#' TODO: document mu.eta | ||
#' function: derivative \eqn{TODO}. | ||
#' } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs more documentation.
Also, this is a REAL BIG picture comment. {sdmTMB} also has similar functions in families.R. It would be great if we could create a package to store these where both {sdmTMB}, {FIMS}, and any other package that wants to could use them without needing to require a complex package.
R/distribution_formulas.R
Outdated
#' Multinomial family and link specification | ||
#' | ||
#' @param link link function association with family | ||
#' @return An object of class "family" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The full list that is returned needs to be documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
R/distribution_formulas.R
Outdated
|
||
#' Multinomial family and link specification | ||
#' | ||
#' @param link link function association with family |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be a complete sentence or copied from [lognormal()] using @inheritParams
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good with these changes. We'll need to give some attention in M2Q to the derived quantity calculations and only calculate what is needed given the provided data. I have concerns about the parallel test for MacOS. I'm assuming that it's unrelated to these changes?
R/distribution_formulas.R
Outdated
) { | ||
data_type <- match.arg(data_type) | ||
families <- c("lognormal", "gaussian", "multinomial") | ||
if (family[["family"]] == "normal") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to validate family structure before indexing into family[["family"]] and family[["link"]]? For example:
if (!all(c("family", "link") %in% names(family))) {
stop("Family must contain both 'family' and 'link' entries.")
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each family has a default link function, which is the canonical link function based on the exponential family definition. If no link function is provided, the default argument is used, which is based on the structure of the family class from the stats package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but we should probably verify that the family argument is a class type of family.
R/distribution_formulas.R
Outdated
families <- c("lognormal", "gaussian", "multinomial") | ||
if (family[["family"]] == "normal") { | ||
stop("use family = gaussian() instead") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The checks are clearly laid out and help prevent invalid configurations! May consider using cli styling, as shown in the examples here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to use cli styling, thanks for the suggestion!
new_module <- new(TMBDlnormDistribution) | ||
new_module$log_logsd <- new( | ||
ParameterVector, | ||
log(sd$value), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add a check for sd$value
and ensure it's a positive number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the check does not capture the case when they are both greater than one but of different lengths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check_distribution_validity function throws an error in this case. See line 49 of this file.
R/distribution_formulas.R
Outdated
length(sd$value) | ||
) | ||
new_module$log_logsd$set_all_estimable(sd$estimated) | ||
if (family[["link"]] == "log") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code occurs twice in the function (see Lines 64-69). Maybe make this into a helper function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created helper function
R/distribution_formulas.R
Outdated
#' fam[["family"]] | ||
#' fam$link | ||
lognormal <- function(link = "log") { | ||
r <- list(family = "lognormal") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The names r
and f
could be more descriptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
R/distribution_formulas.R
Outdated
#' fam[["family"]] | ||
#' fam$link | ||
multinomial <- function(link = "logit") { | ||
r <- list(family = "multinomial") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The names r
and f
could be more descriptive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
fishing_fleet_index_distribution <- | ||
new_data_distribution(data_type = "cpue", module = fishing_fleet, | ||
family = lognormal(link = "log"), | ||
sd = list(value = rep(sqrt(log(em_input$cv.L$fleet1^2 + 1)), om_input$nyr), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has been repeated several places. Maybe extract it into a separate variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
I have reviewed |
I've addressed all the comments. Thanks for all the great feedback, I think the helper functions are more readable now! @kellijohnson-NOAA and @Bai-Li-NOAA, let me know if you approve the changes or have any follow-up comments. |
R/distribution_formulas.R
Outdated
#' Validaity checks for new_data_distribution and new_process_distribution | ||
#' | ||
check_distribution_validity <- function(args){ | ||
list2env(args, envir = environment()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an FYI this will lead to warning checks on CRAN, not that we are trying for CRAN right now, because there is no way to know what is present in "args".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! I like to try to make the code as CRAN ready as possible so we have less to fix later if we ever decide to submit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed use of list2env
R/distribution_formulas.R
Outdated
if(data_type == "index" || data_type == "cpue"){ | ||
if(family[["family"]] == "lognormal" || family[["family"]] == "gaussian"){ | ||
if(family[["link"]] == "log"){ | ||
expected_name <- "log_expected_index" | ||
} | ||
if(family[["link"]] == "identity"){ | ||
expected_name <- "expected_index" | ||
} | ||
} | ||
} | ||
if(data_type == "agecomp" || data_type == "lengthcomp"){ | ||
expected_name <- "proportion_catch_numbers_at_age" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only if statements are present, no else statement. So what happens if you do not fit inside an if statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data_type uses match_args to set the value from the subset: index, cpue, agecomp, lengthcomp
. A different data type will throw an error before this part of the code is run. Also, the default expected_name is NA, so technically that is the returned value if input does not fit inside the if statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just read that match_arg uses partial matching, so I added an additional validity check to make sure data_type is one of the four options available. I also realized we don't have checks on link functions. sdmTMB adds these checks to the family functions themselves. I added a check to throw an error if the expected_value is still NA while we work out where to put checks on link functions.
@Andrea-Havron-NOAA did you want to rebase this to dev while you are doing the changes to |
R/distribution_formulas.R
Outdated
)) | ||
} | ||
|
||
if ((data_type == "agecomp" || data_type == "lengthcomp") && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can change this to if (grepl("comp", data_type)) &&
to be more generic.
R/distribution_formulas.R
Outdated
} | ||
if(!is.null(args$data_type)){ | ||
data_type <- args$data_type | ||
data_type_names <- c("index", "cpue", "agecomp", "lengthcomp") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is "cpue" an option here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure what descriptior to use here... should it be landings
, or should index
be used to describe both fleet and survey data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"index"
yes, I can work on this rebase. |
7f75666
to
7800a8f
Compare
I have addressed the most recent edits, sqashed all commits, and rebased with dev |
@Andrea-Havron-NOAA sorry to leave this PR open for so long but I have one more major question ... should we fix the distributions per @Bai-Li-NOAA's comments that they should be more similar, e.g., take an sd argument rather than log_logsd, etc. in this PR? Then, I have a minor question, mainly for @Bai-Li-NOAA should the functions that set up the distributions be named |
@kellijohnson-NOAA, I'm going to chat with Andrea about the distributions today. Along with making the distribution arguments more similar, I also want to figure out which fields of the distribution object should be accessible to users. For the R function names, please feel free to leave them as they are for now. I’ll handle the refactoring later in the branch I’m working on. I’m thinking about using |
Ahh thanks for the insight @Bai-Li-NOAA 😃 I am guessing that @Andrea-Havron-NOAA started their naming with new_* because they use methods::new() but I didn't put it together 😕. I will wait until after your meeting with @Andrea-Havron-NOAA to decide if more refactoring should be done here or if this should be merged. |
@kellijohnson-NOAA, @Bai-Li-NOAA and I think we need the changes from both this PR and her branch to move forward with standardizing distribution arguments. After this branch gets rebased with dev, she can rebase her branch with dev and I can create a new branch off of hers to work on these changes. |
* add data error checks in information * add get and set id functions to fleet interface and expose in rcpp_interface * update demo and tests * add distribution helper functions and tests * add tests to fleet interface * Fix formatting for tidyverse style. * Increase some of the documentation to better explain parameters. * Share argument documentation across functions. * Use match.arg(). * use && and || * add documentation * add examples * add check validity function * update error message to use cli formatting * add new tests * helper function for expected names
9dd80c0
to
786a85a
Compare
What is the feature?
How have you implemented the solution?
Does the PR impact any other area of the project, maybe another repo?