Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generate_cv_index_spt #394

Open
mitchellmanware opened this issue Feb 10, 2025 · 9 comments
Open

generate_cv_index_spt #394

mitchellmanware opened this issue Feb 10, 2025 · 9 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@mitchellmanware
Copy link
Collaborator

@sigmafelix @kyle-messier
I am trying to update the generate_cv_index_spt function to create a spatiotemporal cross validation method based on a combination of leave one location out and leave one time out where one "location" is a spatialsample::spatial_block_cv block and one "time" is a year's worth of data. It would look something like this

Image

I have developed a version on beethoven_dev at https://github.com/mitchellmanware/beethoven_dev/blob/mm-0203/function.R.

The function, as expected, creates 25 (5 spatial x 5 temporal) folds where the testing (aka assessment) data for each set is one spatial block for one year. This is seen by checking the unique years available in the training and testing sets for each split.

> data <- qs2::qs_read("../beethoven/_targets/objects/dt_feat_calc_xyt")
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> data <- data[
  sample(nrow(data), nrow(data) * 0.3),
  c("site_id", "lon", "lat", "time", "Arithmetic.Mean", "light_00001")
]
> source("function.R")
> spt <- generate_cv_index_spt_dev(
  data = data,
  id = "site_id",
  coords = c("lon", "lat"),
  s_fold = 5L,
  time = "time",
  t_fold = 5L,
  year = TRUE
)
No missing values in `out_id`...
Warning messages:
1: In spatialsample::spatial_block_cv(data_sf, v = s_fold) :
  `spatial_block_cv()` expects your data to have an appropriate coordinate reference system (CRS).
ℹ If possible, try setting a CRS using `sf::st_set_crs()`.
ℹ Otherwise, `spatial_block_cv()` will assume your data is in projected coordinates.
> spt_rset <- beethoven::convert_cv_index_rset(
  cvindex = spt,
  data = data,
  cv_mode = "spatiotemporal"
)
> for (i in seq_along(spt_rset$splits)) {
  train <- rsample::analysis(spt_rset$splits[[i]])
  test <- rsample::assessment(spt_rset$splits[[i]])
  print(unique(substr(train$time, 1, 4)))
  print(unique(substr(test$time, 1, 4)))
}
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2018"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2019"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2020"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2021"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2022"
[1] "2019" "2022" "2018" "2020" "2021"
[1] "2018"
[1] "2018" "2022" "2019" "2020" "2021"
[1] "2019"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2020"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2021"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2022"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2018"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2019"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2020"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2021"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2022"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2018"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2019"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2020"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2021"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2022"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2018"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2019"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2020"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2021"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2022"

Is this a valid CV method? I based it on our custom functions but I don't know if manually making the CV splits like this is ok.

@mitchellmanware mitchellmanware added help wanted Extra attention is needed question Further information is requested labels Feb 10, 2025
@kyle-messier
Copy link
Collaborator

@mitchellmanware I'd say this is indeed a valid CV method.

Couple questions and comments:

  1. If the number of temporal folds is 5, then we'd have 1.25 years be fold, so to get even years I think it would be 6.
  2. We can just specify a year for time, regardless of the number of years in the total dataset, especially since that will increase over the long term
  3. For the 5 spatial folds, are you using the blocks? I think we can do random or snake blocks, but just want to make sure we don't have too large a spatial block, which could lead to an overly pessimistic error estimate.

@mitchellmanware
Copy link
Collaborator Author

  1. If the number of temporal folds is 5, then we'd have 1.25 years be fold, so to get even years I think it would be 6.
  1. We can just specify a year for time, regardless of the number of years in the total dataset, especially since that will increase over the long term
  • Do you mean perform a year-based comparison with == rather than the date comparisons with >= ... <?
  • ie
...
        data_sp$time >= time_vec_split[j] &
          data_sp$time < time_vec_split[j + 1] &
          data_sp$cv == sp_indices[i]
...

becomes

year_vec_split <- sort(unique(substr(data_sp[[time]], 1, 4)))
data_sp$year <- as.integer(substr(data_sp$year, 1, 4))
...
        data_sp$year == year_vec_split[j] & data_sp$cv == sp_indices[i]
...
  1. For the 5 spatial folds, are you using the blocks? I think we can do random or snake blocks, but just want to make sure we don't have too large a spatial block, which could lead to an overly pessimistic error estimate.
  • I did not specify with this example but we can use random or snake as the elipse argument passes additional parameters to the spatialsample::spatial_block_cvcall. This will be clear in documentation

@mitchellmanware
Copy link
Collaborator Author

@sigmafelix
Copy link
Collaborator

sigmafelix commented Feb 12, 2025

@mitchellmanware
I think the combination of spatial block indices and a year will be the easiest way to implement spatiotemporal block CV. However, for the future update with our semiannual update cycle, date-based filtering may be sustainable (preferably with the fixed cutpoint like as.Date(sprintf("%s-06-30", strftime(Sys.Date(), "%Y")))). Flexibility gives us good for adjusting cutpoints as the source data accumulate.

If we don't have to fix the number of folds, I think it would be better to use a semantic (or "common sense") temporal period to make folds, for example, one year or two years.

FYI, as you might have noticed, the initial version of generate_cv_index_spt was designed to make folds with half overlaps intentionally, which made the implementation a little complicated.

@mitchellmanware
Copy link
Collaborator Author

This is good input, thank you @sigmafelix.

When you say "semantic (or "common sense") temporal period to make folds, for example, one year or two years.", do you mean derive the number of folds from the years/dates available in $time?

@mitchellmanware
Copy link
Collaborator Author

For cut points at as.Date(sprintf("%s-06-30", strftime(Sys.Date(), "%Y"))), would this result in 6-month folds instead of year folds?

@sigmafelix
Copy link
Collaborator

@mitchellmanware I think that data in a new 6-month period can be appended to the last fold. Our initial period is five years and there are datasets for two following years (2023-2024), which may result in five folds of 18 months * 4 + 12 months or seven folds (12 months * 7).

@mitchellmanware
Copy link
Collaborator Author

true, but by deriving from the dates in data$time the number of folds will always correspond to the number of available years

...
  t_fold <- year_vec_split <- sort(unique(substr(data_sp[[time_id]], 1, 4)))
  time_vec_split <- as.Date(paste0(year_vec_split, "-01-01"))
  time_vec_split <- c(
    time_vec_split,
    as.Date(paste0(as.numeric(max(year_vec_split)) + 1, "-01-01"))
  )
...

think that also gets into the question of whether the models are re-trained at each new interval or if the first run models are fit to the new temporal points. Although the introduction of new AQS sites will also introduce new spatial points as well.

@sigmafelix
Copy link
Collaborator

Then I will modify the function to convert date objects to integer to get temporal blocks for spatiotemporal block CV. I think that the model needs to be retrained with the newly acquired datasets. The CV folds are generated fairly fast as the CV indexing function is lightweight (i.e., returning only integers).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants