`generate_cv_index_spt` #394

mitchellmanware · 2025-02-10T23:14:33Z

@sigmafelix @kyle-messier
I am trying to update the generate_cv_index_spt function to create a spatiotemporal cross validation method based on a combination of leave one location out and leave one time out where one "location" is a spatialsample::spatial_block_cv block and one "time" is a year's worth of data. It would look something like this

I have developed a version on beethoven_dev at https://github.com/mitchellmanware/beethoven_dev/blob/mm-0203/function.R.

The function, as expected, creates 25 (5 spatial x 5 temporal) folds where the testing (aka assessment) data for each set is one spatial block for one year. This is seen by checking the unique years available in the training and testing sets for each split.

> data <- qs2::qs_read("../beethoven/_targets/objects/dt_feat_calc_xyt")
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> data <- data[
  sample(nrow(data), nrow(data) * 0.3),
  c("site_id", "lon", "lat", "time", "Arithmetic.Mean", "light_00001")
]
> source("function.R")
> spt <- generate_cv_index_spt_dev(
  data = data,
  id = "site_id",
  coords = c("lon", "lat"),
  s_fold = 5L,
  time = "time",
  t_fold = 5L,
  year = TRUE
)
No missing values in `out_id`...
Warning messages:
1: In spatialsample::spatial_block_cv(data_sf, v = s_fold) :
  `spatial_block_cv()` expects your data to have an appropriate coordinate reference system (CRS).
ℹ If possible, try setting a CRS using `sf::st_set_crs()`.
ℹ Otherwise, `spatial_block_cv()` will assume your data is in projected coordinates.
> spt_rset <- beethoven::convert_cv_index_rset(
  cvindex = spt,
  data = data,
  cv_mode = "spatiotemporal"
)
> for (i in seq_along(spt_rset$splits)) {
  train <- rsample::analysis(spt_rset$splits[[i]])
  test <- rsample::assessment(spt_rset$splits[[i]])
  print(unique(substr(train$time, 1, 4)))
  print(unique(substr(test$time, 1, 4)))
}
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2018"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2019"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2020"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2021"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2022"
[1] "2019" "2022" "2018" "2020" "2021"
[1] "2018"
[1] "2018" "2022" "2019" "2020" "2021"
[1] "2019"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2020"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2021"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2022"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2018"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2019"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2020"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2021"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2022"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2018"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2019"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2020"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2021"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2022"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2018"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2019"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2020"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2021"
[1] "2019" "2018" "2022" "2020" "2021"
[1] "2022"

Is this a valid CV method? I based it on our custom functions but I don't know if manually making the CV splits like this is ok.

The text was updated successfully, but these errors were encountered:

kyle-messier · 2025-02-11T02:23:59Z

@mitchellmanware I'd say this is indeed a valid CV method.

Couple questions and comments:

If the number of temporal folds is 5, then we'd have 1.25 years be fold, so to get even years I think it would be 6.
We can just specify a year for time, regardless of the number of years in the total dataset, especially since that will increase over the long term
For the 5 spatial folds, are you using the blocks? I think we can do random or snake blocks, but just want to make sure we don't have too large a spatial block, which could lead to an overly pessimistic error estimate.

mitchellmanware · 2025-02-11T11:47:54Z

If the number of temporal folds is 5, then we'd have 1.25 years be fold, so to get even years I think it would be 6.

The function detects the maximum data and appends the first date of the following year to ensure one CV fold aligns with one range in `time_vec_split
https://github.com/mitchellmanware/beethoven_dev/blob/a8f2fca562593b15c2bb1fb04d421b46d5f97437/function.R#L82

We can just specify a year for time, regardless of the number of years in the total dataset, especially since that will increase over the long term

Do you mean perform a year-based comparison with == rather than the date comparisons with >= ... <?
ie

...
        data_sp$time >= time_vec_split[j] &
          data_sp$time < time_vec_split[j + 1] &
          data_sp$cv == sp_indices[i]
...

becomes

year_vec_split <- sort(unique(substr(data_sp[[time]], 1, 4)))
data_sp$year <- as.integer(substr(data_sp$year, 1, 4))
...
        data_sp$year == year_vec_split[j] & data_sp$cv == sp_indices[i]
...

For the 5 spatial folds, are you using the blocks? I think we can do random or snake blocks, but just want to make sure we don't have too large a spatial block, which could lead to an overly pessimistic error estimate.

I did not specify with this example but we can use random or snake as the elipse argument passes additional parameters to the spatialsample::spatial_block_cvcall. This will be clear in documentation

mitchellmanware · 2025-02-11T19:16:13Z

@sigmafelix @kyle-messier
Package and associated paper for spatiotemporal cross validation methods
https://mlr3spatiotempcv.mlr-org.com/reference/index.html#spatiotemporal-blocking-methods
https://www.jstatsoft.org/article/view/v111i07

sigmafelix · 2025-02-12T02:34:44Z

@mitchellmanware
I think the combination of spatial block indices and a year will be the easiest way to implement spatiotemporal block CV. However, for the future update with our semiannual update cycle, date-based filtering may be sustainable (preferably with the fixed cutpoint like as.Date(sprintf("%s-06-30", strftime(Sys.Date(), "%Y")))). Flexibility gives us good for adjusting cutpoints as the source data accumulate.

If we don't have to fix the number of folds, I think it would be better to use a semantic (or "common sense") temporal period to make folds, for example, one year or two years.

FYI, as you might have noticed, the initial version of generate_cv_index_spt was designed to make folds with half overlaps intentionally, which made the implementation a little complicated.

mitchellmanware · 2025-02-12T12:34:31Z

This is good input, thank you @sigmafelix.

When you say "semantic (or "common sense") temporal period to make folds, for example, one year or two years.", do you mean derive the number of folds from the years/dates available in $time?

mitchellmanware · 2025-02-12T13:17:38Z

For cut points at as.Date(sprintf("%s-06-30", strftime(Sys.Date(), "%Y"))), would this result in 6-month folds instead of year folds?

sigmafelix · 2025-02-12T13:57:31Z

@mitchellmanware I think that data in a new 6-month period can be appended to the last fold. Our initial period is five years and there are datasets for two following years (2023-2024), which may result in five folds of 18 months * 4 + 12 months or seven folds (12 months * 7).

mitchellmanware · 2025-02-12T14:32:28Z

true, but by deriving from the dates in data$time the number of folds will always correspond to the number of available years

...
  t_fold <- year_vec_split <- sort(unique(substr(data_sp[[time_id]], 1, 4)))
  time_vec_split <- as.Date(paste0(year_vec_split, "-01-01"))
  time_vec_split <- c(
    time_vec_split,
    as.Date(paste0(as.numeric(max(year_vec_split)) + 1, "-01-01"))
  )
...

think that also gets into the question of whether the models are re-trained at each new interval or if the first run models are fit to the new temporal points. Although the introduction of new AQS sites will also introduce new spatial points as well.

sigmafelix · 2025-02-14T00:29:09Z

Then I will modify the function to convert date objects to integer to get temporal blocks for spatiotemporal block CV. I think that the model needs to be retrained with the newly acquired datasets. The CV folds are generated fairly fast as the CV indexing function is lightweight (i.e., returning only integers).

mitchellmanware added help wanted Extra attention is needed question Further information is requested labels Feb 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`generate_cv_index_spt` #394

`generate_cv_index_spt` #394

mitchellmanware commented Feb 10, 2025

kyle-messier commented Feb 11, 2025

mitchellmanware commented Feb 11, 2025

mitchellmanware commented Feb 11, 2025

sigmafelix commented Feb 12, 2025 •

edited

Loading

mitchellmanware commented Feb 12, 2025

mitchellmanware commented Feb 12, 2025

sigmafelix commented Feb 12, 2025

mitchellmanware commented Feb 12, 2025

sigmafelix commented Feb 14, 2025

generate_cv_index_spt #394

generate_cv_index_spt #394

Comments

mitchellmanware commented Feb 10, 2025

kyle-messier commented Feb 11, 2025

mitchellmanware commented Feb 11, 2025

mitchellmanware commented Feb 11, 2025

sigmafelix commented Feb 12, 2025 • edited Loading

mitchellmanware commented Feb 12, 2025

mitchellmanware commented Feb 12, 2025

sigmafelix commented Feb 12, 2025

mitchellmanware commented Feb 12, 2025

sigmafelix commented Feb 14, 2025

`generate_cv_index_spt` #394

`generate_cv_index_spt` #394

sigmafelix commented Feb 12, 2025 •

edited

Loading