245 fix querying public extraction function #246

cbutsko · 2025-01-09T12:47:38Z

Main changes:

Fixed a bug that caused excessive sample dropout when aligning the available extractions with the user-defined season.
Renamed the duplicate process_parquet function to avoid confusion with the original one from the presto-worldcereal repo.

kvantricht

Can't find any wrong logic at first sight, but hard to check purely from code. Thanks for adding tests!!

jdegerickx · 2025-01-10T09:38:12Z

I'm too stupid to understand what happens :)
I tried testing this locally. If I have a sample with valid_date of 2020-03-01 and a processing period marked by the user as 2018-01-01 -> 2018-12-31 I get a proposed valid date of 2019-11-01. Is this the expected behaviour?

jdegerickx · 2025-01-10T09:39:36Z

src/worldcereal/utils/refdata.py

+    start_date = row["start_date"]
+    end_date = row["end_date"]
+
+    proposed_valid_date_fwd = valid_date + pd.DateOffset(


is this correct? Shouldn't it be '-' in the first one and '+' in the second one?
Guess I don't understand what it is used for

hmm, but this is exactly the case, no? like here in these lines

worldcereal-classification/src/worldcereal/utils/refdata.py

Lines 244 to 249 in 44934e8

proposed_valid_date_fwd = valid_date + pd.DateOffset(
    months=row["valid_month_shift_forward"]
)
proposed_valid_date_bwd = valid_date - pd.DateOffset(
    months=row["valid_month_shift_backward"]
)

or do you refer to something else?
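
For reference, a small sketch of what the two offsets do. The sample date and the shift values are made up for illustration; only the column names come from the lines quoted above.

```python
import pandas as pd

# Hypothetical sample: the date and shift values are illustrative only.
valid_date = pd.Timestamp("2021-05-01")
row = {"valid_month_shift_forward": 1, "valid_month_shift_backward": 11}

# Forward candidate: move the valid date later in time.
proposed_valid_date_fwd = valid_date + pd.DateOffset(
    months=row["valid_month_shift_forward"]
)
# Backward candidate: move the valid date earlier in time.
proposed_valid_date_bwd = valid_date - pd.DateOffset(
    months=row["valid_month_shift_backward"]
)

print(proposed_valid_date_fwd.date(), proposed_valid_date_bwd.date())
```

With these values both candidates land in June, one in 2021 and one in 2020, so '+' in the forward case and '-' in the backward case move the date in the directions their names suggest.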

jdegerickx · 2025-01-10T09:41:41Z

src/worldcereal/utils/refdata.py

@@ -284,7 +286,7 @@ def process_parquet(
    pd.DataFrame
        processed dataframe with the necessary columns for training.
    """
-    from presto.utils import process_parquet as process_parquet_for_presto
+    from presto.utils import process_parquet

    logger.info("Processing selected samples ...")


The way processing_period_middle_ts is defined a few lines below doesn't seem to be future-proof.
What if we move away from 12 month processing periods?

that's a good point. I tried to make a small step towards more generic time series handling here (925ff21)
For the default case that we have (12 monthly timesteps), it replicates the previous implementation. It can also handle other frequencies/lengths.
But we might need to add similar logic to other relevant places as well.
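
To illustrate the idea (this is not the code from 925ff21, just a minimal sketch), the middle timestep could be derived from the period boundaries and a frequency instead of a hard-coded month offset. The helper name middle_timestamp, its freq parameter and the choice of the earlier of the two central timesteps are assumptions made here; that choice gives June for a January to December period.

```python
import pandas as pd


def middle_timestamp(start_date, end_date, freq="MS"):
    """Return the middle timestep of a processing period (hypothetical helper)."""
    timesteps = pd.date_range(start=start_date, end=end_date, freq=freq)
    if len(timesteps) == 0:
        raise ValueError("processing period contains no timesteps")
    # With an even number of timesteps, pick the earlier of the two central ones.
    return timesteps[(len(timesteps) - 1) // 2]


# Default case: 12 monthly timesteps.
middle_12 = middle_timestamp("2018-01-01", "2018-12-31")
# Shorter period: 6 monthly timesteps.
middle_6 = middle_timestamp("2018-01-01", "2018-06-30")
print(middle_12.date(), middle_6.date())
```

Because the index is computed from the number of timesteps, the same function works for other period lengths or frequencies without changing the logic.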

Do we anticipate that this will happen within the CCN timeframe? Because if not, we may want to keep this as nice to have (track an issue somewhere) but not get distracted by this future option too much.

jdegerickx

Perhaps we could briefly sit together and you once again talk me through this.
I lost the scheme you once drew which explains what should happen...

cbutsko · 2025-01-13T08:57:54Z

I'm too stupid to understand what happens :) I tried testing this locally. If I have a sample with valid_date of 2020-03-01 and a processing period marked by the user as 2018-01-01 -> 2018-12-31 I get a proposed valid date of 2019-11-01. Is this the expected behaviour?

No, this is not right. If the extraction range allows, for this case the proposed_valid_date should be 2020-06-01. I also can't seem to reproduce this case with the code... Can you please share the start_date and end_date for the sample (not the user-defined range, but the min and max dates for which actual extractions are available)? I'll try to debug and prepare a better explanation, and let's discuss further during the meeting.
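
One way to read the 2020-06-01 above, as a rough sketch only: assume the middle month of a January to December season is June, and compute how many months the valid date has to move forward or backward to reach that month. The helper month_shifts and the value of middle_month are assumptions for illustration; the check against the available extraction range and the choice between the two candidates are not modelled here.

```python
import pandas as pd


def month_shifts(valid_month, target_month):
    """Months to move forward and backward to reach target_month (hypothetical helper)."""
    fwd = (target_month - valid_month) % 12
    bwd = (valid_month - target_month) % 12
    return fwd, bwd


valid_date = pd.Timestamp("2020-03-01")
middle_month = 6  # assumed middle month of a Jan-Dec processing period

fwd, bwd = month_shifts(valid_date.month, middle_month)
candidate_fwd = valid_date + pd.DateOffset(months=fwd)
candidate_bwd = valid_date - pd.DateOffset(months=bwd)
print(fwd, bwd, candidate_fwd.date(), candidate_bwd.date())
```

Under these assumptions the forward candidate is the smaller shift (3 months versus 9) and lands on 2020-06-01.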

jdegerickx · 2025-01-13T13:13:13Z

OK, looks better now!
Other question: should we introduce some sort of maximum shift? Would we really trust samples for which the validity time needs to be shifted by more than 6 months?
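
As a sketch of what such a cap could look like (nothing like this is part of the PR; the names MAX_SHIFT_MONTHS, month_diff and within_max_shift are hypothetical):

```python
from datetime import date

MAX_SHIFT_MONTHS = 6  # hypothetical threshold


def month_diff(a, b):
    """Signed number of whole calendar months from b to a."""
    return (a.year - b.year) * 12 + (a.month - b.month)


def within_max_shift(valid_date, proposed_valid_date, max_shift=MAX_SHIFT_MONTHS):
    """True if the proposed date is at most max_shift months away from the original."""
    return abs(month_diff(proposed_valid_date, valid_date)) <= max_shift


print(within_max_shift(date(2020, 3, 1), date(2020, 6, 1)))  # 3 months forward
print(within_max_shift(date(2020, 3, 1), date(2019, 6, 1)))  # 9 months backward
```

Samples failing the check could be dropped or flagged, depending on how much a large shift should be trusted.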

cbutsko · 2025-01-13T16:12:42Z

No, this is not right. If the extraction range allows, for this case the proposed_valid_date should be 2020-06-01. I also can't seem to reproduce this case with the code... Can you please share the start_date and end_date for the sample (not the user-defined range, but the min and max dates for which actual extractions are available)? I'll try to debug and prepare a better explanation, and let's discuss further during the meeting.

Okay, I found a small mixup bug that was causing this, fixed here de55b03

jdegerickx

ok, PR is ready to be merged.
Separate issue to be created for informing the user, via a different attribute, how the validity time has been shifted...

cbutsko · 2025-01-13T16:29:22Z

OK, looks better now! Other question: should we introduce some sort of maximum shift? Would we really trust samples for which the validity time needs to be shifted by more than 6 months?

Summary of an offline discussion:
We leave it here as is, but create a new issue to (possibly) track that #247

Butsko Christina added 6 commits January 9, 2025 12:43

complete refactoring of get_best_valid_date function

3141a75

even more concise computation of month_diff

f24569e

adding test for get_best_valid_date function

c55a9c6

formatting

cb9bddf

renamed duplicated process_parquet function

6886157

Merge remote-tracking branch 'origin' into 245-fix-querying-public-ex…

44934e8

…traction-function

cbutsko requested review from kvantricht and jdegerickx January 9, 2025 12:47

cbutsko self-assigned this Jan 9, 2025

cbutsko linked an issue Jan 9, 2025 that may be closed by this pull request

Fix querying public extraction function #245

Closed

3 tasks

kvantricht approved these changes Jan 9, 2025

jdegerickx reviewed Jan 10, 2025

Butsko Christina and others added 2 commits January 13, 2025 10:35

more generic definition of processing_period_middle_ts

925ff21

forward and backward mixup 🤦‍♀️

de55b03

jdegerickx approved these changes Jan 13, 2025

cbutsko merged commit 99e1909 into main Jan 13, 2025
4 checks passed

cbutsko deleted the 245-fix-querying-public-extraction-function branch January 13, 2025 16:30

kvantricht linked an issue Jan 20, 2025 that may be closed by this pull request

training sample selection based on validity time #225

Closed
