Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

245 fix querying public extraction function #246

Merged
merged 8 commits into from
Jan 13, 2025

Conversation

cbutsko
Copy link
Contributor

@cbutsko cbutsko commented Jan 9, 2025

Main changes:

  • fixed bug that caused excessive samples dropout when trying to align available extraction with user-defined season
  • renamed duplicate process_parquet function to avoid confusion with original one from presto-worldcereal repo

@cbutsko cbutsko self-assigned this Jan 9, 2025
@cbutsko cbutsko linked an issue Jan 9, 2025 that may be closed by this pull request
3 tasks
Copy link
Contributor

@kvantricht kvantricht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't find any wrong logic at first sight, but hard to check purely from code. Thanks for adding tests!!

@jdegerickx
Copy link
Contributor

I'm too stupid to understand what happens :)
I tried testing this locally. If I have a sample with valid_date of 2020-03-01 and a processing period marked by the user as 2018-01-01 -> 2018-12-31 I get a proposed valid date of 2019-11-01. Is this the expected behaviour?

start_date = row["start_date"]
end_date = row["end_date"]

proposed_valid_date_fwd = valid_date + pd.DateOffset(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this correct? Shouldn't it be '-' in the first one and '+' in the second one?
Guess I don't understand what it is used for

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, but this is exactly the case, no? like here in these lines

proposed_valid_date_fwd = valid_date + pd.DateOffset(
months=row["valid_month_shift_forward"]
)
proposed_valid_date_bwd = valid_date - pd.DateOffset(
months=row["valid_month_shift_backward"]
)

or do you refer to something else?

@@ -284,7 +286,7 @@ def process_parquet(
pd.DataFrame
processed dataframe with the necessary columns for training.
"""
from presto.utils import process_parquet as process_parquet_for_presto
from presto.utils import process_parquet

logger.info("Processing selected samples ...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way how processing_period_middle_ts is defined a few lines below doesn't seem to be future proof?
What if we move away from 12 month processing periods?

Copy link
Contributor Author

@cbutsko cbutsko Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a good point. I tried to make a small step towards more generic time series handling here (925ff21)
For the default case that we have (12 monthly timesteps), it replicates the previous implementation. It can also handle other frequencies/lengths.
But we might need to add similar logic to other relevant places as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we anticipate that this will happen within the CCN timeframe? Because if not, we may want to keep this as nice to have (track an issue somewhere) but not get distracted by this future option too much.

Copy link
Contributor

@jdegerickx jdegerickx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could briefly sit together and you once again talk me through this.
I lost the scheme you once drew which explains what should happen...

@cbutsko
Copy link
Contributor Author

cbutsko commented Jan 13, 2025

I'm too stupid to understand what happens :) I tried testing this locally. If I have a sample with valid_date of 2020-03-01 and a processing period marked by the user as 2018-01-01 -> 2018-12-31 I get a proposed valid date of 2019-11-01. Is this the expected behaviour?

No, this is not right. If extractions range allows, for this case the proposed_valid_date should be 2020-06-01. I also can't seem to reproduce this case with the code... Can you please share the start_date and end_date for the sample (not the user-defined range, but the min and max dates for which actual extractions are available)? I'll try to debug and prepare a better explanation, and let's discuss further during the meeting.

@jdegerickx
Copy link
Contributor

OK, looks better now!
Other question: should we introduce some sort of maximum shift? Would we really trust samples for which the validity time needs to be shifted by more than 6 months?

@cbutsko
Copy link
Contributor Author

cbutsko commented Jan 13, 2025

No, this is not right. If extractions range allows, for this case the proposed_valid_date should be 2020-06-01. I also can't seem to reproduce this case with the code... Can you please share the start_date and end_date for the sample (not the user-defined range, but the min and max dates for which actual extractions are available)? I'll try to debug and prepare a better explanation, and let's discuss further during the meeting.

Okay, I found a small mixup bug that was causing this, fixed here de55b03

Copy link
Contributor

@jdegerickx jdegerickx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, PR is ready to be merged.
Seperate issue to be created for informing the user on how validity time has been shifted using a different attribute...

@cbutsko
Copy link
Contributor Author

cbutsko commented Jan 13, 2025

OK, looks better now! Other question: should we introduce some sort of maximum shift? Would we really trust samples for which the validity time needs to be shifted by more than 6 months?

Summary of an offline discussion:
We leave it here as is, but create a new issue to (possibly) track that #247

@cbutsko cbutsko merged commit 99e1909 into main Jan 13, 2025
4 checks passed
@cbutsko cbutsko deleted the 245-fix-querying-public-extraction-function branch January 13, 2025 16:30
@kvantricht kvantricht linked an issue Jan 20, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix querying public extraction function training sample selection based on validity time
3 participants