feat(pipeline): add -f/--force flag to bypass cache on pipeline run #169
Open
sanskar0627 wants to merge 1 commit into m-lab:main
Collaborator
After listing our project for GSoC, we received a large number of pull requests across several repositories. We are working through the backlog, but this will take time. We will get back to this pull request eventually. In the meantime, if you are a GSoC applicant, please read our updated GSoC policy: https://github.com/m-lab/gsoc/.
Add `-f/--force` flag to `iqb pipeline run` (fix #168)

What

Adds a `-f/--force` flag to `iqb pipeline run` so users can bypass the local cache and force fresh BigQuery queries. This addresses the TODO at the top of `pipeline_run.py`.

Why
When `pipeline run` executes, it caches query results as `data.parquet` and `stats.json` files under the data directory. Once these files exist, every subsequent `pipeline run` skips BigQuery entirely, and there is no CLI-level way to force a re-query.

This becomes a problem when:
The pipeline short-circuits in two places:

- `sync_mlab()` gate: checks `entry.exists()` and skips calling `entry.sync()` altogether if cache files are on disk.
- `_bq_syncer()` guard: even if `sync()` is called, the BigQuery syncer returns early without querying when cache files exist.

Both of these need to be bypassed for `--force` to work properly.

Changes
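The two short-circuits, and the reason both have to honor `force`, can be sketched in isolation. This is a minimal illustration under stated assumptions, not the project's code: `FakeEntry`, `bq_syncer`, `sync_all`, and `make_entry` are hypothetical stand-ins, and the real `PipelineCacheEntry` and `_bq_syncer` signatures may differ. The sketch binds `force` with `functools.partial` so the syncer keeps a one-argument call signature.

```python
from dataclasses import dataclass, field
from functools import partial
from typing import Callable, List


@dataclass
class FakeEntry:
    """Hypothetical stand-in for a cache entry (not PipelineCacheEntry)."""

    cached: bool
    syncers: List[Callable[["FakeEntry"], bool]] = field(default_factory=list)
    queries: int = 0  # counts simulated BigQuery calls

    def exists(self) -> bool:
        return self.cached

    def sync(self) -> None:
        # Run syncers in order until one reports success.
        for syncer in self.syncers:
            if syncer(self):
                return


def bq_syncer(entry: FakeEntry, *, force: bool = False) -> bool:
    """Guard 2: skip the query only when not forced and the cache exists."""
    if not force and entry.exists():
        return True
    entry.queries += 1  # stands in for the real BigQuery query
    entry.cached = True
    return True


def sync_all(entries: List[FakeEntry], force: bool = False) -> None:
    """Gate 1: call sync() when forced or when the cache is missing."""
    for entry in entries:
        if force or not entry.exists():
            entry.sync()


def make_entry(cached: bool, force: bool = False) -> FakeEntry:
    entry = FakeEntry(cached=cached)
    # partial binds force; the syncer is still called with one argument.
    entry.syncers.append(partial(bq_syncer, force=force))
    return entry
```

In this sketch, forcing only one of the two checks leaves the cached entry untouched: `sync_all(..., force=True)` with an unforced syncer returns early at the guard, and a forced syncer is never reached if the gate is not forced.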
`pipeline_run.py`

- Added `@click.option("-f", "--force", ...)` to the `run` command, matching the existing `--force` pattern from `cache pull`
- Passed `force=force` into `sync_mlab()`

`iqb_pipeline.py`

- Added a `force: bool = False` parameter to `sync_mlab()`
- Changed `if not entry.exists()` to `if force or not entry.exists()`
- Passed `force` through to `get_cache_entry()`

`pipeline.py`

- Added `force: bool = False` to `get_cache_entry()` and `_bq_syncer()`
- When `force=True`, clears any existing syncers (e.g. remote cache) before registering the BigQuery syncer
- Uses `functools.partial` to pass `force` into `_bq_syncer` while keeping the syncer callable signature unchanged
- `_bq_syncer` now only skips when `not force and entry.exists()`

No changes were made to `PipelineCacheEntry`, `PipelineCacheManager`, or the `PipelineRemoteCache` protocol.

Tests
`pipeline_run_test.py`

- `TestPipelineRunForceFlag` invokes the CLI with `--force` and verifies `sync_mlab` receives `force=True`
- Verifies `force=False` on default calls

`iqb_pipeline_test.py`

- `test_force_syncs_existing_entries` creates entries where `exists=True`, calls `sync_mlab(force=True)`, and verifies both entries get synced (contrasts with existing `test_skips_existing_entries`)

`pipeline_test.py`

- `test_bq_syncer_force_queries_when_exists` creates cache files on disk, calls `get_cache_entry(force=True)`, triggers `sync()`, and verifies BigQuery `execute_query` was called (contrasts with existing `test_bq_syncer_skip_when_exists`)

All 290 tests pass. Ruff and pyright report zero issues.
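For readers who want to see the shape of the contrasting tests, here is a rough sketch using only `unittest.mock`. It is an assumption-laden illustration rather than the PR's test code: `sync_entries` is a hypothetical stand-in for the gate in `sync_mlab()`, and the entries are plain `MagicMock` objects instead of real cache entries.

```python
from unittest.mock import MagicMock


def sync_entries(entries, force=False):
    """Hypothetical stand-in for the gate described for sync_mlab()."""
    for entry in entries:
        if force or not entry.exists():
            entry.sync()


def make_existing_entry():
    # A mock entry whose cache files are reported as already on disk.
    entry = MagicMock()
    entry.exists.return_value = True
    return entry


def test_force_syncs_existing_entries():
    entries = [make_existing_entry(), make_existing_entry()]
    sync_entries(entries, force=True)
    for entry in entries:
        entry.sync.assert_called_once()


def test_skips_existing_entries():
    entries = [make_existing_entry(), make_existing_entry()]
    sync_entries(entries)
    for entry in entries:
        entry.sync.assert_not_called()
```

Pairing the forced test with the default-path test keeps the existing skip behavior pinned down while the new flag is exercised.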