
feat(pipeline): add -f/--force flag to bypass cache on pipeline run #169

Open
sanskar0627 wants to merge 1 commit into m-lab:main from sanskar0627:pipeline_run_force

@sanskar0627

Add -f/--force Flag to iqb pipeline run

Fixes #168

What

Adds -f/--force flag to iqb pipeline run so users can bypass the local cache and force fresh BigQuery queries. This addresses the TODO at the top of pipeline_run.py.

Why

When pipeline run executes, it caches query results as data.parquet and stats.json files under the data directory. Once these files exist, every subsequent pipeline run skips BigQuery entirely, and there is no CLI-level way to force a re-query.

This becomes a problem when:

  • Upstream BigQuery data has been updated
  • Cached results are known to be stale or corrupted
  • A developer wants to verify that queries still produce correct results

The pipeline short-circuits at two places:

  1. sync_mlab() gate: checks entry.exists() and skips calling entry.sync() altogether if cache files are on disk.
  2. _bq_syncer() guard: even if sync() is called, the BigQuery syncer returns early without querying when cache files exist.

Both of these need to be bypassed for --force to work properly.
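To illustrate why bypassing only one of the two checks is not enough, here is a minimal sketch of the control flow described above. FakeEntry, sync_all, and count_queries are hypothetical stand-ins written for this example and are not the project's real classes or signatures.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEntry:
    """Hypothetical stand-in for a cache entry; not the project's real class."""

    cached: bool = True
    queries: list = field(default_factory=list)

    def exists(self) -> bool:
        return self.cached

    def sync(self, syncer_force: bool) -> None:
        # Second check (the _bq_syncer() guard): return early when cached.
        if not syncer_force and self.exists():
            return
        self.queries.append("bigquery")
        self.cached = True


def sync_all(entries, gate_force: bool, syncer_force: bool) -> None:
    # First check (the sync_mlab() gate): skip sync() for cached entries.
    for entry in entries:
        if gate_force or not entry.exists():
            entry.sync(syncer_force)


def count_queries(gate_force: bool, syncer_force: bool) -> int:
    """Return how many queries run against one already-cached entry."""
    entry = FakeEntry(cached=True)
    sync_all([entry], gate_force, syncer_force)
    return len(entry.queries)
```

In this sketch, a query is issued for a cached entry only when both checks are bypassed, which is why the flag has to be threaded through both layers.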

Changes

pipeline_run.py

  • Removed the TODO comment
  • Added @click.option("-f", "--force", ...) to the run command, matching the existing --force pattern from cache pull
  • Passes force=force into sync_mlab()
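A boolean flag of this shape can be declared with click roughly as follows. This is a sketch only: the command body, the help text, and the invoke helper are assumptions made for illustration, and the real command takes other options that are omitted here.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option(
    "-f",
    "--force",
    is_flag=True,
    default=False,
    help="Bypass the local cache and re-run the BigQuery queries.",  # assumed wording
)
def run(force: bool) -> None:
    """Hypothetical stand-in for `iqb pipeline run`."""
    # The real command would pass force=force into sync_mlab() here.
    click.echo(f"force={force}")


def invoke(args) -> str:
    """Run the command in-process and return its trimmed output."""
    result = CliRunner().invoke(run, args)
    assert result.exit_code == 0, result.output
    return result.output.strip()
```

Calling invoke([]) exercises the default path, while invoke(["-f"]) and invoke(["--force"]) exercise the short and long spellings of the flag.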

iqb_pipeline.py

  • Added force: bool = False parameter to sync_mlab()
  • Changed the cache gate from if not entry.exists() to if force or not entry.exists()
  • Passes force through to get_cache_entry()

pipeline.py

  • Added force: bool = False to get_cache_entry() and _bq_syncer()
  • When force=True, clears any existing syncers (e.g. remote cache) before registering the BigQuery syncer
  • Uses functools.partial to pass force into _bq_syncer while keeping the syncer callable signature unchanged
  • _bq_syncer now only skips when not force and entry.exists()
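The functools.partial approach can be sketched as below. The Entry class, the syncer list, the bool return value, and remote_cache_syncer are assumptions made for this example; only the idea of binding force while keeping a one-argument syncer callable comes from the description above. Clearing the list on force presumably ensures that a forced sync cannot be satisfied by the remote cache.

```python
from functools import partial


class Entry:
    """Hypothetical cache entry holding an ordered list of syncer callables."""

    def __init__(self, cached: bool) -> None:
        self.cached = cached
        self.syncers = []  # each syncer is called as syncer(entry) -> bool
        self.log = []

    def exists(self) -> bool:
        return self.cached

    def sync(self) -> bool:
        # Try syncers in order and stop at the first one that succeeds.
        return any(syncer(self) for syncer in self.syncers)


def remote_cache_syncer(entry: Entry) -> bool:
    # Stand-in for a remote-cache download; never queries BigQuery.
    entry.log.append("remote")
    entry.cached = True
    return True


def bq_syncer(entry: Entry, *, force: bool = False) -> bool:
    # Skip only when not forced and the cache files already exist.
    if not force and entry.exists():
        return True
    entry.log.append("bigquery")
    entry.cached = True
    return True


def get_cache_entry(cached: bool, force: bool = False) -> Entry:
    entry = Entry(cached)
    entry.syncers.append(remote_cache_syncer)
    if force:
        # Drop earlier syncers so only the BigQuery syncer can run.
        entry.syncers.clear()
    # partial binds force, so the syncer is still called as syncer(entry).
    entry.syncers.append(partial(bq_syncer, force=force))
    return entry
```

Because force is bound at registration time, code that iterates over the syncers does not need to know the flag exists.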

No changes were made to PipelineCacheEntry, PipelineCacheManager, or the PipelineRemoteCache protocol.

Tests

pipeline_run_test.py

  • Added TestPipelineRunForceFlag, which invokes the CLI with --force and verifies that sync_mlab receives force=True
  • Updated existing tests to assert force=False on default calls

iqb_pipeline_test.py

  • Added test_force_syncs_existing_entries, which creates entries where exists=True, calls sync_mlab(force=True), and verifies that both entries get synced (contrasts with the existing test_skips_existing_entries)

pipeline_test.py

  • Added test_bq_syncer_force_queries_when_exists, which creates cache files on disk, calls get_cache_entry(force=True), triggers sync(), and verifies that BigQuery execute_query was called (contrasts with the existing test_bq_syncer_skip_when_exists)
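The contrast between the skip test and the force test can be sketched with unittest.mock. The sync_mlab below is a simplified stand-in that takes a list of entries directly; the real function's signature and the real test bodies are not shown in this description, so treat everything except the two test names as an assumption.

```python
from unittest.mock import Mock


def sync_mlab(entries, force: bool = False) -> None:
    """Simplified stand-in for the cache gate; not the project's function."""
    for entry in entries:
        if force or not entry.exists():
            entry.sync()


def make_entries(exists: bool, count: int = 2):
    """Build mock entries whose exists() reports the given value."""
    entries = []
    for _ in range(count):
        entry = Mock()
        entry.exists.return_value = exists
        entries.append(entry)
    return entries


def test_skips_existing_entries() -> None:
    entries = make_entries(exists=True)
    sync_mlab(entries)
    for entry in entries:
        entry.sync.assert_not_called()


def test_force_syncs_existing_entries() -> None:
    entries = make_entries(exists=True)
    sync_mlab(entries, force=True)
    for entry in entries:
        entry.sync.assert_called_once_with()
```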

All 290 tests pass. Ruff and pyright report zero issues.

@bassosimone
Collaborator

After listing our project for GSoC, we received a large number of pull requests across several repositories. We are working through the backlog, but this will take time. We will get back to this pull request eventually. In the meantime, if you are a GSoC applicant, please read our updated GSoC policy: https://github.com/m-lab/gsoc/.

Development

Successfully merging this pull request may close these issues.

pipeline run ignores --force , no way to bypass cache and re-query BigQuery
