
feat(helm): Add multi-node deployment support with configurable scheduling. #1829

Merged
junhaoliao merged 20 commits into y-scope:main from
junhaoliao:helm-proper-multi-node
Jan 14, 2026

Conversation

@junhaoliao
Member

@junhaoliao junhaoliao commented Dec 26, 2025

Description

Note

This PR is part of the ongoing work for #1309. More PRs will be submitted until the Helm chart is complete and fully functional.

Adds support for multi-node deployments with configurable worker scheduling and automatic
control-plane toleration.

Key Features

  • Deployment mode flag (distributedDeployment): Controls PersistentVolume type and scheduling behavior

    • distributedDeployment=false (default): Single-node mode with local PVs bound to control-plane node;
      pods automatically tolerate control-plane taints
    • distributedDeployment=true: Multi-node mode with hostPath PVs (assumes shared storage like NFS/CephFS
      mounted at the same path on all nodes)
  • Per-component scheduling: Each worker type can be configured with:

    • nodeSelector: Target specific node labels
    • affinity: Pod/node affinity rules
    • tolerations: Tolerate node taints
    • topologySpreadConstraints: Spread pods across failure domains
  • Configurable worker replicas: compressionWorker.replicas and queryWorker.replicas

  • StorageClass management: Chart creates local-storage StorageClass with
    WaitForFirstConsumer binding mode by default
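The features above translate into values.yaml overrides. The following is an illustrative sketch, not the chart's shipped defaults: the top-level keys (distributedDeployment, compressionWorker.replicas, queryWorker.replicas, and the per-component scheduling fields) come from the PR description, while the specific labels, taints, and counts are hypothetical.

```yaml
# Hypothetical overrides for a multi-node deployment with dedicated node pools.
# Key names follow the PR description; concrete values are assumptions.
distributedDeployment: true

compressionWorker:
  replicas: 3
  nodeSelector:
    node-pool: compression          # assumed node label
  tolerations:
    - key: "dedicated"              # assumed taint on the compression pool
      operator: "Equal"
      value: "compression"
      effect: "NoSchedule"

queryWorker:
  replicas: 2
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: query-worker         # assumed pod label
```

With distributedDeployment: true, the chart expects shared storage (e.g. NFS/CephFS) mounted at the same path on every node, since PVs become hostPath volumes without node affinity.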

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Ran test.sh - single-node deployment with all pods running
  • Ran test-multi-shared.sh - multi-node deployment with shared worker nodes
  • Ran test-multi-dedicated.sh - multi-node deployment with dedicated node pools; verified
    compression-worker scheduled on compression node, query-worker scheduled on query node

Summary by CodeRabbit

  • New Features

    • Configurable replicas for compression, query and reducer workers.
    • New scheduling options (node selectors, affinities, tolerations, topologySpread) with distribution-aware defaults.
    • Conditional static/local persistent volumes and optional StorageClass for local deployments.
    • New setup scripts to automate single- and multi-node KinD test clusters.
  • Chores

    • Added deployment flags for distributed mode and worker concurrency; chart metadata version updated.


@junhaoliao junhaoliao requested a review from a team as a code owner December 26, 2025 12:03
@coderabbitai
Contributor

coderabbitai bot commented Dec 26, 2025

Caution

Review failed

Failed to post review comments

Walkthrough

Helm chart and test tooling updated: chart version bumped; PV helper renamed and extended for distributed vs single-node modes; scheduling and replica configuration added; many PV templates updated to use the new helper; StorageClass template added; new modular KinD test setup scripts added and legacy test.sh removed.

Changes

Cohort / File(s) Summary
Helper template & scheduling
tools/deployment/package-helm/templates/_helpers.tpl
Renamed clp.createLocalPv to clp.createStaticPv; added clp.createSchedulingConfigs; added clp.awsConfigVolume and clp.awsConfigVolumeMount; PV logic now branches on distributedDeployment and sources storage.storageClassName.
PV template updates
tools/deployment/package-helm/templates/*-logs-pv.yaml, .../*-data-pv.yaml, .../*-staged-*-pv.yaml (many files under templates)
Replaced calls from clp.createLocalPv to clp.createStaticPv across PV templates; removed nodeRole arguments; PV creation now uses the new helper behavior.
Deployment templates
tools/deployment/package-helm/templates/compression-worker-deployment.yaml, tools/deployment/package-helm/templates/query-worker-deployment.yaml, tools/deployment/package-helm/templates/reducer-deployment.yaml
Parameterized replicas (.Values.*.replicas), added clp.createSchedulingConfigs in pod spec, and introduced conditional volume mounts for archives, staged-archives, streams, aws config and logs input based on storage/type values.
Storage config & chart version
tools/deployment/package-helm/Chart.yaml, tools/deployment/package-helm/values.yaml, tools/deployment/package-helm/templates/storage-class.yaml
Chart version bumped to 0.1.2-dev.18; added distributedDeployment, workerConcurrency, per-worker replicas blocks, and storage.storageClassName; added conditional local-storage StorageClass template.
Test infra scripts
tools/deployment/package-helm/.set-up-common.sh, tools/deployment/package-helm/set-up-test.sh, tools/deployment/package-helm/set-up-multi-shared-test.sh, tools/deployment/package-helm/set-up-multi-dedicated-test.sh, removed tools/deployment/package-helm/test.sh
Added .set-up-common.sh with shared KinD/chart helper functions (directory setup, sample download, Kind config generation, pod readiness polling); added three setup scripts for single-node, multi-shared and multi-dedicated scenarios; removed legacy test.sh.
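The multi-node test scripts build on KinD cluster configs like the one generate_kind_config produces. The exact emitted config is not shown in this PR page, so the following is an assumed sketch of a multi-dedicated topology: one control-plane node plus two labeled worker nodes that the nodeSelector examples could target.

```yaml
# Illustrative KinD config for the multi-dedicated scenario
# (an assumption about what generate_kind_config might emit).
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
    labels:
      node-pool: compression   # assumed label matched by compressionWorker.nodeSelector
  - role: worker
    labels:
      node-pool: query         # assumed label matched by queryWorker.nodeSelector
```

Passing such a file via kind create cluster --config yields the node layout the dedicated-pool test verifies (compression-worker on the compression node, query-worker on the query node).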

Sequence Diagram(s)

sequenceDiagram
    participant Script as Setup Script
    participant Common as .set-up-common.sh
    participant Kind as Kind CLI
    participant Helm as Helm CLI
    participant K8s as Kubernetes API
    participant Kubectl as kubectl

    Script->>Common: prepare_environment(cluster_name)
    Common->>Common: create_clp_directories()
    Common->>Common: download_samples() [background]

    Script->>Common: generate_kind_config(worker_count)
    Common-->>Script: Kind cluster config YAML

    Script->>Kind: kind create cluster --config
    Kind-->>K8s: Create nodes

    Script->>Helm: helm install (chart, values)
    Helm->>K8s: Apply manifests (StorageClass, PVs, Deployments)
    K8s-->>Kubectl: Resource statuses

    Script->>Common: wait_for_cluster_ready()
    Common->>Common: wait for sample download completion
    Common->>Kubectl: poll Jobs & Pods
    Kubectl-->>K8s: get Pod/Job status
    K8s-->>Common: Ready/Complete
    Common-->>Script: ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main objective: adding multi-node deployment support with configurable scheduling to the Helm chart.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.





@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1f88ee and fe8365f.

📒 Files selected for processing (40)
  • tools/deployment/package-helm/.test-common.sh
  • tools/deployment/package-helm/Chart.yaml
  • tools/deployment/package-helm/templates/_helpers.tpl
  • tools/deployment/package-helm/templates/api-server-deployment.yaml
  • tools/deployment/package-helm/templates/api-server-logs-pv.yaml
  • tools/deployment/package-helm/templates/api-server-logs-pvc.yaml
  • tools/deployment/package-helm/templates/api-server-service.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-logs-pv.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-user-logs-pv.yaml
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
  • tools/deployment/package-helm/templates/compression-worker-logs-pv.yaml
  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml
  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml
  • tools/deployment/package-helm/templates/compression-worker-tmp-pv.yaml
  • tools/deployment/package-helm/templates/configmap.yaml
  • tools/deployment/package-helm/templates/database-data-pv.yaml
  • tools/deployment/package-helm/templates/database-logs-pv.yaml
  • tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
  • tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml
  • tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml
  • tools/deployment/package-helm/templates/query-scheduler-logs-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-logs-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
  • tools/deployment/package-helm/templates/queue-logs-pv.yaml
  • tools/deployment/package-helm/templates/redis-data-pv.yaml
  • tools/deployment/package-helm/templates/redis-logs-pv.yaml
  • tools/deployment/package-helm/templates/reducer-logs-pv.yaml
  • tools/deployment/package-helm/templates/results-cache-data-pv.yaml
  • tools/deployment/package-helm/templates/results-cache-logs-pv.yaml
  • tools/deployment/package-helm/templates/shared-data-archives-pv.yaml
  • tools/deployment/package-helm/templates/shared-data-streams-pv.yaml
  • tools/deployment/package-helm/templates/storage-class.yaml
  • tools/deployment/package-helm/templates/webui-deployment.yaml
  • tools/deployment/package-helm/test-multi-dedicated.sh
  • tools/deployment/package-helm/test-multi-shared.sh
  • tools/deployment/package-helm/test.sh
  • tools/deployment/package-helm/values.yaml
🧰 Additional context used
🧠 Learnings (10)
📚 Learning: 2025-12-19T05:03:42.629Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1818
File: tools/deployment/package-helm/templates/configmap.yaml:12-12
Timestamp: 2025-12-19T05:03:42.629Z
Learning: In the y-scope/clp Helm chart, the `api_server.host` configuration field in the ConfigMap (tools/deployment/package-helm/templates/configmap.yaml) is unused because no other k8s internal services need to reach the API server—it's only accessed from outside the cluster via NodePort.

Applied to files:

  • tools/deployment/package-helm/templates/api-server-deployment.yaml
  • tools/deployment/package-helm/templates/api-server-logs-pv.yaml
  • tools/deployment/package-helm/templates/configmap.yaml
  • tools/deployment/package-helm/templates/api-server-service.yaml
📚 Learning: 2025-11-03T16:17:40.223Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1535
File: components/clp-rust-utils/src/clp_config/package/config.rs:47-61
Timestamp: 2025-11-03T16:17:40.223Z
Learning: In the y-scope/clp repository, the `ApiServer` struct in `components/clp-rust-utils/src/clp_config/package/config.rs` is a Rust-native configuration type and does not mirror any Python code, unlike other structs in the same file (Config, Database, ResultsCache, Package) which are mirrors of Python definitions.

Applied to files:

  • tools/deployment/package-helm/templates/api-server-deployment.yaml
  • tools/deployment/package-helm/templates/api-server-logs-pv.yaml
  • tools/deployment/package-helm/templates/configmap.yaml
  • tools/deployment/package-helm/templates/api-server-service.yaml
📚 Learning: 2025-12-04T03:31:55.239Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1690
File: tools/deployment/package/docker-compose-all.yaml:424-427
Timestamp: 2025-12-04T03:31:55.239Z
Learning: In tools/deployment/package/docker-compose-all.yaml, the query-worker service writes to /var/data/streams (CLP_STREAM_OUTPUT_DIR_HOST mount), so this directory must remain read-write and should not be mounted with :ro flag.

Applied to files:

  • tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
  • tools/deployment/package-helm/templates/query-worker-deployment.yaml
  • tools/deployment/package-helm/templates/webui-deployment.yaml
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: In the CLP system, streams can be deterministically recreated if needed for debugging purposes, so it's acceptable to delete them even if their S3 upload fails.

Applied to files:

  • tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
📚 Learning: 2025-05-06T10:07:04.654Z
Learnt from: kirkrodrigues
Repo: y-scope/clp PR: 881
File: components/core/tools/scripts/lib_install/macos/install-all.sh:11-12
Timestamp: 2025-05-06T10:07:04.654Z
Learning: In CLP installation scripts, temporary directories with downloaded files should not be automatically cleaned up on failure (e.g., with EXIT traps) to preserve artifacts for debugging purposes.

Applied to files:

  • tools/deployment/package-helm/.test-common.sh
  • tools/deployment/package-helm/test.sh
📚 Learning: 2025-07-07T17:43:04.349Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Applied to files:

  • tools/deployment/package-helm/.test-common.sh
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.

Applied to files:

  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • tools/deployment/package-helm/values.yaml
📚 Learning: 2025-11-28T15:12:53.530Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1655
File: components/clp-package-utils/clp_package_utils/controller.py:183-189
Timestamp: 2025-11-28T15:12:53.530Z
Learning: In the y-scope/clp project, root database credentials (root username and password) are required configuration items for all deployments. The `credentials[ClpDbUserType.ROOT]` entry is guaranteed to exist and can be accessed directly without optional handling.

Applied to files:

  • tools/deployment/package-helm/values.yaml
🧬 Code graph analysis (2)
tools/deployment/package-helm/test-multi-dedicated.sh (2)
tools/deployment/package-helm/.test-common.sh (3)
  • create_clp_directories (58-68)
  • download_samples (72-83)
  • wait_for_pods (18-55)
tools/deployment/package-helm/test.sh (1)
  • wait_for_pods (17-54)
tools/deployment/package-helm/test-multi-shared.sh (2)
tools/deployment/package-helm/.test-common.sh (3)
  • create_clp_directories (58-68)
  • download_samples (72-83)
  • wait_for_pods (18-55)
tools/deployment/package-helm/test.sh (1)
  • wait_for_pods (17-54)
🪛 Shellcheck (0.11.0)
tools/deployment/package-helm/test-multi-dedicated.sh

[info] 11-11: Not following: ./.test-common.sh was not specified as input (see shellcheck -x).

(SC1091)


[info] 99-99: Double quote to prevent globbing and word splitting.

(SC2086)

tools/deployment/package-helm/test-multi-shared.sh

[info] 11-11: Not following: ./.test-common.sh was not specified as input (see shellcheck -x).

(SC1091)


[info] 78-78: Double quote to prevent globbing and word splitting.

(SC2086)

tools/deployment/package-helm/.test-common.sh

[warning] 76-76: SAMPLE_DOWNLOAD_PID appears unused. Verify use (or export if used externally).

(SC2034)

🪛 YAMLlint (1.37.1)
tools/deployment/package-helm/templates/results-cache-data-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/api-server-deployment.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 5-5: too many spaces inside braces

(braces)


[error] 5-5: too many spaces inside braces

(braces)


[error] 7-7: too many spaces inside braces

(braces)


[error] 13-13: too many spaces inside braces

(braces)


[error] 18-18: too many spaces inside braces

(braces)


[error] 21-21: too many spaces inside braces

(braces)


[error] 21-21: too many spaces inside braces

(braces)


[error] 24-24: too many spaces inside braces

(braces)


[error] 24-24: too many spaces inside braces

(braces)


[error] 25-25: too many spaces inside braces

(braces)


[error] 25-25: too many spaces inside braces

(braces)


[error] 26-26: too many spaces inside braces

(braces)


[error] 26-26: too many spaces inside braces

(braces)


[error] 32-32: too many spaces inside braces

(braces)


[error] 37-37: too many spaces inside braces

(braces)


[error] 46-46: too many spaces inside braces

(braces)


[error] 46-46: too many spaces inside braces

(braces)


[error] 51-51: too many spaces inside braces

(braces)


[error] 51-51: too many spaces inside braces

(braces)


[error] 61-61: too many spaces inside braces

(braces)


[error] 64-64: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 9-9: too many spaces inside braces

(braces)


[error] 10-10: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/database-data-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 8-8: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/api-server-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 9-9: too many spaces inside braces

(braces)


[error] 10-10: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/query-worker-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/compression-worker-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/database-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/compression-scheduler-user-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/api-server-logs-pvc.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 8-8: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/results-cache-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/query-worker-deployment.yaml

[error] 9-9: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)


[error] 23-23: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 2-2: too many spaces inside braces

(braces)


[error] 3-3: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/shared-data-archives-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 8-8: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/reducer-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 2-2: too many spaces inside braces

(braces)


[error] 3-3: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/storage-class.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 7-7: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/redis-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/compression-worker-tmp-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/compression-worker-deployment.yaml

[error] 9-9: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)


[error] 23-23: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/garbage-collector-deployment.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 2-2: too many spaces inside braces

(braces)


[error] 3-3: too many spaces inside braces

(braces)


[error] 64-64: too many spaces inside braces

(braces)


[error] 67-67: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/shared-data-streams-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/queue-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 9-9: too many spaces inside braces

(braces)


[error] 10-10: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/query-scheduler-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/redis-data-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/compression-scheduler-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/api-server-service.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 5-5: too many spaces inside braces

(braces)


[error] 5-5: too many spaces inside braces

(braces)


[error] 7-7: too many spaces inside braces

(braces)


[error] 12-12: too many spaces inside braces

(braces)


[error] 17-17: too many spaces inside braces

(braces)


[error] 17-17: too many spaces inside braces

(braces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: check-generated
🔇 Additional comments (56)
tools/deployment/package-helm/Chart.yaml (1)

3-3: LGTM: Version bump aligns with chart changes.

The version increment from 0.1.2-dev.11 to 0.1.2-dev.16 appropriately reflects the multi-node deployment support and storage refactoring introduced in this PR.

tools/deployment/package-helm/templates/shared-data-streams-pv.yaml (1)

1-8: LGTM: ReadWriteMany access mode is appropriate for shared streams.

The ReadWriteMany access mode on Line 6 correctly enables multiple worker pods to access the shared streams directory concurrently. The migration to createStaticPv is consistent with other PV templates in this PR.

Note: Node affinity verification comment applies (see compression-scheduler-logs-pv.yaml review).

tools/deployment/package-helm/templates/reducer-logs-pv.yaml (1)

1-8: LGTM: Consistent PV migration pattern.

The reducer logs PV follows the same migration pattern as other component PVs in this PR. The change is consistent and appropriate.

Note: Node affinity verification comment applies (see compression-scheduler-logs-pv.yaml review).

tools/deployment/package-helm/templates/query-scheduler-logs-pv.yaml (1)

1-8: LGTM: Consistent PV migration pattern.

The query-scheduler logs PV follows the same migration pattern as other component PVs in this PR.

Note: Node affinity verification comment applies (see compression-scheduler-logs-pv.yaml review).

tools/deployment/package-helm/templates/compression-worker-logs-pv.yaml (1)

1-8: LGTM: Consistent PV migration pattern.

The compression-worker logs PV follows the same migration pattern as other component PVs in this PR.

Note: Node affinity verification comment applies (see compression-scheduler-logs-pv.yaml review).

tools/deployment/package-helm/templates/compression-scheduler-logs-pv.yaml (1)

1-8: Node affinity for single-node deployments is correctly preserved in the new helper.

The createStaticPv helper (lines 142–155 in _helpers.tpl) implements the required conditional logic based on the distributed flag. When distributed=false (single-node mode), it applies nodeAffinity with a required selector targeting nodes with the node-role.kubernetes.io/control-plane label. When distributed=true, it uses hostPath without node affinity. The migration maintains the intended behavior of binding PVs to the control-plane node in single-node deployments.
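In single-node mode, the rendered output of createStaticPv would look roughly like the manifest below. This is a sketch based on the review's description of the helper: the nodeAffinity selector on node-role.kubernetes.io/control-plane is stated above, while the PV name, capacity, and path are illustrative placeholders.

```yaml
# Sketch of a PV as createStaticPv might render it with distributed=false.
# Name, size, and path are hypothetical; the affinity term is per the review.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-logs-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: local-storage
  local:
    path: /var/lib/clp/logs            # assumed host path
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists         # pin PV to the control-plane node
```

When distributed=true, the same helper would instead emit a hostPath volume with no nodeAffinity block, relying on identical shared-storage mounts across nodes.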

tools/deployment/package-helm/templates/storage-class.yaml (1)

1-10: LGTM: StorageClass configuration is appropriate for local storage.

The WaitForFirstConsumer binding mode and no-provisioner setting are correct for static local PVs, ensuring volumes bind only after pod scheduling. The conditional rendering when storageClassName equals "local-storage" aligns with the single-node deployment mode. In distributed mode (distributed=true), users are required to provide a different storageClassName (documented as NFS/CephFS) with a pre-existing StorageClass in the cluster, preventing accidental use of the local-storage configuration.
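For reference, a StorageClass with the no-provisioner and WaitForFirstConsumer settings described above takes this standard Kubernetes form (metadata.name per the chart's local-storage convention; the chart's template additionally wraps this in the conditional on storageClassName):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # static PVs only; no dynamic provisioning
volumeBindingMode: WaitForFirstConsumer     # delay binding until a pod is scheduled
```

WaitForFirstConsumer matters here because local PVs are node-bound: binding before scheduling could pin a claim to a volume on a node where the pod can never run.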

tools/deployment/package-helm/templates/redis-data-pv.yaml (1)

1-8: LGTM! PV migration aligns with distributed deployment support.

The switch from createLocalPv to createStaticPv and removal of the nodeRole parameter enables this PersistentVolume to be provisioned in distributed mode where pods aren't constrained to the control-plane node. This change is consistent with the PR's objectives for multi-node deployment support.

tools/deployment/package-helm/templates/api-server-logs-pvc.yaml (1)

1-9: LGTM! Properly gated PVC creation for API server logs.

The conditional block correctly creates the PVC only when the API server feature is enabled. The capacity (5Gi) and access mode (ReadWriteOnce) are appropriate for log storage.

tools/deployment/package-helm/test.sh (3)

59-59: LGTM! Added support for staged archives and streams.

The addition of staged-archives and staged-streams directories aligns with the PR's support for S3 storage with staged volumes, as mentioned in the AI summary.


63-63: LGTM! API server log directory added.

The api_server log directory creation is consistent with the new API server PV/PVC templates introduced in this PR.


98-100: LGTM! Port mapping for API server added.

The port mapping for 30301 is consistent with the API server deployment support, as configured in values.yaml (line 129).

tools/deployment/package-helm/templates/compression-worker-tmp-pv.yaml (1)

1-8: LGTM! Consistent PV migration for compression worker.

The migration to createStaticPv and removal of nodeRole follows the same pattern as other PV templates, enabling distributed deployment support.

tools/deployment/package-helm/templates/results-cache-logs-pv.yaml (1)

1-8: LGTM! Consistent PV migration for results cache logs.

The change follows the established pattern for distributed deployment support.

tools/deployment/package-helm/templates/queue-logs-pv.yaml (1)

1-8: LGTM! Consistent PV migration for queue logs.

The change follows the established pattern for distributed deployment support.

tools/deployment/package-helm/templates/redis-logs-pv.yaml (1)

1-8: LGTM! Consistent PV migration for Redis logs.

The change follows the established pattern for distributed deployment support.

tools/deployment/package-helm/templates/results-cache-data-pv.yaml (1)

1-8: LGTM! Consistent PV migration for results cache data.

The change follows the established pattern for distributed deployment support.

tools/deployment/package-helm/templates/database-logs-pv.yaml (1)

1-8: LGTM! Clean migration to static PV provisioning.

The migration from clp.createLocalPv to clp.createStaticPv and removal of the nodeRole parameter aligns with the PR's objective to support distribution-aware PV provisioning. This change is consistent with the broader pattern applied across all PV templates.

tools/deployment/package-helm/templates/database-data-pv.yaml (1)

1-8: LGTM! Consistent with PV migration pattern.

The migration to clp.createStaticPv is consistent with the broader refactoring effort and supports the new distribution-aware provisioning model.

tools/deployment/package-helm/templates/api-server-service.yaml (1)

1-18: LGTM! Well-structured service configuration.

The NodePort service configuration is appropriate for external API server access. The use of named targetPort and conditional rendering based on .Values.clpConfig.api_server follows Helm best practices.

tools/deployment/package-helm/templates/api-server-logs-pv.yaml (1)

1-10: LGTM! Proper static PV configuration for API server logs.

The static PV configuration follows the new pattern established in this PR and is appropriately gated by the .Values.clpConfig.api_server flag. The capacity and access mode are suitable for log storage.

tools/deployment/package-helm/templates/compression-scheduler-user-logs-pv.yaml (1)

1-8: LGTM! Consistent PV migration.

The migration to clp.createStaticPv is consistent with the chart-wide refactoring and properly removes the nodeRole parameter.

tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml (1)

1-10: PV configuration consistent with paired PVC, but verify access mode for multi-replica scenarios.

The static PV configuration is consistent with its paired PVC (query-worker-staged-streams-pvc.yaml): both use 50Gi capacity and ReadWriteOnce access mode. However, the ReadWriteOnce access mode may be problematic in multi-replica query-worker deployments (see comment on the PVC).

Ensure the access mode aligns with the deployment strategy for query-workers in distributed mode.

tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml (1)

1-10: LGTM!

The conditional PV creation for staged archives when using S3 storage follows the established pattern in the chart. The static analysis warnings are false positives—YAMLlint doesn't understand Helm's Go templating syntax, which is only valid YAML after rendering.

tools/deployment/package-helm/templates/query-worker-logs-pv.yaml (1)

1-8: LGTM!

The migration from clp.createLocalPv to clp.createStaticPv and removal of nodeRole is consistent with the broader chart refactoring for multi-node support.

tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml (1)

1-12: LGTM!

The retention period conditional check and migration to clp.createStaticPv are correctly implemented, consistent with the broader chart updates.

tools/deployment/package-helm/templates/webui-deployment.yaml (1)

83-126: LGTM!

The conditional volume mounting for AWS config, logs input, and streams is correctly implemented and follows the established pattern across other deployment templates.

tools/deployment/package-helm/templates/api-server-deployment.yaml (1)

1-59: Well-structured API server deployment.

The deployment follows established patterns: proper use of init containers for dependency ordering, centralized probe timing templates, and consistent security context configuration.

tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml (1)

79-113: LGTM!

The conditional volume mounting for AWS config and logs input aligns with the storage-type-aware approach used across other deployment templates in this PR.

tools/deployment/package-helm/templates/compression-worker-deployment.yaml (3)

9-9: Good parameterization for replica scaling.

Making replicas configurable via compressionWorker.replicas enables horizontal scaling of compression workers in multi-node deployments.


20-23: LGTM!

The scheduling configuration helper enables fine-grained control over pod placement via nodeSelector, affinity, tolerations, and topologySpreadConstraints—essential for the dedicated worker pool scenarios.


66-128: LGTM!

The conditional volume mounting correctly differentiates between filesystem (archives) and S3 (staged-archives) storage backends, with appropriate handling for AWS config and logs input volumes.

tools/deployment/package-helm/test-multi-dedicated.sh (1)

34-72: LGTM!

The dynamic Kind cluster configuration with control-plane and configurable worker nodes is well-structured. The heredoc approach for generating the cluster config is clean and maintainable.

tools/deployment/package-helm/templates/query-worker-deployment.yaml (4)

9-9: LGTM!

Configurable replica count enables proper multi-node deployment scaling.


20-23: LGTM!

The scheduling config helper integration enables per-component scheduling control and auto-tolerates control-plane taints in single-node mode.


57-83: LGTM!

The conditional volume mounts correctly handle:

  • Config mount with readOnly flag
  • Archives mount only for fs storage type
  • AWS config when configured
  • Streams for fs type vs staged-streams for s3 type

Based on learnings, the streams directory remains read-write as required for query-worker output.


101-126: LGTM!

Volume definitions align correctly with the conditional volumeMounts above.

tools/deployment/package-helm/templates/shared-data-archives-pv.yaml (1)

1-8: LGTM!

The migration to clp.createStaticPv properly delegates distribution-aware PV creation to the helper, which handles local vs hostPath volume types based on the distributed flag.

tools/deployment/package-helm/test-multi-shared.sh (1)

1-81: LGTM overall!

The test script correctly:

  • Sources common utilities for directory setup and sample download
  • Creates a multi-node Kind cluster with shared storage mounts
  • Installs the Helm chart with distributed=true and configurable replica counts
  • Waits for pods to be ready with proper timeout handling

The structure aligns well with the test-multi-dedicated.sh pattern.

tools/deployment/package-helm/templates/garbage-collector-deployment.yaml (3)

1-3: LGTM!

The conditional rendering based on retention periods is correct—the garbage collector deployment is only needed when archive or cache retention is configured.


64-91: LGTM!

Conditional volume mounts correctly gate archives and streams on their respective fs storage types, with proper AWS config support when configured.


103-124: LGTM!

Volume definitions align with the conditional mounts. The tmp emptyDir is appropriate for transient log data.

tools/deployment/package-helm/.test-common.sh (3)

72-83: LGTM!

The SAMPLE_DOWNLOAD_PID global is intentionally exported for use by sourcing scripts (e.g., test-multi-shared.sh line 78). Shellcheck SC2034 is a false positive in this context.


18-55: LGTM!

The polling loop correctly:

  • Waits for all jobs to complete
  • Waits for non-job pods to be ready
  • Uses the built-in SECONDS counter for timeout tracking
  • Provides clear status output
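The `SECONDS`-based timeout pattern described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the readiness check is a placeholder flag file standing in for the real `kubectl` queries, and all names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

TIMEOUT=30
flag="$(mktemp -u)"  # path that does not exist yet; stands in for "pods ready"

all_ready() {
    # Placeholder for the real checks, e.g. kubectl get jobs / kubectl get pods
    [[ -e "$flag" ]]
}

SECONDS=0  # bash built-in: counts elapsed seconds since assignment
( sleep 2; touch "$flag" ) &  # simulate pods becoming ready after 2s
while ! all_ready; do
    if (( SECONDS >= TIMEOUT )); then
        echo "Timed out after ${TIMEOUT}s" >&2
        exit 1
    fi
    sleep 1
done
echo "Ready after ${SECONDS}s"
```

Because `SECONDS` tracks wall-clock time rather than loop iterations, the timeout stays accurate even if an individual check call is slow.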

57-68: LGTM!

Directory creation covers all required CLP data and log paths for testing.

tools/deployment/package-helm/values.yaml (4)

21-27: LGTM!

Clear documentation of the two deployment modes and their implications for storage and scheduling.


32-62: LGTM!

The commented scheduling examples provide useful guidance for users configuring node selectors, tolerations, and topology spread constraints for worker pods.


64-68: LGTM!

The storage class configuration with clear documentation about the default local-storage class and external class usage is helpful.


127-133: LGTM!

API server configuration with sensible defaults for query result limits and polling backoff.

tools/deployment/package-helm/templates/_helpers.tpl (2)

128-156: LGTM!

The distribution-aware PV creation logic correctly:

  • Uses hostPath with DirectoryOrCreate for multi-node shared storage
  • Uses local volume with control-plane node affinity for single-node mode
  • Configures storageClassName from values
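A hedged sketch of that branching, with field names assumed from the summary above rather than taken from the helper's actual code:

```
{{- if .root.Values.distributed }}
  hostPath:
    path: {{ .hostPath | quote }}
    type: "DirectoryOrCreate"
{{- else }}
  local:
    path: {{ .hostPath | quote }}
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "node-role.kubernetes.io/control-plane"
              operator: "Exists"
{{- end }}
```

In single-node mode the `local` volume plus the required node affinity pins the PV to the control-plane node; in distributed mode the `hostPath` relies on the same path existing (via shared storage) on every node.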

254-277: LGTM!

Clean helper templates for AWS config volume and volumeMount. Based on learnings, the hardcoded AWS config path approach aligns with the container orchestration design.

tools/deployment/package-helm/templates/configmap.yaml (5)

9-19: LGTM!

The API server config renders correctly with fallback to null when not configured. Based on learnings, the hardcoded host: "localhost" is acceptable since the API server is only accessed externally via NodePort.


22-57: LGTM!

The archive_output storage branching correctly handles:

  • fs type with directory
  • s3 type with staging directory, S3 config, and optional AWS authentication (profile or credentials with optional session token)

93-116: LGTM!

The logs_input configuration properly supports both fs and s3 types with appropriate AWS authentication options.


152-191: LGTM!

The stream_output configuration mirrors the archive_output S3 structure correctly, with proper staging directory and S3 config support.


278-292: LGTM!

The webui-server-settings correctly populates S3-related fields for stream files when using S3 storage, with proper null defaults for non-S3 configurations.

Comment on lines 317 to 344
{{- define "clp.createSchedulingConfigs" -}}
{{- $componentConfig := index .root.Values .component | default dict -}}
{{- $scheduling := $componentConfig.scheduling | default dict -}}
{{- $tolerations := $scheduling.tolerations | default list -}}
{{- if not .root.Values.distributed -}}
{{- $tolerations = append $tolerations (dict
"key" "node-role.kubernetes.io/control-plane"
"operator" "Exists"
"effect" "NoSchedule"
) -}}
{{- end -}}
{{- with $scheduling.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 2 }}
{{- end }}
{{- with $scheduling.affinity }}
affinity:
{{- toYaml . | nindent 2 }}
{{- end }}
{{- with $tolerations }}
tolerations:
{{- toYaml . | nindent 2 }}
{{- end }}
{{- with $scheduling.topologySpreadConstraints }}
topologySpreadConstraints:
{{- toYaml . | nindent 2 }}
{{- end }}
{{- end }}{{/* define "clp.createSchedulingConfigs" */}}

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🌐 Web query:

Do duplicate Kubernetes pod tolerations cause problems

💡 Result:

Short answer: No — duplicate tolerations are harmless but redundant. Kubernetes treats tolerations as a list used to filter node taints; duplicate entries don't change scheduling or eviction behavior (they're simply redundant). [1][2]

Notes/caveats:

  • A toleration with/without effect or tolerationSeconds can change behavior, so duplicates that differ matter; identical duplicates do not. [1][3]
  • Having redundant entries is confusing for humans and config management; best to deduplicate manifests for clarity. [2]

Sources:
[1] Kubernetes docs — Taints and Tolerations.
[2] Community/StackOverflow discussion on tolerations behavior.
[3] Kubernetes docs on NoExecute/tolerationSeconds behavior.


Consider deduplicating control-plane tolerations for clarity.

When distributed=false, the template appends a node-role.kubernetes.io/control-plane toleration. If a user also specifies this toleration in their scheduling config, duplicates will occur. While Kubernetes handles duplicate tolerations without issue, deduplicating the manifest or documenting this overlap would improve clarity and avoid confusion during configuration management.

🤖 Prompt for AI Agents
In tools/deployment/package-helm/templates/_helpers.tpl around lines 317 to 344,
the template unconditionally appends a control-plane toleration when
distributed=false which can duplicate a user-provided toleration; update the
template to check existing $tolerations for an entry with key
"node-role.kubernetes.io/control-plane" (and matching operator/effect if you
prefer) before appending, so only append the control-plane toleration when no
equivalent toleration already exists, or alternatively perform a simple dedupe
pass over $tolerations after append to remove duplicates.
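A minimal sketch of such a guard, reusing the helper's existing variable names (the range-based lookup is illustrative, not code from the chart):

```
{{- $tolerations := $scheduling.tolerations | default list -}}
{{- if not .root.Values.distributed -}}
  {{- $hasControlPlane := false -}}
  {{- range $tolerations -}}
    {{- if eq .key "node-role.kubernetes.io/control-plane" -}}
      {{- $hasControlPlane = true -}}
    {{- end -}}
  {{- end -}}
  {{- if not $hasControlPlane -}}
    {{- $tolerations = append $tolerations (dict
      "key" "node-role.kubernetes.io/control-plane"
      "operator" "Exists"
      "effect" "NoSchedule"
    ) -}}
  {{- end -}}
{{- end -}}
```

The `=` assignment inside `range` (rather than `:=`) is what lets the flag set in the loop body survive into the outer scope.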

Comment on lines 60 to 76
volumeMounts:
- name: {{ include "clp.volumeName" (dict
"component_category" "api-server"
"name" "logs"
) | quote }}
mountPath: "/var/log/api_server"
- name: "config"
mountPath: "/etc/clp-config.yaml"
subPath: "clp-config.yaml"
readOnly: true
{{- if eq .Values.clpConfig.stream_output.storage.type "fs" }}
- name: {{ include "clp.volumeName" (dict
"component_category" "shared-data"
"name" "streams"
) | quote }}
mountPath: "/var/data/streams"
{{- end }}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, check the specific lines mentioned in the api-server-deployment.yaml
echo "=== API Server Deployment (lines 60-76) ==="
sed -n '60,76p' tools/deployment/package-helm/templates/api-server-deployment.yaml

echo -e "\n=== Full api-server-deployment.yaml to understand overall structure ==="
wc -l tools/deployment/package-helm/templates/api-server-deployment.yaml

echo -e "\n=== Search for stream/s3/aws references in api-server-deployment.yaml ==="
grep -i "stream\|s3\|aws" tools/deployment/package-helm/templates/api-server-deployment.yaml || echo "No matches found"

echo -e "\n=== Compare with webui-deployment.yaml AWS config mounts ==="
grep -A2 "awsConfigVolume\|AWS_" tools/deployment/package-helm/templates/webui-deployment.yaml | head -20

echo -e "\n=== Check all volumeMounts in api-server-deployment.yaml ==="
sed -n '1,120p' tools/deployment/package-helm/templates/api-server-deployment.yaml | grep -A100 "volumeMounts:"

Repository: y-scope/clp

Length of output: 3764


🏁 Script executed:

#!/bin/bash
# Check the full api-server-deployment.yaml structure
echo "=== Full api-server-deployment.yaml ==="
cat -n tools/deployment/package-helm/templates/api-server-deployment.yaml

echo -e "\n=== Check if API server code references stream storage access ==="
find . -type f -name "*.py" -o -name "*.cpp" -o -name "*.hpp" | xargs grep -l "api_server" | head -5

echo -e "\n=== Search for stream storage access in codebase ==="
rg -l "stream.*storage|storage.*stream" --type=py -i | head -10

Repository: y-scope/clp

Length of output: 5343


🏁 Script executed:

#!/bin/bash
# Find API server source code
echo "=== Search for api_server source code ==="
find . -type f \( -name "*.rs" -o -name "*.py" \) | xargs grep -l "fn main\|api_server\|api-server" | grep -E "(src|components)" | head -10

# Look for extract_stream_task to understand stream access patterns
echo -e "\n=== Check extract_stream_task.py to understand stream access ==="
head -50 components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py

# Check if API server is Python or Rust
echo -e "\n=== Search for api_server references in codebase ==="
rg -n "api_server|api-server" --type=py -C2 components/ | head -40

# Look at clp_config.py to understand configuration structure
echo -e "\n=== Check stream_output configuration in clp_config.py ==="
grep -A10 "stream_output" components/clp-py-utils/clp_py_utils/clp_config.py | head -30

Repository: y-scope/clp

Length of output: 7333


🏁 Script executed:

#!/bin/bash
# Examine the API server Rust code
echo "=== API Server main.rs ==="
cat components/api-server/src/bin/api_server.rs | head -100

echo -e "\n=== Check what storage access the API server performs ==="
rg -n "stream|storage|s3" --type=rust components/api-server/src/ | head -20

# Check if API server reads config file and S3 paths
echo -e "\n=== Check config handling in API server ==="
rg -n "config\|Config" --type=rust components/api-server/src/bin/api_server.rs | head -20

Repository: y-scope/clp

Length of output: 5447


Add AWS configuration for S3 stream access.

The API server's Rust code supports S3 stream storage (see client.rs lines 197–205 where it calls fetch_results_from_s3), but the deployment only mounts the streams volume when stream_output.storage.type is "fs". When using S3 storage, the API server lacks AWS credentials and configuration needed to access the S3 bucket. Add AWS environment variables and/or config volume mounts (similar to the webui deployment) to handle the S3 case.

🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 61-61: too many spaces inside braces

(braces)


[error] 64-64: too many spaces inside braces

(braces)

🤖 Prompt for AI Agents
In tools/deployment/package-helm/templates/api-server-deployment.yaml around
lines 60 to 76, the deployment only mounts the streams volume for fs storage but
does not provide AWS credentials/config when
.Values.clpConfig.stream_output.storage.type == "s3"; add the same AWS
environment variables and/or config volumeMounts used in the webui deployment
(e.g., AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION and mounting kube
secrets/config for shared AWS config if present) gated by a conditional for the
"s3" storage type so the API server has credentials and config to call
fetch_results_from_s3.

Comment on lines 1 to 9
{{- if eq .Values.clpConfig.archive_output.storage.type "s3" }}
{{- include "clp.createPvc" (dict
"root" .
"component_category" "compression-worker"
"name" "staged-archives"
"capacity" "50Gi"
"accessModes" (list "ReadWriteOnce")
) }}
{{- end }}

🧹 Nitpick | 🔵 Trivial

Consider making staged-archives capacity configurable.

The conditional PVC creation for S3 storage is correct. However, the hardcoded 50Gi capacity on Line 6 may not suit all deployment sizes. Archive staging requirements can vary significantly based on workload characteristics.

🔎 Proposed refactor to make capacity configurable

Add a configurable capacity value in values.yaml:

compressionWorker:
  stagedArchives:
    capacity: "50Gi"

Then update the template:

 {{- if eq .Values.clpConfig.archive_output.storage.type "s3" }}
 {{- include "clp.createPvc" (dict
   "root" .
   "component_category" "compression-worker"
   "name" "staged-archives"
-  "capacity" "50Gi"
+  "capacity" .Values.compressionWorker.stagedArchives.capacity
   "accessModes" (list "ReadWriteOnce")
 ) }}
 {{- end }}

Would you like me to open an issue to track making PVC capacities configurable across the chart?

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 8-8: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)

🤖 Prompt for AI Agents
In
tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml
around lines 1 to 9, the PVC capacity is hardcoded to "50Gi"; make this value
configurable via values.yaml and reference that Helm value in the template. Add
a values.yaml entry (e.g. compressionWorker.stagedArchives.capacity with a
default "50Gi") and update the template to use
.Values.compressionWorker.stagedArchives.capacity (with a fallback default if
desired) instead of the literal "50Gi" so operators can override capacity per
deployment.

Comment on lines 1 to 3
{{- $archiveRetention := .Values.clpConfig.archive_output.retention_period }}
{{- $cacheRetention := .Values.clpConfig.results_cache.retention_period }}
{{- if or $archiveRetention $cacheRetention }}

🧹 Nitpick | 🔵 Trivial

LGTM! Improved conditional logic readability.

The introduction of local variables ($archiveRetention and $cacheRetention) enhances readability while maintaining the same conditional logic. The PVC is appropriately created when either retention period is configured.

🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 2-2: too many spaces inside braces

(braces)


[error] 3-3: too many spaces inside braces

(braces)

🤖 Prompt for AI Agents
tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml around
lines 1 to 3: no change required — the refactor introducing local variables
$archiveRetention and $cacheRetention keeps the conditional logic intact and
improves readability; leave the new variable assignments and the existing if-or
conditional as-is.

Comment on lines 1 to 9
{{- if eq .Values.clpConfig.stream_output.storage.type "s3" }}
{{- include "clp.createPvc" (dict
"root" .
"component_category" "query-worker"
"name" "staged-streams"
"capacity" "50Gi"
"accessModes" (list "ReadWriteOnce")
) }}
{{- end }}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for query-worker replica configuration in Helm values
rg -n "queryWorker.*replicas" --type=yaml

Repository: y-scope/clp

Length of output: 172


🏁 Script executed:

#!/bin/bash
# Check query-worker deployment template for staged-streams volume mounting
cat -n tools/deployment/package-helm/templates/query-worker-deployment.yaml | head -100

Repository: y-scope/clp

Length of output: 4728


🏁 Script executed:

#!/bin/bash
# Search for all references to staged-streams in the Helm templates
rg -n "staged-streams" tools/deployment/package-helm/templates/ --type=yaml

Repository: y-scope/clp

Length of output: 883


🏁 Script executed:

#!/bin/bash
# Check the default values for queryWorker replicas
rg -A 5 "queryWorker:" tools/deployment/package-helm/values.yaml | head -30

Repository: y-scope/clp

Length of output: 196


🏁 Script executed:

#!/bin/bash
# Check the rest of the query-worker deployment volumes section
sed -n '100,140p' tools/deployment/package-helm/templates/query-worker-deployment.yaml

Repository: y-scope/clp

Length of output: 1095


🏁 Script executed:

#!/bin/bash
# Check for any validation or comments about replica scaling in values.yaml
rg -B 5 -A 10 "queryWorker:" tools/deployment/package-helm/values.yaml

Repository: y-scope/clp

Length of output: 484


🏁 Script executed:

#!/bin/bash
# Check for any documentation or comments about S3 storage scaling requirements
rg -i "s3.*replica|replica.*s3" tools/deployment/package-helm/ --type=yaml --type=md

Repository: y-scope/clp

Length of output: 37


🏁 Script executed:

#!/bin/bash
# Check if there's any StatefulSet configuration for query-worker
fd "query-worker" tools/deployment/package-helm/templates/ --type f | xargs grep -l "StatefulSet"

Repository: y-scope/clp

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check the default storage type for stream_output
rg -A 5 "stream_output:" tools/deployment/package-helm/values.yaml | head -20

Repository: y-scope/clp

Length of output: 280


ReadWriteOnce access mode incompatible with multi-replica query-worker deployments using S3 storage.

The staged-streams PVC uses ReadWriteOnce access mode, which prevents multiple query-worker pods from mounting the volume simultaneously. This blocks multi-replica deployments when S3 storage is configured (stream_output.storage.type = "s3"). The Helm chart allows configuring queryWorker.replicas > 1 without preventing this misconfiguration or providing an alternative staging volume strategy.

For deployments using S3 storage with multiple query-worker replicas:

  • Consider ReadWriteMany access mode if your storage backend supports it (e.g., NFS, CephFS)
  • Alternatively, use a StatefulSet with volumeClaimTemplates to provision per-replica staging volumes
🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 8-8: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)

🤖 Prompt for AI Agents
In tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
around lines 1-9 the PVC is created with accessMode ReadWriteOnce which prevents
multiple query-worker replicas from mounting the same volume when
stream_output.storage.type == "s3"; update the template to detect multi-replica
deployments (check .Values.queryWorker.replicas > 1) and: either switch
accessModes to ReadWriteMany when the cluster/storage supports it (add a guard
and document that backend must support RWX), or skip creating a shared PVC and
instead use per-replica provisioning (move staging to a StatefulSet with
volumeClaimTemplates or generate per-pod PVC names) so each replica gets its own
volume; also add a values.yaml flag or validation and update chart docs to
explain RWX requirement or the per-replica alternative.
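For the per-replica alternative, a `volumeClaimTemplates` sketch along these lines would give each query-worker replica its own staging volume. Names, image, and sizes here are hypothetical, not the chart's actual manifests:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: query-worker
spec:
  replicas: 2
  serviceName: query-worker
  selector:
    matchLabels:
      app: query-worker
  template:
    metadata:
      labels:
        app: query-worker
    spec:
      containers:
        - name: query-worker
          image: example/query-worker  # placeholder image
          volumeMounts:
            - name: staged-streams
              mountPath: /var/data/staged-streams
  volumeClaimTemplates:
    - metadata:
        name: staged-streams
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```

Each replica gets a stable, independently provisioned PVC (`staged-streams-query-worker-0`, `-1`, ...), so `ReadWriteOnce` remains safe at any replica count.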

Comment on lines 62 to 70
{{- with .Values.clpConfig.stream_output.storage }}
{{- if and (eq .type "s3") (eq .s3_config.aws_authentication.type "credentials") }}
- name: "AWS_ACCESS_KEY_ID"
value: {{ .s3_config.aws_authentication.credentials.access_key_id | quote }}
- name: "AWS_SECRET_ACCESS_KEY"
value: {{ .s3_config.aws_authentication.credentials.secret_access_key | quote }}
{{- end }}{{/* if and (eq .type "s3")
(eq .s3_config.aws_authentication.type "credentials") */}}
{{- end }}{{/* with .Values.clpConfig.stream_output.storage */}}

🧹 Nitpick | 🔵 Trivial

Consider using Kubernetes Secrets for AWS credentials.

Injecting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY directly from values exposes credentials in the rendered pod spec and Helm release metadata. Consider referencing a Kubernetes Secret instead:

🔎 Suggested approach using Secrets
             {{- if and (eq .type "s3") (eq .s3_config.aws_authentication.type "credentials") }}
             - name: "AWS_ACCESS_KEY_ID"
-              value: {{ .s3_config.aws_authentication.credentials.access_key_id | quote }}
+              valueFrom:
+                secretKeyRef:
+                  name: {{ include "clp.fullname" $ }}-aws-credentials
+                  key: "access_key_id"
             - name: "AWS_SECRET_ACCESS_KEY"
-              value: {{ .s3_config.aws_authentication.credentials.secret_access_key | quote }}
+              valueFrom:
+                secretKeyRef:
+                  name: {{ include "clp.fullname" $ }}-aws-credentials
+                  key: "secret_access_key"
             {{- end }}

This requires creating a corresponding Secret resource in the chart.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tools/deployment/package-helm/templates/webui-deployment.yaml around lines 62
to 70, the chart currently injects AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
directly from values into the pod spec; change this to reference a Kubernetes
Secret instead. Create a Secret in the chart (templated from values or
referenced by name), and replace the direct value env entries with env entries
that use valueFrom.secretKeyRef (or envFrom) pointing to that Secret and the
appropriate keys; ensure the Secret name and key names are configurable via
values.yaml and update chart README/values.yaml to document how to supply the
credentials securely.

Comment on lines 1 to 11
#!/usr/bin/env bash

# Multi-node cluster test with dedicated worker nodes for each worker type
# Demonstrates nodeSelector scheduling with separate node pools
#
# To clean up after running:
# kind delete cluster --name "${CLUSTER_NAME}"
# rm -rf /tmp/clp

script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${script_dir}/.test-common.sh"

🧹 Nitpick | 🔵 Trivial

Add set -euo pipefail for script robustness.

The script lacks strict error handling. Adding set -euo pipefail after the shebang would cause the script to fail fast on errors, undefined variables, or pipeline failures.

🔎 Suggested fix
 #!/usr/bin/env bash
+set -euo pipefail
 
 # Multi-node cluster test with dedicated worker nodes for each worker type
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 #!/usr/bin/env bash
+set -euo pipefail

 # Multi-node cluster test with dedicated worker nodes for each worker type
 # Demonstrates nodeSelector scheduling with separate node pools
 #
 # To clean up after running:
 # kind delete cluster --name "${CLUSTER_NAME}"
 # rm -rf /tmp/clp

 script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 source "${script_dir}/.test-common.sh"
🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 11-11: Not following: ./.test-common.sh was not specified as input (see shellcheck -x).

(SC1091)

🤖 Prompt for AI Agents
In tools/deployment/package-helm/test-multi-dedicated.sh around lines 1 to 11,
the script is missing strict error handling; add a line with "set -euo pipefail"
immediately after the shebang so the script fails fast on errors, treats unset
variables as errors, and makes pipelines fail if any command fails; ensure this
line is placed before any variable usage or sourcing so it applies to the whole
script.

junhaoliao and others added 2 commits January 7, 2026 15:22
# Conflicts:
#	tools/deployment/package-helm/Chart.yaml
#	tools/deployment/package-helm/templates/api-server-deployment.yaml
#	tools/deployment/package-helm/templates/api-server-logs-pv.yaml
#	tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml
#	tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml
#	tools/deployment/package-helm/templates/configmap.yaml
#	tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
#	tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml
#	tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml
#	tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
#	tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
#	tools/deployment/package-helm/templates/shared-data-archives-pv.yaml
#	tools/deployment/package-helm/templates/shared-data-streams-pv.yaml
#	tools/deployment/package-helm/test.sh
#	tools/deployment/package-helm/values.yaml
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/deployment/package-helm/templates/shared-data-streams-pv.yaml (1)

1-9: Ensure hostPath with ReadWriteMany is properly documented or validated.

In distributed mode, this template uses hostPath with the ReadWriteMany access mode. Standard Kubernetes hostPath volumes don't support ReadWriteMany; this only works if the path points to actual shared storage (NFS/CephFS) mounted at the same location on all nodes.

The clp.createStaticPv template (lines 142–145) applies hostPath when distributed=true but does not validate against using ReadWriteMany with hostPath. Consider adding a validation check or clearer documentation in the chart README to prevent misconfigurations where users might specify ReadWriteMany without providing shared storage.

🤖 Fix all issues with AI agents
In @tools/deployment/package-helm/values.yaml:
- Around line 21-27: Add a Helm-side validation that checks .Values.distributed
and fails release if any worker replica is >1 in single-node mode: when
.Values.distributed == false, assert .Values.compressionWorker.replicas == 1 and
.Values.queryWorker.replicas == 1 (or any other local-PV-backed worker replicas)
and call the Helm fail function with a clear message. Implement this as a small
template (e.g., templates/validate-values.yaml or in templates/_helpers.tpl)
that runs at render time using the Helm fail function and references the unique
keys distributed, compressionWorker.replicas, and queryWorker.replicas so users
cannot set inconsistent values.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5922b4c and e79a9b0.

📒 Files selected for processing (6)
  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml
  • tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
  • tools/deployment/package-helm/templates/shared-data-archives-pv.yaml
  • tools/deployment/package-helm/templates/shared-data-streams-pv.yaml
  • tools/deployment/package-helm/values.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-04T03:31:55.239Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1690
File: tools/deployment/package/docker-compose-all.yaml:424-427
Timestamp: 2025-12-04T03:31:55.239Z
Learning: In tools/deployment/package/docker-compose-all.yaml, the query-worker service writes to /var/data/streams (CLP_STREAM_OUTPUT_DIR_HOST mount), so this directory must remain read-write and should not be mounted with :ro flag.

Applied to files:

  • tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: check-generated
🔇 Additional comments (6)
tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml (1)

2-9: Template correctly handles both deployment modes.

The clp.createStaticPv template properly implements the distributed flag:

  • Distributed mode: uses hostPath with DirectoryOrCreate
  • Single-node mode: uses local storage with control-plane node affinity

The change appropriately removes the nodeRole parameter since the new template handles node selection internally. All other PV templates have been consistently migrated to this template.
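From that description, the source-selection half of such a helper plausibly reads like the sketch below. The template name suffix, the `.distributed`/`.path` parameters, and the exact control-plane affinity expression are assumptions, since the real `_helpers.tpl` is not shown here.

```yaml
{{/*
  Sketch only: dual-mode PV source selection. Parameter names are
  illustrative, not the chart's actual definition.
*/}}
{{- define "clp.createStaticPvSketch" -}}
{{- if .distributed }}
hostPath:
  path: {{ .path }}
  type: DirectoryOrCreate
{{- else }}
local:
  path: {{ .path }}
nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
{{- end }}
{{- end -}}
```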

tools/deployment/package-helm/values.yaml (2)

32-63: LGTM! Well-structured worker configuration.

The default values are safe for both deployment modes, and the commented scheduling examples provide clear guidance for users implementing distributed deployments. The inclusion of nodeSelector, tolerations, and topologySpreadConstraints covers the essential scheduling primitives.
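As an illustration of those primitives together in a distributed-mode override, a values fragment might look like the following. The `scheduling` key layout is assumed from the review, and the `yscope.io/nodeType` label value mirrors the one used in the test scripts; neither is copied verbatim from the chart.

```yaml
compressionWorker:
  replicas: 2
  scheduling:
    nodeSelector:
      yscope.io/nodeType: compression
    tolerations:
      - key: dedicated            # illustrative taint key
        operator: Equal
        value: clp-workers
        effect: NoSchedule
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: compression-worker   # assumed pod label
```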


64-69: StorageClass template correctly implements documented behaviour.

The storage-class.yaml template properly creates a StorageClass with WaitForFirstConsumer binding mode only when storageClassName is set to "local-storage", exactly as described in the values.yaml comment. The conditional logic and configuration are correct.
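The described behaviour corresponds to a conditional template along these lines — a sketch of the documented logic, not the file's verbatim contents:

```yaml
{{- if eq .Values.storageClassName "local-storage" }}
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
# No dynamic provisioning; the chart creates PVs statically.
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
{{- end }}
```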

tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml (1)

4-11: Verify node affinity for single-node mode.

The removal of nodeRole: "control-plane" is consistent with the refactoring, but in single-node mode (distributed=false), the PV must still be bound to the control-plane node. Ensure the clp.createStaticPv template adds appropriate nodeAffinity when distributed=false.

This verification can use the same script from the previous comment to inspect the template definition in _helpers.tpl.

tools/deployment/package-helm/templates/shared-data-archives-pv.yaml (1)

1-9: Same ReadWriteMany + hostPath concern as shared-data-streams-pv.yaml.

This PV uses ReadWriteMany with hostPath, requiring actual shared storage (NFS/CephFS) in distributed mode. Since archives are the primary data store accessed by multiple workers, ensuring proper shared storage setup is critical for data integrity.

Refer to the verification script in the comment for shared-data-streams-pv.yaml to check the template implementation.

tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml (1)

1-9: LGTM! Appropriate access mode for S3 staging.

The use of ReadWriteOnce is correct for per-worker S3 staging storage. The removal of nodeRole: "worker" aligns with the distributed deployment model where workers can be scheduled on any node.

Based on learnings, query-worker requires write access to streams storage, which is satisfied by the staging directory configuration.

Comment on lines 21 to 27
# Deployment mode:
# - distributed=false: Single-node deployment. PVs use local volumes bound to one node. Pods
# automatically tolerate control-plane taints. Only works with worker replicas=1.
# - distributed=true: Multi-node deployment. PVs use hostPath without node affinity, assuming
# unmanaged shared storage (e.g., NFS/CephFS mounted via /etc/fstab) at the same path on all
# nodes.
distributed: false
Contributor

@coderabbitai coderabbitai bot Jan 7, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for validation of replicas constraint when distributed=false

echo "=== Search for validation logic related to distributed and replicas ==="
rg -n --type=yaml -C3 'distributed.*replicas|replicas.*distributed' tools/deployment/package-helm/templates/

echo ""
echo "=== Search for fail/required statements related to replicas ==="
rg -n --type=yaml -C3 'fail.*replicas|required.*replicas' tools/deployment/package-helm/templates/

echo ""
echo "=== Search for any validation or assertion patterns ==="
rg -n --type=yaml -C2 'if.*distributed|fail|required|assert' tools/deployment/package-helm/templates/ | head -100

echo ""
echo "=== List all template files to understand structure ==="
find tools/deployment/package-helm/templates -type f -name "*.yaml" -o -name "*.yml" | sort

Repository: y-scope/clp

Length of output: 4826


🏁 Script executed:

#!/bin/bash
# Check how replicas are configured in worker deployments

echo "=== Compression Worker Deployment ==="
cat -n tools/deployment/package-helm/templates/compression-worker-deployment.yaml | head -50

echo ""
echo "=== Query Worker Deployment ==="
cat -n tools/deployment/package-helm/templates/query-worker-deployment.yaml | head -50

echo ""
echo "=== Check if there are any validating webhook or similar in the chart ==="
fd -e yaml -e yml . tools/deployment/package-helm/ | xargs grep -l "validate\|webhook\|admission" 2>/dev/null || echo "No validation webhooks found"

echo ""
echo "=== Check values.yaml for full replica configuration ==="
rg -n 'replicas|distributed' tools/deployment/package-helm/values.yaml | head -30

Repository: y-scope/clp

Length of output: 5221


Add validation to enforce replicas=1 when distributed=false.

The comment correctly states that single-node mode "only works with worker replicas=1," but there is no validation enforcing this constraint in the Helm templates. Users can override compressionWorker.replicas or queryWorker.replicas to values greater than 1, causing pod scheduling failures or data corruption with local persistent volumes.


Member Author

how in a Helm chart?

Contributor

@junhaoliao, in a Helm chart, you can add validation using the fail function which stops the chart rendering with an error message. Here are two approaches:

Option 1: Add validation in _helpers.tpl

Add this template to tools/deployment/package-helm/templates/_helpers.tpl:

{{/*
Validate that replicas=1 when distributed=false
*/}}
{{- define "clp.validateDistributedMode" -}}
{{- if not .Values.distributed }}
  {{- if gt (int .Values.compressionWorker.replicas) 1 }}
    {{- fail "compressionWorker.replicas must be 1 when distributed=false (single-node mode)" }}
  {{- end }}
  {{- if gt (int .Values.queryWorker.replicas) 1 }}
    {{- fail "queryWorker.replicas must be 1 when distributed=false (single-node mode)" }}
  {{- end }}
  {{- if gt (int .Values.reducer.replicas) 1 }}
    {{- fail "reducer.replicas must be 1 when distributed=false (single-node mode)" }}
  {{- end }}
{{- end }}
{{- end -}}

Then invoke it in any template file (e.g., at the top of compression-worker-deployment.yaml):

{{- include "clp.validateDistributedMode" . }}
---
apiVersion: apps/v1
kind: Deployment
...

Option 2: Create a dedicated validation file

Create tools/deployment/package-helm/templates/validate-values.yaml:

{{- if not .Values.distributed }}
  {{- if or (gt (int .Values.compressionWorker.replicas) 1) (gt (int .Values.queryWorker.replicas) 1) (gt (int .Values.reducer.replicas) 1) }}
    {{- fail "When distributed=false, all worker replicas must be 1. Single-node mode does not support multiple replicas due to local PV constraints." }}
  {{- end }}
{{- end }}

How it works:

  • During helm template, helm install, or helm upgrade, Helm renders all templates
  • The fail function immediately stops rendering and displays the error message
  • Users will see the error before any resources are created

Example error output:

Error: template: clp/templates/validate-values.yaml:3:5: executing "clp/templates/validate-values.yaml" at <fail "When distributed=false, all worker replicas must be 1...">: error calling fail: When distributed=false, all worker replicas must be 1. Single-node mode does not support multiple replicas due to local PV constraints.

Option 2 is cleaner since it keeps validation logic separate and runs automatically without needing to be explicitly included in every template.


🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1818
File: tools/deployment/package-helm/templates/configmap.yaml:12-12
Timestamp: 2025-12-19T05:03:42.629Z
Learning: In the y-scope/clp Helm chart, the `api_server.host` configuration field in the ConfigMap (tools/deployment/package-helm/templates/configmap.yaml) is unused because no other k8s internal services need to reach the API server—it's only accessed from outside the cluster via NodePort.

@junhaoliao junhaoliao requested review from gibber9809 and removed request for gibber9809 January 8, 2026 19:16
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @tools/deployment/package-helm/test-multi-dedicated.sh:
- Line 102: The wait invocation uses an unquoted variable which can cause
word-splitting or globbing; update the wait call that references
SAMPLE_DOWNLOAD_PID to quote the variable (i.e., use the quoted form of
SAMPLE_DOWNLOAD_PID in the wait command) so the PID is passed as a single
argument and safe from splitting or glob expansion.

In @tools/deployment/package-helm/test-multi-shared.sh:
- Line 81: The wait invocation uses an unquoted variable; update the wait
command to use a quoted expansion of SAMPLE_DOWNLOAD_PID (i.e., replace the
unquoted $SAMPLE_DOWNLOAD_PID with a quoted variable expansion) so it matches
the pattern used in test-multi-dedicated.sh and prevents word-splitting or
empty-value issues.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e79a9b0 and 203811b.

📒 Files selected for processing (4)
  • tools/deployment/package-helm/templates/reducer-deployment.yaml
  • tools/deployment/package-helm/test-multi-dedicated.sh
  • tools/deployment/package-helm/test-multi-shared.sh
  • tools/deployment/package-helm/values.yaml
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-07-07T17:43:04.349Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Applied to files:

  • tools/deployment/package-helm/test-multi-dedicated.sh
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Applied to files:

  • tools/deployment/package-helm/test-multi-dedicated.sh
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • tools/deployment/package-helm/test-multi-dedicated.sh
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • tools/deployment/package-helm/test-multi-dedicated.sh
📚 Learning: 2025-07-01T14:52:02.418Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.

Applied to files:

  • tools/deployment/package-helm/test-multi-dedicated.sh
🧬 Code graph analysis (2)
tools/deployment/package-helm/test-multi-shared.sh (2)
tools/deployment/package-helm/.test-common.sh (3)
  • create_clp_directories (58-68)
  • download_samples (72-83)
  • wait_for_pods (18-55)
tools/deployment/package-helm/test.sh (1)
  • wait_for_pods (17-54)
tools/deployment/package-helm/test-multi-dedicated.sh (2)
tools/deployment/package-helm/.test-common.sh (3)
  • create_clp_directories (58-68)
  • download_samples (72-83)
  • wait_for_pods (18-55)
tools/deployment/package-helm/test.sh (1)
  • wait_for_pods (17-54)
🪛 Shellcheck (0.11.0)
tools/deployment/package-helm/test-multi-shared.sh

[info] 11-11: Not following: ./.test-common.sh was not specified as input (see shellcheck -x).

(SC1091)


[info] 81-81: Double quote to prevent globbing and word splitting.

(SC2086)

tools/deployment/package-helm/test-multi-dedicated.sh

[info] 11-11: Not following: ./.test-common.sh was not specified as input (see shellcheck -x).

(SC1091)


[info] 102-102: Double quote to prevent globbing and word splitting.

(SC2086)

🪛 YAMLlint (1.37.1)
tools/deployment/package-helm/templates/reducer-deployment.yaml

[error] 9-9: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: check-generated
🔇 Additional comments (6)
tools/deployment/package-helm/templates/reducer-deployment.yaml (1)

9-9: LGTM! Parameterisation and scheduling configuration align well with distributed deployment support.

The changes enable configurable reducer replicas and integrate the scheduling configuration helper, allowing flexible node placement in multi-node deployments.

Also applies to: 23-26

tools/deployment/package-helm/values.yaml (2)

21-79: Excellent configuration structure with clear documentation.

The distributed flag documentation clearly explains the deployment modes, and the commented scheduling examples provide helpful guidance for users configuring multi-node deployments. The consistent structure across worker types makes the configuration intuitive.


80-85: Clear storage configuration with sensible defaults.

The storage class configuration provides good flexibility while maintaining a sensible default that works out-of-the-box.

tools/deployment/package-helm/test-multi-dedicated.sh (2)

76-89: Node labelling logic is correct for the test scenario.

The logic correctly divides worker nodes into compression and query pools. The assumption that Kind creates the exact number of requested nodes is reasonable for a test script.
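The divide-into-pools idea can be sketched as a position-based assignment: nodes whose index falls below the compression pool size get the compression label, the rest the query label. The function name here is illustrative and not taken from test-multi-dedicated.sh; the `yscope.io/nodeType` label key matches the one the test scripts pass to Helm.

```shell
# Sketch (not the actual test script): assign each Kind worker node to a
# pool by its position in the node list.
pool_for_index() {
    local idx=$1 num_compression=$2
    if [ "$idx" -lt "$num_compression" ]; then
        echo "compression"
    else
        echo "query"
    fi
}

# In a real script this would drive labelling, e.g.:
#   kubectl label node "$node" \
#       "yscope.io/nodeType=$(pool_for_index "$i" "$NUM_COMPRESSION")"
```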


1-11: The script already has error handling through the sourced .test-common.sh file (lines 5-7 of that file set errexit, nounset, and pipefail as separate commands). CLP test scripts follow the pattern of centralizing error handling in the common utilities file rather than duplicating it in each individual script, so no changes are needed.

tools/deployment/package-helm/test-multi-shared.sh (1)

1-11: Verify error handling pattern in test scripts.

Like test-multi-dedicated.sh, this script lacks explicit error handling. Please apply consistent error handling patterns across all test scripts.

This verification applies to both new multi-node test scripts. See the verification request in test-multi-dedicated.sh for the script to check existing patterns.

--set "queryWorker.scheduling.nodeSelector.yscope\.io/nodeType=query" \
--set "reducer.replicas=${REDUCER_REPLICAS}"

wait $SAMPLE_DOWNLOAD_PID
Contributor

🧹 Nitpick | 🔵 Trivial

Quote the variable to prevent potential issues.

The $SAMPLE_DOWNLOAD_PID variable should be quoted to prevent word splitting and globbing, even though it contains a numeric value.

📝 Proposed fix
-wait $SAMPLE_DOWNLOAD_PID
+wait "$SAMPLE_DOWNLOAD_PID"

--set "queryWorker.replicas=${QUERY_WORKER_REPLICAS}" \
--set "reducer.replicas=${REDUCER_REPLICAS}"

wait $SAMPLE_DOWNLOAD_PID
Contributor

🧹 Nitpick | 🔵 Trivial

Quote the variable to prevent potential issues.

The $SAMPLE_DOWNLOAD_PID variable should be quoted, matching the suggested fix in test-multi-dedicated.sh.

📝 Proposed fix
-wait $SAMPLE_DOWNLOAD_PID
+wait "$SAMPLE_DOWNLOAD_PID"

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @tools/deployment/package-helm/.test-common.sh:
- Around line 18-55: In wait_for_pods(), the current timeout check happens after
potentially long kubectl wait calls and stderr is fully suppressed; change the
loop to compute remaining time before calling kubectl wait (e.g.,
remaining=timeout_seconds-SECONDS) and if remaining<=0 return timeout error,
then pass a bounded per-call timeout (min(remaining, wait_timeout_seconds)) to
each kubectl wait invocation; also stop redirecting stderr to /dev/null —
capture kubectl stderr output and log it (via echo or a logger) when kubectl
fails so useful errors (connection failures, missing resources) are visible.
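The bounded per-call timeout described above reduces to a small pure helper: clamp the per-call `kubectl wait` timeout to whatever time remains of the overall budget, returning 0 once the budget is exhausted. This is a sketch under those assumptions, not the committed fix.

```shell
# Returns min(total - elapsed, per_call_max), floored at 0.
remaining_timeout() {
    local total=$1 elapsed=$2 per_call_max=$3
    local remaining=$(( total - elapsed ))
    if [ "$remaining" -le 0 ]; then
        echo 0
    elif [ "$remaining" -lt "$per_call_max" ]; then
        echo "$remaining"
    else
        echo "$per_call_max"
    fi
}

# Usage inside the wait loop (illustrative):
#   t=$(remaining_timeout "$timeout_seconds" "$SECONDS" "$wait_timeout_seconds")
#   if [ "$t" -eq 0 ]; then echo "ERROR: Timed out"; return 1; fi
#   kubectl wait job --all --for=condition=Complete --timeout="${t}s"
```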
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 203811b and 6d2375e.

📒 Files selected for processing (1)
  • tools/deployment/package-helm/.test-common.sh
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-06T10:07:04.654Z
Learnt from: kirkrodrigues
Repo: y-scope/clp PR: 881
File: components/core/tools/scripts/lib_install/macos/install-all.sh:11-12
Timestamp: 2025-05-06T10:07:04.654Z
Learning: In CLP installation scripts, temporary directories with downloaded files should not be automatically cleaned up on failure (e.g., with EXIT traps) to preserve artifacts for debugging purposes.

Applied to files:

  • tools/deployment/package-helm/.test-common.sh
📚 Learning: 2025-07-07T17:43:04.349Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Applied to files:

  • tools/deployment/package-helm/.test-common.sh
🪛 Shellcheck (0.11.0)
tools/deployment/package-helm/.test-common.sh

[warning] 75-75: SAMPLE_DOWNLOAD_PID appears unused. Verify use (or export if used externally).

(SC2034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: check-generated
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (2)
tools/deployment/package-helm/.test-common.sh (2)

1-10: LGTM!

The script header follows CLP project conventions with separate set commands and proper use of parameter expansion for the default value.


58-67: LGTM!

The directory creation is efficient and safe, using mkdir -p with proper quoting to create the entire CLP directory structure in a single command.

Comment on lines +18 to +55
wait_for_pods() {
local timeout_seconds=$1
local poll_interval_seconds=$2
local wait_timeout_seconds=$3

echo "Waiting for all pods to be ready" \
"(timeout=${timeout_seconds}s, poll=${poll_interval_seconds}s," \
"wait=${wait_timeout_seconds}s)..."

# Reset bash built-in SECONDS counter
SECONDS=0

while true; do
sleep "${poll_interval_seconds}"
kubectl get pods

if kubectl wait job \
--all \
--for=condition=Complete \
--timeout="${wait_timeout_seconds}s" 2>/dev/null \
&& kubectl wait pods \
--all \
--selector='!job-name' \
--for=condition=Ready \
--timeout="${wait_timeout_seconds}s" 2>/dev/null
then
echo "All jobs completed and services are ready."
return 0
fi

if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
echo "ERROR: Timed out waiting for pods to be ready"
return 1
fi

echo "---"
done
}
Contributor

⚠️ Potential issue | 🟠 Major

Timeout logic may exceed the specified limit.

The timeout check occurs after the kubectl wait calls (lines 34-42), which can each run for up to wait_timeout_seconds. This means the actual timeout could exceed timeout_seconds by up to wait_timeout_seconds before the function returns.

Additionally, suppressing stderr with 2>/dev/null hides all errors, including legitimate issues such as connection failures or missing resources, which complicates debugging.

♻️ Suggested improvements
  1. Check timeout before waiting:
     while true; do
+        if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
+            echo "ERROR: Timed out waiting for pods to be ready"
+            return 1
+        fi
+
         sleep "${poll_interval_seconds}"
         kubectl get pods
 
         if kubectl wait job \
                 --all \
                 --for=condition=Complete \
                 --timeout="${wait_timeout_seconds}s" 2>/dev/null \
             && kubectl wait pods \
                 --all \
                 --selector='!job-name' \
                 --for=condition=Ready \
                 --timeout="${wait_timeout_seconds}s" 2>/dev/null
         then
             echo "All jobs completed and services are ready."
             return 0
         fi
 
-        if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
-            echo "ERROR: Timed out waiting for pods to be ready"
-            return 1
-        fi
-
         echo "---"
     done
  2. Capture and log kubectl errors instead of suppressing them entirely:
-        if kubectl wait job \
-                --all \
-                --for=condition=Complete \
-                --timeout="${wait_timeout_seconds}s" 2>/dev/null \
-            && kubectl wait pods \
-                --all \
-                --selector='!job-name' \
-                --for=condition=Ready \
-                --timeout="${wait_timeout_seconds}s" 2>/dev/null
+        local job_wait_output
+        local pod_wait_output
+        job_wait_output=$(kubectl wait job \
+                --all \
+                --for=condition=Complete \
+                --timeout="${wait_timeout_seconds}s" 2>&1) \
+            && pod_wait_output=$(kubectl wait pods \
+                --all \
+                --selector='!job-name' \
+                --for=condition=Ready \
+                --timeout="${wait_timeout_seconds}s" 2>&1)

Comment on lines 71 to 82
download_samples() {
echo "Downloading sample datasets..."
wget -O - https://zenodo.org/records/10516402/files/postgresql.tar.gz?download=1 \
| tar xz -C "$CLP_HOME/samples" &
SAMPLE_DOWNLOAD_PID=$!

# Generate sample log file for garbage collector testing
cat <<EOF > "$CLP_HOME/samples/test-gc.jsonl"
{"timestamp": $(date +%s%3N), "level": "INFO", "message": "User login successful"}
{"timestamp": $(date +%s%3N), "level": "ERROR", "message": "Database connection failed"}
EOF
}
Contributor

⚠️ Potential issue | 🟠 Major

Add error handling for background download.

The background download (lines 73-74) fails silently if wget or tar encounters errors (network issues, HTTP errors, corrupted archive, etc.). This can lead to difficult-to-debug test failures when the caller attempts to use the samples.

Regarding the static analysis warning about SAMPLE_DOWNLOAD_PID being unused: this is a false positive, as the variable is used by test scripts that source this file.

♻️ Proposed improvement to detect failures earlier

Add a validation function that can be called by sourcing scripts:

 download_samples() {
     echo "Downloading sample datasets..."
     wget -O - https://zenodo.org/records/10516402/files/postgresql.tar.gz?download=1 \
       | tar xz -C "$CLP_HOME/samples" &
     SAMPLE_DOWNLOAD_PID=$!
 
     # Generate sample log file for garbage collector testing
     cat <<EOF > "$CLP_HOME/samples/test-gc.jsonl"
 {"timestamp": $(date +%s%3N), "level": "INFO", "message": "User login successful"}
 {"timestamp": $(date +%s%3N), "level": "ERROR", "message": "Database connection failed"}
 EOF
 }
+
+# Waits for sample download to complete and validates success
+# @return {int} 0 on success, 1 on failure
+wait_for_sample_download() {
+    if [[ -z "${SAMPLE_DOWNLOAD_PID:-}" ]]; then
+        echo "ERROR: SAMPLE_DOWNLOAD_PID not set. Call download_samples first."
+        return 1
+    fi
+
+    echo "Waiting for sample download (PID ${SAMPLE_DOWNLOAD_PID}) to complete..."
+    if wait "${SAMPLE_DOWNLOAD_PID}"; then
+        echo "Sample download completed successfully."
+        return 0
+    else
+        echo "ERROR: Sample download failed with exit code $?"
+        return 1
+    fi
+}

This allows sourcing scripts to explicitly wait and check for download failures.
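The wait-and-check pattern itself can be exercised without the real download; below, a toy background job stands in for the `wget | tar` pipeline:

```shell
# Stub background job in place of the wget | tar pipeline.
(sleep 0.1; true) &
SAMPLE_DOWNLOAD_PID=$!

# Reap the background job and record whether it succeeded.
if wait "${SAMPLE_DOWNLOAD_PID}"; then
    status_msg="download ok"
else
    status_msg="download failed"
fi
echo "${status_msg}"
```

Because `wait PID` returns the job's exit status, a failing pipeline surfaces as a non-zero return instead of being silently discarded.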

🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 75-75: SAMPLE_DOWNLOAD_PID appears unused. Verify use (or export if used externally).

(SC2034)

@gibber9809 gibber9809 left a comment

Mostly looks good to me!

Besides my other comment, I'm also wondering if there might be a better name for the test scripts. As far as I can tell, what test-multi-{dedicated,shared}.sh are doing is creating a cluster in a given state with some data available for testing, but these scripts aren't themselves testing anything beyond being able to start the cluster. It looks like the intention is to either run some scripts for automatic testing or do some manual testing after running these setup scripts. Maybe a better name for them would be setup-multi-{dedicated,shared}-test.sh?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @tools/deployment/package-helm/templates/_helpers.tpl:
- Around line 142-155: Update the chart documentation and values.yaml comments
to explicitly call out security implications of using hostPath in distributed
mode (referencing the .root.Values.distributed flag and the hostPath/local
volume behavior in _helpers.tpl): state that hostPath grants direct host
filesystem access, recommend restricting use via RBAC, PodSecurityPolicies/PSPs
or PodSecurityAdmission, and warn about multi-tenant risks; also add a
recommended runtime validation approach (mention using an initContainer or a
pre-install/helm hook job to verify the shared mount exists at the configured
path on all nodes) and provide a concise example checklist to validate mounts
before enabling distributed mode.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d2375e and 3b60031.

📒 Files selected for processing (1)
  • tools/deployment/package-helm/templates/_helpers.tpl
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: check-generated
🔇 Additional comments (4)
tools/deployment/package-helm/templates/_helpers.tpl (4)

114-128: LGTM! Clear documentation and appropriate rename.

The rename from clp.createLocalPv to clp.createStaticPv accurately reflects the template's expanded responsibility. The documentation clearly explains the conditional behavior based on the distributed flag.


178-178: LGTM! Consistent storageClassName usage.

The configurable storageClassName properly aligns with the PV template change on Line 141, ensuring correct PV-PVC binding.


260-277: LGTM! Well-structured AWS config helpers.

The new helpers follow the established pattern for volume and volumeMount definitions. Using hostPath is appropriate here for accessing host-level AWS credentials.


305-344: LGTM! Solid scheduling configuration helper with smart defaults.

The template provides comprehensive scheduling control while automatically adding control-plane toleration in single-node mode—allowing pods to schedule on tainted control-plane nodes without manual intervention. The logic correctly:

  • Uses dynamic component lookup with safe fallback
  • Auto-injects the appropriate toleration for node-role.kubernetes.io/control-plane
  • Conditionally renders only non-empty scheduling configurations

The NoSchedule effect on Line 325 is appropriate for the most common control-plane taint scenario.
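For reference, the per-component scheduling block would be driven from values resembling the following. This is a hypothetical values.yaml fragment: the `scheduling.nodeSelector` key mirrors the `--set` flags used by the setup scripts, while the toleration, spread constraint, and component label are purely illustrative:

```yaml
compressionWorker:
  replicas: 2
  scheduling:
    nodeSelector:
      yscope.io/nodeType: "compression"
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "compression"
        effect: "NoSchedule"
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "kubernetes.io/hostname"
        whenUnsatisfiable: "ScheduleAnyway"
        labelSelector:
          matchLabels:
            app.kubernetes.io/component: "compression-worker"
```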

Comment on lines 142 to 155
{{- if .root.Values.distributed }}
hostPath:
  path: {{ .hostPath | quote }}
  type: "DirectoryOrCreate"
{{- else }}
local:
  path: {{ .hostPath | quote }}
nodeAffinity:
  required:
    nodeSelectorTerms:
      - matchExpressions:
          - key: {{ printf "node-role.kubernetes.io/%s" .nodeRole | quote }}
            operator: "Exists"
      - matchExpressions:
          - key: "node-role.kubernetes.io/control-plane"
            operator: "Exists"
{{- end }}
{{- end }}{{/* if .root.Values.distributed */}}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for documentation about shared storage requirements for distributed mode
echo "=== Searching for documentation about shared storage/distributed mode ==="
rg -i -C3 'shared.storage|nfs|cephfs|distributed.*storage|hostpath' --type=md || echo "No markdown matches found"
echo ""
echo "=== Searching in all files for distributed mode documentation ==="
rg -i 'distributed.*mode|shared.*storage' --type=txt --type=md || echo "No additional matches found"

Repository: y-scope/clp

Length of output: 848


🏁 Script executed:

#!/bin/bash
# Verify storageClassName is defined in values.yaml
echo "=== Checking values.yaml files for storage configuration ==="
fd -e yaml -e yml 'values' --exec cat {} \; 2>/dev/null | head -200

echo ""
echo "=== Searching for storageClassName in all YAML files ==="
rg 'storageClassName|storage:' -A2 -B2 --type=yaml

Repository: y-scope/clp

Length of output: 16416


🏁 Script executed:

#!/bin/bash
# Check for distributed mode documentation in README and main guides
echo "=== Checking README files ==="
fd -e md 'README|DEPLOY|INSTALL' --exec grep -l 'distributed\|shared.storage' {} \;

echo ""
echo "=== Checking deployment documentation ==="
fd -e md -path '*deploy*' --exec grep -H -A2 -B2 'distributed\|hostpath' {} \; | head -50

Repository: y-scope/clp

Length of output: 366


🏁 Script executed:

#!/bin/bash
# Check README files for distributed mode documentation
echo "=== Main README.md content (searching for distributed/storage) ==="
cat README.md | grep -A5 -B5 -i 'distributed\|shared.*storage\|multi.*host' | head -80

echo ""
echo "=== Checking deployment guides ==="
fd 'guides' --type f -e md | head -5 | xargs -I {} sh -c 'echo "=== {} ===" && cat {} | grep -A3 -B3 -i "distributed\|shared.*storage" | head -30'

Repository: y-scope/clp

Length of output: 1147


Document security implications and recommended validation approach for distributed mode hostPath volumes.

The distributed mode documentation in values.yaml already states that hostPath volumes assume "unmanaged shared storage (e.g., NFS/CephFS mounted via /etc/fstab) at the same path on all nodes." However, two gaps remain:

  1. Security implications: hostPath volumes grant direct host filesystem access. The documentation should recommend RBAC restrictions and clarify security considerations for multi-tenant clusters.

  2. Runtime validation: Consider documenting a recommended validation approach (e.g., init container or pre-install validation) to verify that the shared storage is actually mounted on all nodes before pods attempt to use the hostPath volumes. This would catch configuration errors early rather than leading to data isolation or corruption issues at runtime.
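One way to realize the suggested runtime validation is an initContainer that refuses to let the pod start when the shared mount is missing. Note that checking `mountpoint` inside the container is unreliable here — a hostPath volume always appears as a bind mount — so a sentinel file on the shared filesystem is a more robust probe. This is a sketch only; the image, mount path, volume name, and marker filename are illustrative:

```yaml
initContainers:
  - name: verify-shared-mount
    image: "busybox:1.36"
    command:
      - "sh"
      - "-c"
      # A sentinel file created once on the shared filesystem; if it is
      # missing, the hostPath is a fresh local directory, not the shared mount.
      - 'test -f /mnt/clp/.clp-shared-marker || { echo "shared storage not mounted at /mnt/clp" >&2; exit 1; }'
    volumeMounts:
      - name: clp-storage
        mountPath: /mnt/clp
```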

🤖 Prompt for AI Agents
In @tools/deployment/package-helm/templates/_helpers.tpl around lines 142 - 155,
Update the chart documentation and values.yaml comments to explicitly call out
security implications of using hostPath in distributed mode (referencing the
.root.Values.distributed flag and the hostPath/local volume behavior in
_helpers.tpl): state that hostPath grants direct host filesystem access,
recommend restricting use via RBAC, PodSecurityPolicies/PSPs or
PodSecurityAdmission, and warn about multi-tenant risks; also add a recommended
runtime validation approach (mention using an initContainer or a
pre-install/helm hook job to verify the shared mount exists at the configured
path on all nodes) and provide a concise example checklist to validate mounts
before enabling distributed mode.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

🤖 Fix all issues with AI agents
In @tools/deployment/package-helm/.set-up-common.sh:
- Around line 10-20: The create_clp_directories function blindly uses CLP_HOME
which may be empty and cause directories to be created in the cwd; add a
defensive check at the start of create_clp_directories to verify CLP_HOME is
non-empty (eg. test -z "${CLP_HOME:-}" or similar) and if it is empty print a
clear error to stderr referencing CLP_HOME and exit with a non-zero status so
the mkdir calls never run.
- Around line 96-139: In wait_for_pods, don’t permanently discard kubectl wait
stderr; keep the current 2>/dev/null suppression during normal polling but on
the final attempt (when SECONDS + poll_interval_seconds >= timeout_seconds or
right before returning timeout) run the kubectl wait job and kubectl wait pods
commands without the 2>/dev/null redirection so error diagnostics are preserved;
update the conditional that checks kubectl wait (inside wait_for_pods) to choose
whether to include the redirection based on remaining time and use the same
function/variable names (wait_for_pods, kubectl wait, SECONDS, timeout_seconds,
poll_interval_seconds, wait_timeout_seconds).
- Around line 37-48: Before running rm -rf, validate CLP_HOME in
prepare_environment: resolve CLP_HOME to an absolute path (realpath/readlink
-f), ensure it is non-empty, not "/" and either resides under /tmp or /var/tmp
or contains the token "clp" (so accidentally-pointing values like "/" or "/home"
are rejected); if validation fails, print an error and exit non-zero instead of
deleting; keep the rest of the flow (create_clp_directories, download_samples)
unchanged.
- Around line 22-35: The background download started in download_samples() sets
SAMPLE_DOWNLOAD_PID but doesn't verify success; after wait_for_cluster_ready
waits for that PID, add an explicit wait on SAMPLE_DOWNLOAD_PID and check its
exit status (the PID variable SAMPLE_DOWNLOAD_PID returned by download_samples);
if the wait returns non-zero log a clear error via the existing logger (or echo)
including the PID and fail the script (exit non‑zero) or surface the error so
failures are diagnosable. Ensure the check references SAMPLE_DOWNLOAD_PID and is
placed after the existing wait logic in wait_for_cluster_ready.

In @tools/deployment/package-helm/set-up-multi-dedicated-test.sh:
- Around line 36-49: The script assumes worker_nodes length matches
total_workers; add validation after populating worker_nodes to compare
${#worker_nodes[@]} against total_workers and NUM_COMPRESSION_NODES, and if
insufficient, log a clear error and exit (or wait/retry) to avoid out-of-bounds
access. Specifically, check the array length before both labeling loops
(referencing worker_nodes, NUM_COMPRESSION_NODES, total_workers) and bail with a
descriptive message if worker_nodes count < total_workers or <
NUM_COMPRESSION_NODES, or adjust the loop bounds to min(total_workers,
actual_count) to safely label only available nodes.
- Around line 51-60: Replace the fixed sleep by using Helm's built-in wait:
change the uninstall call that currently reads helm uninstall test
--ignore-not-found to use helm uninstall test --ignore-not-found --wait (and
optionally add a suitable --timeout) and remove the subsequent sleep 2 line in
set-up-multi-dedicated-test.sh so the script waits for resource cleanup reliably
before running helm install.

In @tools/deployment/package-helm/set-up-multi-shared-test.sh:
- Around line 32-39: The script uses a fixed "sleep 2" between helm uninstall
and helm install which is flaky; replace the sleep by waiting for helm to finish
uninstall and resources to be cleaned up by using helm uninstall with --wait
(and optionally --timeout) or run helm install with --wait to ensure the release
is ready; update the commands referencing helm uninstall/test and helm install
"${script_dir}" (and keep the existing --set flags like
compressionWorker.replicas, queryWorker.replicas, reducer.replicas) to use the
appropriate --wait/--timeout flags instead of the fixed sleep.

In @tools/deployment/package-helm/set-up-test.sh:
- Around line 23-26: The fixed 2-second sleep after "helm uninstall test" is
fragile; replace it with a robust wait: either add the --wait flag to the helm
uninstall call (use "helm uninstall test --wait") or implement a polling loop
that repeatedly checks for resource/release removal (e.g., loop on "helm status
test" or "kubectl get all -l release=test" until not found, with a configurable
timeout) before running "helm install test \"${script_dir}\""; update the script
to surface a clear timeout/error if the release/resources do not disappear in
time.
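The prompts above all replace a fixed sleep with an explicit wait. Where `helm uninstall --wait` is unavailable or insufficient, the polling-loop variant can be sketched generically; the stub probe (`false`) below stands in for a real check such as `helm status test`:

```shell
# Generic poller: block until a probe command stops succeeding, or time out.
wait_until_gone() {
    local timeout=$1
    shift
    local start=${SECONDS}
    while "$@" >/dev/null 2>&1; do
        if ((SECONDS - start >= timeout)); then
            echo "ERROR: timed out waiting for removal" >&2
            return 1
        fi
        sleep 1
    done
}

# Stub probe in place of `helm status test`: it fails immediately,
# so the release counts as already gone.
wait_until_gone 5 false && echo "release gone"
```

The probe is passed as arguments and executed with `"$@"`, so no `eval` or string re-parsing is needed.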
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b60031 and 3734d5e.

📒 Files selected for processing (5)
  • tools/deployment/package-helm/.set-up-common.sh
  • tools/deployment/package-helm/set-up-multi-dedicated-test.sh
  • tools/deployment/package-helm/set-up-multi-shared-test.sh
  • tools/deployment/package-helm/set-up-test.sh
  • tools/deployment/package-helm/test.sh
💤 Files with no reviewable changes (1)
  • tools/deployment/package-helm/test.sh
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-07T17:43:04.349Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
🧬 Code graph analysis (2)
tools/deployment/package-helm/set-up-multi-dedicated-test.sh (1)
tools/deployment/package-helm/.set-up-common.sh (3)
  • prepare_environment (39-48)
  • generate_kind_config (53-94)
  • wait_for_cluster_ready (142-147)
tools/deployment/package-helm/set-up-test.sh (1)
tools/deployment/package-helm/.set-up-common.sh (3)
  • prepare_environment (39-48)
  • generate_kind_config (53-94)
  • wait_for_cluster_ready (142-147)
🪛 Shellcheck (0.11.0)
tools/deployment/package-helm/set-up-multi-shared-test.sh

[info] 17-17: Not following: .set-up-common.sh was not specified as input (see shellcheck -x).

(SC1091)

tools/deployment/package-helm/set-up-multi-dedicated-test.sh

[info] 18-18: Not following: .set-up-common.sh was not specified as input (see shellcheck -x).

(SC1091)

tools/deployment/package-helm/set-up-test.sh

[info] 12-12: Not following: .set-up-common.sh was not specified as input (see shellcheck -x).

(SC1091)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: check-generated
🔇 Additional comments (5)
tools/deployment/package-helm/set-up-test.sh (1)

1-12: LGTM!

The initialization is correct and follows shell scripting best practices. The script_dir calculation, variable defaults, and sourcing pattern are all appropriate.

tools/deployment/package-helm/set-up-multi-shared-test.sh (1)

1-26: LGTM!

The initialization and configuration are well-structured with clear defaults and informative output. The additional replica configuration variables are appropriately defined.

tools/deployment/package-helm/set-up-multi-dedicated-test.sh (1)

1-28: LGTM!

The initialization follows the established pattern and provides clear configuration for the dedicated worker node setup.

tools/deployment/package-helm/.set-up-common.sh (2)

1-8: LGTM!

The script header and error-handling flags are correctly configured. The use of separate set -o commands aligns with CLP project conventions.

Based on learnings, the CLP project maintains consistency by using separate set commands rather than combining them.


50-94: LGTM!

The Kind cluster configuration generation is well-structured and correctly handles both single-node and multi-node scenarios. The heredoc approach is appropriate, and variable expansion is handled correctly.
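For context, a generated Kind config for the multi-node case presumably resembles the following. This is an illustrative sketch, not the script's actual heredoc output: the label key matches the one used by the `helm install` flags, and the mount path assumes the default CLP_HOME of /tmp/clp (the dedicated-node script may instead apply labels via kubectl after cluster creation):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /tmp/clp
        containerPath: /tmp/clp
  - role: worker
    labels:
      yscope.io/nodeType: compression
    extraMounts:
      - hostPath: /tmp/clp
        containerPath: /tmp/clp
  - role: worker
    labels:
      yscope.io/nodeType: query
    extraMounts:
      - hostPath: /tmp/clp
        containerPath: /tmp/clp
```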

Comment on lines 10 to 20
# Creates required directories for CLP data and logs
create_clp_directories() {
    echo "Creating CLP directories at ${CLP_HOME}..."
    mkdir -p "$CLP_HOME/var/"{data,log}/{database,queue,redis,results_cache} \
        "$CLP_HOME/var/data/"{archives,streams,staged-archives,staged-streams} \
        "$CLP_HOME/var/log/"{compression_scheduler,compression_worker,user} \
        "$CLP_HOME/var/log/"{query_scheduler,query_worker,reducer} \
        "$CLP_HOME/var/log/"{api_server,garbage_collector,mcp_server} \
        "$CLP_HOME/var/tmp" \
        "$CLP_HOME/samples"
}

🧹 Nitpick | 🔵 Trivial

Consider validating CLP_HOME.

While set -o nounset prevents unset variables, an empty CLP_HOME would create directories in the current working directory. Since all callers explicitly set CLP_HOME, this is a low-risk edge case, but a defensive check would improve robustness.

♻️ Optional improvement
 # Creates required directories for CLP data and logs
 create_clp_directories() {
+    if [[ -z "${CLP_HOME:-}" ]]; then
+        echo "ERROR: CLP_HOME must be set"
+        return 1
+    fi
     echo "Creating CLP directories at ${CLP_HOME}..."
     mkdir -p  "$CLP_HOME/var/"{data,log}/{database,queue,redis,results_cache} \
               "$CLP_HOME/var/data/"{archives,streams,staged-archives,staged-streams} \
               "$CLP_HOME/var/log/"{compression_scheduler,compression_worker,user} \
               "$CLP_HOME/var/log/"{query_scheduler,query_worker,reducer} \
               "$CLP_HOME/var/log/"{api_server,garbage_collector,mcp_server} \
               "$CLP_HOME/var/tmp" \
               "$CLP_HOME/samples"
 }
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/.set-up-common.sh around lines 10 - 20, The
create_clp_directories function blindly uses CLP_HOME which may be empty and
cause directories to be created in the cwd; add a defensive check at the start
of create_clp_directories to verify CLP_HOME is non-empty (eg. test -z
"${CLP_HOME:-}" or similar) and if it is empty print a clear error to stderr
referencing CLP_HOME and exit with a non-zero status so the mkdir calls never
run.

Comment on lines +37 to +48
# Cleans up existing cluster and prepares environment
# @param {string} cluster_name Name of the kind cluster
prepare_environment() {
    local cluster_name=$1

    echo "Deleting existing cluster if present..."
    kind delete cluster --name "${cluster_name}" 2>/dev/null || true

    rm -rf "$CLP_HOME"
    create_clp_directories
    download_samples
}

⚠️ Potential issue | 🟠 Major

Validate CLP_HOME before destructive operation.

The rm -rf "$CLP_HOME" command at line 45 could delete unintended data if CLP_HOME is set to an incorrect value (e.g., /, /home, etc.). While the default /tmp/clp is safe and the calling scripts set it explicitly, adding validation would prevent accidents.

🛡️ Suggested safeguard
 # Cleans up existing cluster and prepares environment
 # @param {string} cluster_name Name of the kind cluster
 prepare_environment() {
     local cluster_name=$1
+    
+    # Validate CLP_HOME is set and looks safe
+    if [[ -z "${CLP_HOME:-}" ]] || [[ "$CLP_HOME" == "/" ]] || [[ "$CLP_HOME" == "/home" ]]; then
+        echo "ERROR: Invalid or unsafe CLP_HOME: ${CLP_HOME:-<unset>}"
+        return 1
+    fi
 
     echo "Deleting existing cluster if present..."
     kind delete cluster --name "${cluster_name}" 2>/dev/null || true
 
     rm -rf "$CLP_HOME"
     create_clp_directories
     download_samples
 }
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/.set-up-common.sh around lines 37 - 48, Before
running rm -rf, validate CLP_HOME in prepare_environment: resolve CLP_HOME to an
absolute path (realpath/readlink -f), ensure it is non-empty, not "/" and either
resides under /tmp or /var/tmp or contains the token "clp" (so
accidentally-pointing values like "/" or "/home" are rejected); if validation
fails, print an error and exit non-zero instead of deleting; keep the rest of
the flow (create_clp_directories, download_samples) unchanged.
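The guard described in the prompt can be sketched as follows (the allowed-path policy — /tmp, /var/tmp, or a path containing "clp" — follows the prompt; the function name is made up):

```shell
# Sketch of the suggested CLP_HOME guard before `rm -rf "$CLP_HOME"`.
validate_clp_home() {
    if [[ -z "${CLP_HOME:-}" ]]; then
        echo "ERROR: CLP_HOME is unset or empty" >&2
        return 1
    fi
    local resolved
    resolved="$(realpath -m -- "${CLP_HOME}")" || return 1
    case "${resolved}" in
        /) echo "ERROR: refusing to remove /" >&2; return 1 ;;
        /tmp/*|/var/tmp/*) ;;  # allowed scratch locations
        *clp*) ;;              # path mentions "clp"; assume intentional
        *)
            echo "ERROR: CLP_HOME '${resolved}' is not in an allowed location" >&2
            return 1
            ;;
    esac
}

CLP_HOME="/tmp/clp" validate_clp_home && echo "safe to remove"
```

`realpath -m` resolves the path without requiring it to exist yet, so the check also works on the first run before the directories are created.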

Comment on lines +96 to +139
# Waits for all jobs to complete and all non-job pods to be ready.
#
# @param {int} timeout_seconds Overall timeout in seconds
# @param {int} poll_interval_seconds Interval between status checks
# @param {int} wait_timeout_seconds Timeout for each kubectl wait call
# @return {int} 0 on success, 1 on timeout
wait_for_pods() {
    local timeout_seconds=$1
    local poll_interval_seconds=$2
    local wait_timeout_seconds=$3

    echo "Waiting for all pods to be ready" \
        "(timeout=${timeout_seconds}s, poll=${poll_interval_seconds}s," \
        "wait=${wait_timeout_seconds}s)..."

    # Reset bash built-in SECONDS counter
    SECONDS=0

    while true; do
        sleep "${poll_interval_seconds}"
        kubectl get pods

        if kubectl wait job \
                --all \
                --for=condition=Complete \
                --timeout="${wait_timeout_seconds}s" 2>/dev/null \
            && kubectl wait pods \
                --all \
                --selector='!job-name' \
                --for=condition=Ready \
                --timeout="${wait_timeout_seconds}s" 2>/dev/null
        then
            echo "All jobs completed and services are ready."
            return 0
        fi

        if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
            echo "ERROR: Timed out waiting for pods to be ready"
            return 1
        fi

        echo "---"
    done
}

🧹 Nitpick | 🔵 Trivial

Consider preserving error messages during final wait attempt.

The polling logic is sound, but suppressing stderr with 2>/dev/null on every kubectl wait call hides potentially useful error messages. While this reduces noise during normal polling, it makes debugging harder when pods fail to become ready.

♻️ Optional improvement for better diagnostics
     while true; do
         sleep "${poll_interval_seconds}"
         kubectl get pods
 
+        # On final attempt, show errors for debugging
+        if [[ ${SECONDS} -ge $((timeout_seconds - poll_interval_seconds)) ]]; then
+            redirect=""
+        else
+            redirect="2>/dev/null"
+        fi
+
-        if kubectl wait job \
+        if eval kubectl wait job \
                 --all \
                 --for=condition=Complete \
-                --timeout="${wait_timeout_seconds}s" 2>/dev/null \
-            && kubectl wait pods \
+                --timeout="${wait_timeout_seconds}s" $redirect \
+            && eval kubectl wait pods \
                 --all \
                 --selector='!job-name' \
                 --for=condition=Ready \
-                --timeout="${wait_timeout_seconds}s" 2>/dev/null
+                --timeout="${wait_timeout_seconds}s" $redirect
         then
             echo "All jobs completed and services are ready."
             return 0
         fi

Note: This uses eval to conditionally apply redirection, which requires careful handling.
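As a side note on that caveat: the `eval` can be avoided entirely by pushing the redirection decision into a small wrapper function, since bash applies redirections per simple command and `"$@"` forwards the command without re-parsing (a sketch; the function name is made up):

```shell
# Conditionally suppress stderr without eval: the wrapper makes the
# decision, and the command is passed as arguments.
run_quietly_unless_final() {
    local final_attempt=$1
    shift
    if [[ "${final_attempt}" == "true" ]]; then
        "$@"                # final attempt: let stderr through for diagnostics
    else
        "$@" 2>/dev/null    # normal polling: keep the output quiet
    fi
}

run_quietly_unless_final true echo "diagnostics visible"
```

In `wait_for_pods`, the `kubectl wait` invocations would then be wrapped in this helper with the final-attempt flag computed from the remaining time.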

🤖 Prompt for AI Agents
In @tools/deployment/package-helm/.set-up-common.sh around lines 96 - 139, In
wait_for_pods, don’t permanently discard kubectl wait stderr; keep the current
2>/dev/null suppression during normal polling but on the final attempt (when
SECONDS + poll_interval_seconds >= timeout_seconds or right before returning
timeout) run the kubectl wait job and kubectl wait pods commands without the
2>/dev/null redirection so error diagnostics are preserved; update the
conditional that checks kubectl wait (inside wait_for_pods) to choose whether to
include the redirection based on remaining time and use the same
function/variable names (wait_for_pods, kubectl wait, SECONDS, timeout_seconds,
poll_interval_seconds, wait_timeout_seconds).

Comment on lines 51 to 60
echo "Installing Helm chart..."
helm uninstall test --ignore-not-found
sleep 2
helm install test "${script_dir}" \
    --set "distributed=true" \
    --set "compressionWorker.replicas=${COMPRESSION_WORKER_REPLICAS}" \
    --set "compressionWorker.scheduling.nodeSelector.yscope\.io/nodeType=compression" \
    --set "queryWorker.replicas=${QUERY_WORKER_REPLICAS}" \
    --set "queryWorker.scheduling.nodeSelector.yscope\.io/nodeType=query" \
    --set "reducer.replicas=${REDUCER_REPLICAS}"
Contributor
🧹 Nitpick | 🔵 Trivial

Consider using --wait flag instead of fixed sleep.

Consistent with the other setup scripts, replace the fixed 2-second sleep with helm uninstall --wait for more reliable resource cleanup.

♻️ Suggested improvement
 echo "Installing Helm chart..."
-helm uninstall test --ignore-not-found
-sleep 2
+helm uninstall test --ignore-not-found --wait
 helm install test "${script_dir}" \
     --set "distributed=true" \
     --set "compressionWorker.replicas=${COMPRESSION_WORKER_REPLICAS}" \
     --set "compressionWorker.scheduling.nodeSelector.yscope\.io/nodeType=compression" \
     --set "queryWorker.replicas=${QUERY_WORKER_REPLICAS}" \
     --set "queryWorker.scheduling.nodeSelector.yscope\.io/nodeType=query" \
     --set "reducer.replicas=${REDUCER_REPLICAS}"
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/set-up-multi-dedicated-test.sh around lines 51
- 60, Replace the fixed sleep by using Helm's built-in wait: change the
uninstall call that currently reads helm uninstall test --ignore-not-found to
use helm uninstall test --ignore-not-found --wait (and optionally add a suitable
--timeout) and remove the subsequent sleep 2 line in
set-up-multi-dedicated-test.sh so the script waits for resource cleanup reliably
before running helm install.

Comment on lines 32 to 39
echo "Installing Helm chart..."
helm uninstall test --ignore-not-found
sleep 2
helm install test "${script_dir}" \
--set "distributed=true" \
--set "compressionWorker.replicas=${COMPRESSION_WORKER_REPLICAS}" \
--set "queryWorker.replicas=${QUERY_WORKER_REPLICAS}" \
--set "reducer.replicas=${REDUCER_REPLICAS}"
Contributor
🧹 Nitpick | 🔵 Trivial

Consider using --wait flag instead of fixed sleep.

Similar to the single-node setup, the 2-second sleep between helm uninstall and helm install could be replaced with the --wait flag for more reliable resource cleanup.

♻️ Suggested improvement
 echo "Installing Helm chart..."
-helm uninstall test --ignore-not-found
-sleep 2
+helm uninstall test --ignore-not-found --wait
 helm install test "${script_dir}" \
     --set "distributed=true" \
     --set "compressionWorker.replicas=${COMPRESSION_WORKER_REPLICAS}" \
     --set "queryWorker.replicas=${QUERY_WORKER_REPLICAS}" \
     --set "reducer.replicas=${REDUCER_REPLICAS}"
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/set-up-multi-shared-test.sh around lines 32 -
39, The script uses a fixed "sleep 2" between helm uninstall and helm install
which is flaky; replace the sleep by waiting for helm to finish uninstall and
resources to be cleaned up by using helm uninstall with --wait (and optionally
--timeout) or run helm install with --wait to ensure the release is ready;
update the commands referencing helm uninstall/test and helm install
"${script_dir}" (and keep the existing --set flags like
compressionWorker.replicas, queryWorker.replicas, reducer.replicas) to use the
appropriate --wait/--timeout flags instead of the fixed sleep.

Comment on lines +23 to +26
echo "Installing Helm chart..."
helm uninstall test --ignore-not-found
sleep 2
helm install test "${script_dir}"
Contributor
🧹 Nitpick | 🔵 Trivial

Consider polling instead of fixed sleep.

The 2-second sleep between helm uninstall and helm install is a workaround to allow resource finalization. For more robust behaviour, consider polling for resource deletion or using helm uninstall --wait, which blocks until resources are deleted.

♻️ Suggested improvement
 echo "Installing Helm chart..."
-helm uninstall test --ignore-not-found
-sleep 2
+helm uninstall test --ignore-not-found --wait
 helm install test "${script_dir}"
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/set-up-test.sh around lines 23 - 26, The fixed
2-second sleep after "helm uninstall test" is fragile; replace it with a robust
wait: either add the --wait flag to the helm uninstall call (use "helm uninstall
test --wait") or implement a polling loop that repeatedly checks for
resource/release removal (e.g., loop on "helm status test" or "kubectl get all
-l release=test" until not found, with a configurable timeout) before running
"helm install test \"${script_dir}\""; update the script to surface a clear
timeout/error if the release/resources do not disappear in time.
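The polling approach described above can be factored into a generic helper that treats any check command (such as `helm status test`) as "the resource still exists"; a minimal sketch, where `wait_until_gone` is a hypothetical name:

```shell
# Polls until the given check command fails (i.e., the resource is gone)
# or the timeout elapses. Returns 0 once gone, 1 on timeout.
wait_until_gone() {
    local timeout_seconds=$1
    local poll_interval_seconds=$2
    shift 2
    # Reset bash built-in SECONDS counter
    SECONDS=0
    while "$@" >/dev/null 2>&1; do
        if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
            echo "ERROR: timed out waiting for resource deletion" >&2
            return 1
        fi
        sleep "${poll_interval_seconds}"
    done
    return 0
}

# Hypothetical usage after uninstalling:
#   helm uninstall test --ignore-not-found
#   wait_until_gone 60 2 helm status test
```

In practice `helm uninstall --wait` is the simpler fix; the helper is only needed if finalizer-held resources outlive the release.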

@junhaoliao
Member Author

Besides my other comment I'm also wondering if there might be a better name for the test scripts. As far as I can tell what test-multi-{dedicated,shared}.sh are doing is creating a cluster in a given state with some data available for testing, but these scripts aren't themselves testing anything besides being able to start the cluster. Looks like the intention is to either run some scripts for automatic testing or do some manual testing after running these setup scripts. Maybe a better name for them might be setup-multi-{dedicated,shared}-test.sh?

The .sh scripts are meant to be replaced with integration tests in the future, so we didn't intend to write extensive tests in them. That said, looking at what the scripts currently do, I very much agree a different name would better describe what they do. I just made a slight modification to your proposal: setup- -> set-up-

I did realize I forgot to update the test.sh script (which sets up a single-node cluster with the chart) to use the new common.sh, and certain shared logic was out of sync because of that. I tried fixing the inconsistencies in the latest commit. Could you help check again?

@junhaoliao junhaoliao requested a review from gibber9809 January 10, 2026 07:38
Rename test scripts to set-up-* to reflect they set up test
environments rather than run tests:
- test.sh -> set-up-test.sh
- test-multi-dedicated.sh -> set-up-multi-dedicated-test.sh
- test-multi-shared.sh -> set-up-multi-shared-test.sh
- .test-common.sh -> .set-up-common.sh
- Extract common functions to .set-up-common.sh:
  - prepare_environment: cluster cleanup and directory setup
  - generate_kind_config: kind cluster YAML generation
  - wait_for_cluster_ready: sample download and pod readiness
- Refactor set-up-test.sh to use common utilities instead of
  duplicating code
- Unify default cluster name to clp-test across all scripts
- Add missing port 30800 mapping to multi-node scripts
- Add missing mcp_server log directory
- Add shellcheck source directive for static analysis
- Move CLP_HOME to individual scripts for better visibility
- Remove redundant 2>/dev/null from helm uninstall with
  --ignore-not-found
- Reorder functions in common file by call order
@junhaoliao junhaoliao force-pushed the helm-proper-multi-node branch from 3734d5e to 2b0e863 Compare January 10, 2026 07:42
Contributor

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @tools/deployment/package-helm/templates/_helpers.tpl:
- Around line 318-328: The template define "clp.createSchedulingConfigs"
currently unconditionally appends a control-plane toleration to $tolerations
when .root.Values.distributed_deployment is false, causing potential duplicate
tolerations; update the logic to check $tolerations for an existing entry with
key "node-role.kubernetes.io/control-plane" (and matching operator "Exists" and
effect "NoSchedule") before appending, or deduplicate $tolerations after
construction so the control-plane toleration is only present once; modify the
block that manipulates $tolerations (the $tolerations variable and the
conditional on .root.Values.distributed_deployment) to perform this existence
check or deduplication.

In @tools/deployment/package-helm/values.yaml:
- Around line 21-27: Add a Helm template validation that fails chart rendering
when distributed_deployment is false but the relevant replica counts are >1;
specifically check .Values.distributed_deployment and cast
.Values.compressionWorker.replicas to int and call fail if it is > 1 (you can
add this check into _helpers.tpl or at the top of the compression worker
deployment template), and repeat similar checks for any other charts that
require replicas=1 in single-node mode so the chart errors out early instead of
producing pods that remain Pending.
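The first fix above (deduplicating the control-plane toleration) might look like the following template fragment; this is a sketch only, since the exact surrounding context of `clp.createSchedulingConfigs` in `_helpers.tpl` is assumed:

```yaml
{{- /* Sketch: append the control-plane toleration only when absent. */ -}}
{{- if not .root.Values.distributed_deployment }}
  {{- $hasControlPlaneToleration := false }}
  {{- range $tolerations }}
    {{- if eq (default "" .key) "node-role.kubernetes.io/control-plane" }}
      {{- $hasControlPlaneToleration = true }}
    {{- end }}
  {{- end }}
  {{- if not $hasControlPlaneToleration }}
    {{- $tolerations = append $tolerations (dict
        "key" "node-role.kubernetes.io/control-plane"
        "operator" "Exists"
        "effect" "NoSchedule") }}
  {{- end }}
{{- end }}
```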
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bdc0207 and d642a48.

📒 Files selected for processing (2)
  • tools/deployment/package-helm/templates/_helpers.tpl
  • tools/deployment/package-helm/values.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-19T05:03:42.629Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1818
File: tools/deployment/package-helm/templates/configmap.yaml:12-12
Timestamp: 2025-12-19T05:03:42.629Z
Learning: In the y-scope/clp Helm chart, the `api_server.host` configuration field in the ConfigMap (tools/deployment/package-helm/templates/configmap.yaml) is unused because no other k8s internal services need to reach the API server—it's only accessed from outside the cluster via NodePort.

Applied to files:

  • tools/deployment/package-helm/values.yaml
🔇 Additional comments (5)
tools/deployment/package-helm/templates/_helpers.tpl (2)

143-157: Well-structured distributed deployment branching.

The conditional logic correctly differentiates between distributed mode (hostPath with DirectoryOrCreate) and single-node mode (local volume with control-plane nodeAffinity). The DirectoryOrCreate type is appropriate for ensuring the path exists on all nodes in a distributed setup.


179-179: LGTM!

The storageClassName configuration is now consistent between PV and PVC templates.

tools/deployment/package-helm/values.yaml (3)

32-47: LGTM!

The compressionWorker configuration with replicas and commented scheduling examples provides clear guidance for users configuring multi-node deployments.


48-78: Consistent worker configuration structure.

The queryWorker and reducer blocks follow the same pattern as compressionWorker, maintaining consistency across worker types.


80-84: LGTM!

The storage configuration with clear documentation about the "local-storage" default behaviour and custom StorageClass requirements is well-structured.

Comment on lines 21 to 27
# Deployment mode:
# - distributed_deployment=false: Single-node deployment. PVs use local volumes bound to one node.
# Pods automatically tolerate control-plane taints. Only works with worker replicas=1.
# - distributed_deployment=true: Multi-node deployment. PVs use hostPath without node affinity,
# assuming unmanaged shared storage (e.g., NFS/CephFS mounted via /etc/fstab) at the same path on
# all nodes.
distributed_deployment: false
Contributor
🧹 Nitpick | 🔵 Trivial

Consider adding validation for replicas constraint in single-node mode.

The documentation correctly states that distributed_deployment=false only works with replicas=1, but there's no enforcement in the templates. Users could accidentally configure replicas > 1 with single-node mode, leading to pods stuck in Pending state due to PV node affinity conflicts.

Consider adding a validation check in the templates:

💡 Example validation in a template
{{- if and (not .Values.distributed_deployment) (gt (int .Values.compressionWorker.replicas) 1) }}
  {{- fail "distributed_deployment=false requires compressionWorker.replicas=1" }}
{{- end }}

This could be added to _helpers.tpl as a validation template or at the top of the relevant deployment templates.

🤖 Prompt for AI Agents
In @tools/deployment/package-helm/values.yaml around lines 21 - 27, Add a Helm
template validation that fails chart rendering when distributed_deployment is
false but the relevant replica counts are >1; specifically check
.Values.distributed_deployment and cast .Values.compressionWorker.replicas to
int and call fail if it is > 1 (you can add this check into _helpers.tpl or at
the top of the compression worker deployment template), and repeat similar
checks for any other charts that require replicas=1 in single-node mode so the
chart errors out early instead of producing pods that remain Pending.
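Extending the single-replica check to all worker types could be done with one named validation template; a sketch under the assumption that the template name and component list below match the chart's values layout:

```yaml
{{- define "clp.validateSingleNodeReplicas" -}}
{{- if not .Values.distributed_deployment }}
  {{- range $component := list "compressionWorker" "queryWorker" "reducer" }}
    {{- $replicas := int (index $.Values $component "replicas") }}
    {{- if gt $replicas 1 }}
      {{- fail (printf "distributed_deployment=false requires %s.replicas=1" $component) }}
    {{- end }}
  {{- end }}
{{- end }}
{{- end -}}
```

Including this template (e.g., via `include` at the top of each worker deployment) makes `helm install`/`helm template` fail fast instead of leaving pods Pending.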

# Conflicts:
#	tools/deployment/package-helm/Chart.yaml
#	tools/deployment/package-helm/test.sh
Contributor

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @tools/deployment/package-helm/.set-up-common.sh:
- Around line 22-35: The SAMPLE_DOWNLOAD_PID global is implicitly created in
download_samples() and later consumed by wait_for_cluster_ready(), so declare
and initialize SAMPLE_DOWNLOAD_PID at script scope (near the top of the file)
with a short comment describing its purpose to make the coupling explicit;
update or add a one-line docstring/comment on wait_for_cluster_ready() to
mention it relies on SAMPLE_DOWNLOAD_PID being set by download_samples(), and
initialize the variable to a safe default (e.g., empty or 0) so checks against
it are robust.
- Around line 117-141: The loop currently sleeps at the top (sleep
"${poll_interval_seconds}") which delays the first status check; move the sleep
call to the end of the while true loop so the code runs the kubectl checks (the
kubectl wait job --all ... and kubectl wait pods --selector='!job-name' ...)
immediately on first iteration and only sleeps before the next iteration,
keeping the timeout logic (if [[ ${SECONDS} -ge ${timeout_seconds} ]]) and the
echo "---" behavior unchanged.
- Around line 144-154: The wait_for_cluster_ready function uses hardcoded
timeouts when calling wait_for_pods (300 5 5); change wait_for_cluster_ready to
accept optional parameters (e.g., pod_timeout, pod_interval, pod_retry) with
sensible defaults and pass those through to wait_for_pods, and update any
callers to use defaults when not provided; reference the function
wait_for_cluster_ready and the call to wait_for_pods to locate where to add
parameters and default handling.

In @tools/deployment/package-helm/values.yaml:
- Around line 21-28: Rename the configuration key distributed_deployment to
distributedDeployment to match the file's camelCase convention; update the
values.yaml entry and every Helm template reference (e.g., occurrences in
templates that read .Values.distributed_deployment) to use
.Values.distributedDeployment, and ensure any documentation or defaults
consuming this value are adjusted accordingly so behavior remains unchanged.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d642a48 and f5a28f9.

📒 Files selected for processing (2)
  • tools/deployment/package-helm/.set-up-common.sh
  • tools/deployment/package-helm/values.yaml
🧰 Additional context used
🧠 Learnings (14)
📚 Learning: 2025-12-19T05:03:42.629Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1818
File: tools/deployment/package-helm/templates/configmap.yaml:12-12
Timestamp: 2025-12-19T05:03:42.629Z
Learning: In the y-scope/clp Helm chart, the `api_server.host` configuration field in the ConfigMap (tools/deployment/package-helm/templates/configmap.yaml) is unused because no other k8s internal services need to reach the API server—it's only accessed from outside the cluster via NodePort.

Applied to files:

  • tools/deployment/package-helm/values.yaml
📚 Learning: 2025-08-19T14:41:28.901Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/general.py:0-0
Timestamp: 2025-08-19T14:41:28.901Z
Learning: In the CLP codebase, prefer explicit failures over automatic directory creation in utility functions like dump_config. The user junhaoliao prefers to let file operations fail when parent directories don't exist, as this helps catch implementation errors during development rather than masking setup issues with automatic directory creation.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-05-06T10:07:04.654Z
Learnt from: kirkrodrigues
Repo: y-scope/clp PR: 881
File: components/core/tools/scripts/lib_install/macos/install-all.sh:11-12
Timestamp: 2025-05-06T10:07:04.654Z
Learning: In CLP installation scripts, temporary directories with downloaded files should not be automatically cleaned up on failure (e.g., with EXIT traps) to preserve artifacts for debugging purposes.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-07-01T14:51:19.172Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-packages-from-source.sh:6-8
Timestamp: 2025-07-01T14:51:19.172Z
Learning: In CLP installation scripts within `components/core/tools/scripts/lib_install/`, maintain consistency with existing variable declaration patterns across platforms rather than adding individual improvements like `readonly` declarations.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2024-11-18T16:49:20.248Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-07-07T17:43:04.349Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-aarch64/build.sh:3-5
Timestamp: 2025-07-07T17:43:04.349Z
Learning: In CLP project build scripts (specifically build.sh files in docker-images directories), maintain consistency with the established pattern of using separate `set -eu` and `set -o pipefail` commands rather than combining them into `set -euo pipefail`, to ensure uniform script structure across all platform build scripts.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-10-13T03:32:19.293Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-09-04T12:26:54.788Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1170
File: tools/deployment/presto-clp/scripts/generate-metadata-filter-file.py:0-0
Timestamp: 2025-09-04T12:26:54.788Z
Learning: The `set-up-config.sh` script in tools/deployment/presto-clp/scripts/ ensures that the output directory for metadata-filter.json already exists before calling generate-metadata-filter-file.py, so directory creation in the Python script is not strictly necessary.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-08-25T06:29:59.610Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: .github/workflows/clp-core-build.yaml:294-332
Timestamp: 2025-08-25T06:29:59.610Z
Learning: In the CLP project, Bill-hbrhbr prefers a "fail fast" approach for CI workflows - allowing potential command availability issues (like getconf in musllinux) to surface through CI failures rather than preemptively adding fallback logic, as they will fix issues when they occur.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-08-25T00:45:05.464Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1242
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:39-41
Timestamp: 2025-08-25T00:45:05.464Z
Learning: Task v3.44.1 has a regression that breaks shell command processing, particularly rsync commands with brace expansion (e.g., `file.{d.ts,js,wasm}`). This causes CI failures in clp-ffi-js project (issue #110), so CLP should avoid v3.44.1 and use v3.44.0 instead, which fixes the dynamic variable bug without the shell processing regression.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-05-06T09:48:55.408Z
Learnt from: kirkrodrigues
Repo: y-scope/clp PR: 881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-07-01T14:52:02.418Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/docker-images/clp-env-base-musllinux_1_2-x86/build.sh:18-24
Timestamp: 2025-07-01T14:52:02.418Z
Learning: In the CLP project, consistency across platform build scripts is prioritized over defensive programming when it comes to git remote handling. All build.sh files in docker-images directories should follow the same pattern for git metadata injection.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
📚 Learning: 2025-08-20T09:01:25.825Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: integration-tests/tests/fixtures/integration_test_logs.py:60-71
Timestamp: 2025-08-20T09:01:25.825Z
Learning: For integration tests using curl downloads in PR #1100, curl's built-in timeout and retry mechanisms (with --fail and --show-error flags) are sufficient, and additional subprocess timeout parameters are not needed.

Applied to files:

  • tools/deployment/package-helm/.set-up-common.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: check-generated
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (6)
tools/deployment/package-helm/values.yaml (2)

32-79: LGTM!

The worker and reducer configuration blocks are well-structured. The commented scheduling examples provide helpful guidance for users wanting to configure node selectors, tolerations, and topology spread constraints.


80-85: LGTM!

The storage configuration with a default storageClassName of "local-storage" aligns with the chart's conditional StorageClass creation. The comment clearly explains the behaviour difference between using the default vs. a custom storage class.

tools/deployment/package-helm/.set-up-common.sh (4)

1-9: LGTM!

Good use of strict Bash error handling with errexit, nounset, and pipefail. The header comment clearly indicates this is meant to be sourced.


10-20: LGTM!

The directory creation is well-organized using brace expansion for conciseness.


37-48: LGTM!

The function correctly handles cleanup of existing clusters and delegates to helper functions. The 2>/dev/null || true pattern for the cluster deletion is appropriate for idempotent setup.


50-97: LGTM!

The Kind cluster configuration generation is well-structured. The control-plane node includes comprehensive port mappings, and worker nodes are generated dynamically with consistent mount configurations.

Comment on lines +22 to +35
# Downloads sample datasets in the background
# Sets SAMPLE_DOWNLOAD_PID global variable
download_samples() {
echo "Downloading sample datasets..."
wget -O - https://zenodo.org/records/10516402/files/postgresql.tar.gz?download=1 \
| tar xz -C "$CLP_HOME/samples" &
SAMPLE_DOWNLOAD_PID=$!

# Generate sample log file for garbage collector testing
cat <<EOF > "$CLP_HOME/samples/test-gc.jsonl"
{"timestamp": $(date +%s%3N), "level": "INFO", "message": "User login successful"}
{"timestamp": $(date +%s%3N), "level": "ERROR", "message": "Database connection failed"}
EOF
}
Contributor
🧹 Nitpick | 🔵 Trivial

Consider declaring the global variable at script scope.

SAMPLE_DOWNLOAD_PID is set here and used in wait_for_cluster_ready(). This implicit coupling could be made clearer by declaring the variable at script scope (e.g., near the top of the file with a comment) or by documenting the dependency in the function's docstring.

🤖 Prompt for AI Agents
In @tools/deployment/package-helm/.set-up-common.sh around lines 22 - 35, The
SAMPLE_DOWNLOAD_PID global is implicitly created in download_samples() and later
consumed by wait_for_cluster_ready(), so declare and initialize
SAMPLE_DOWNLOAD_PID at script scope (near the top of the file) with a short
comment describing its purpose to make the coupling explicit; update or add a
one-line docstring/comment on wait_for_cluster_ready() to mention it relies on
SAMPLE_DOWNLOAD_PID being set by download_samples(), and initialize the variable
to a safe default (e.g., empty or 0) so checks against it are robust.
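A minimal sketch of the suggested script-scope declaration plus a guard; the `wait_for_download` helper name is hypothetical (the real script waits inside `wait_for_cluster_ready`):

```shell
# PID of the background sample download started by download_samples();
# consumed by wait_for_cluster_ready(). Empty until the download starts.
SAMPLE_DOWNLOAD_PID=""

wait_for_download() {
    if [[ -z "${SAMPLE_DOWNLOAD_PID}" ]]; then
        echo "ERROR: download_samples() was never called" >&2
        return 1
    fi
    wait "${SAMPLE_DOWNLOAD_PID}"
}
```

Initializing the variable up front makes the coupling between the two functions explicit and keeps `set -u` (nounset) from tripping if the wait runs before the download starts.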

Comment on lines +117 to +141
while true; do
sleep "${poll_interval_seconds}"
kubectl get pods

if kubectl wait job \
--all \
--for=condition=Complete \
--timeout="${wait_timeout_seconds}s" 2>/dev/null \
&& kubectl wait pods \
--all \
--selector='!job-name' \
--for=condition=Ready \
--timeout="${wait_timeout_seconds}s" 2>/dev/null
then
echo "All jobs completed and services are ready."
return 0
fi

if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
echo "ERROR: Timed out waiting for pods to be ready"
return 1
fi

echo "---"
done
Contributor
🧹 Nitpick | 🔵 Trivial

Consider moving sleep after the status check.

Placing sleep at the start of the loop (Line 118) adds an unnecessary delay before the first status check. Moving it to the end of the loop would allow faster completion detection when pods are already ready.

♻️ Suggested fix
     while true; do
-        sleep "${poll_interval_seconds}"
         kubectl get pods

         if kubectl wait job \
                 --all \
                 --for=condition=Complete \
                 --timeout="${wait_timeout_seconds}s" 2>/dev/null \
             && kubectl wait pods \
                 --all \
                 --selector='!job-name' \
                 --for=condition=Ready \
                 --timeout="${wait_timeout_seconds}s" 2>/dev/null
         then
             echo "All jobs completed and services are ready."
             return 0
         fi

         if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
             echo "ERROR: Timed out waiting for pods to be ready"
             return 1
         fi

         echo "---"
+        sleep "${poll_interval_seconds}"
     done
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/.set-up-common.sh around lines 117 - 141, The
loop currently sleeps at the top (sleep "${poll_interval_seconds}") which delays
the first status check; move the sleep call to the end of the while true loop so
the code runs the kubectl checks (the kubectl wait job --all ... and kubectl
wait pods --selector='!job-name' ...) immediately on first iteration and only
sleeps before the next iteration, keeping the timeout logic (if [[ ${SECONDS}
-ge ${timeout_seconds} ]]) and the echo "---" behavior unchanged.

Comment on lines +144 to +154
# Waits for sample download to complete and all pods to be ready
wait_for_cluster_ready() {
    if wait "$SAMPLE_DOWNLOAD_PID"; then
        echo "Sample download and extraction complete"
    else
        echo "ERROR: Sample download failed"
        return 1
    fi

    wait_for_pods 300 5 5
}
Contributor

🧹 Nitpick | 🔵 Trivial

Consider parameterizing timeout values.

The hardcoded values 300 5 5 in the wait_for_pods call may not be suitable for all environments (e.g., slower CI runners or larger clusters). Consider accepting them as optional parameters with defaults.

♻️ Suggested fix
 # Waits for sample download to complete and all pods to be ready
-wait_for_cluster_ready() {
+# @param {int} timeout_seconds Overall timeout (default: 300)
+# @param {int} poll_interval_seconds Poll interval (default: 5)
+# @param {int} wait_timeout_seconds kubectl wait timeout (default: 5)
+wait_for_cluster_ready() {
+    local timeout_seconds=${1:-300}
+    local poll_interval_seconds=${2:-5}
+    local wait_timeout_seconds=${3:-5}
+
     if wait "$SAMPLE_DOWNLOAD_PID"; then
         echo "Sample download and extraction complete"
     else
         echo "ERROR: Sample download failed"
         return 1
     fi

-    wait_for_pods 300 5 5
+    wait_for_pods "$timeout_seconds" "$poll_interval_seconds" "$wait_timeout_seconds"
 }
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/.set-up-common.sh around lines 144 - 154, The
wait_for_cluster_ready function uses hardcoded timeouts when calling
wait_for_pods (300 5 5); change wait_for_cluster_ready to accept optional
parameters (e.g., pod_timeout, pod_interval, pod_retry) with sensible defaults
and pass those through to wait_for_pods, and update any callers to use defaults
when not provided; reference the function wait_for_cluster_ready and the call to
wait_for_pods to locate where to add parameters and default handling.
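The pattern the suggested fix relies on, positional parameters with `${n:-default}` fallbacks, can be sketched in isolation. The function name and values below are illustrative, not part of the chart's scripts:

```shell
# Illustrative sketch of bash positional parameters with defaults, the same
# mechanism the suggested fix applies to wait_for_cluster_ready.
describe_timeouts() {
    local timeout_seconds=${1:-300}
    local poll_interval_seconds=${2:-5}
    local wait_timeout_seconds=${3:-5}
    echo "timeout=${timeout_seconds} poll=${poll_interval_seconds} wait=${wait_timeout_seconds}"
}

describe_timeouts            # all defaults
describe_timeouts 600        # override only the overall timeout
describe_timeouts 600 10 30  # override everything
```

Callers that pass no arguments keep today's behaviour, so the change is backward compatible.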

Comment on lines 21 to 28
# Deployment mode:
# - distributed_deployment=false: Single-node deployment. PVs use local volumes bound to one node.
#   Pods automatically tolerate control-plane taints. Only works with worker replicas=1.
# - distributed_deployment=true: Multi-node deployment. PVs use hostPath without node affinity,
#   assuming unmanaged shared storage (e.g., NFS/CephFS mounted via /etc/fstab) at the same path on
#   all nodes.
distributed_deployment: false


🧹 Nitpick | 🔵 Trivial

Naming convention inconsistency.

The key distributed_deployment uses snake_case, while other new keys in this file (workerConcurrency, compressionWorker, queryWorker, storageClassName) use camelCase. Consider renaming to distributedDeployment for consistency.

♻️ Suggested fix
-# - distributed_deployment=false: Single-node deployment. PVs use local volumes bound to one node.
+# - distributedDeployment=false: Single-node deployment. PVs use local volumes bound to one node.
 #   Pods automatically tolerate control-plane taints. Only works with worker replicas=1.
-# - distributed_deployment=true: Multi-node deployment. PVs use hostPath without node affinity,
+# - distributedDeployment=true: Multi-node deployment. PVs use hostPath without node affinity,
 #   assuming unmanaged shared storage (e.g., NFS/CephFS mounted via /etc/fstab) at the same path on
 #   all nodes.
-distributed_deployment: false
+distributedDeployment: false

Note: This would also require updating all template references to this value.

🤖 Prompt for AI Agents
In @tools/deployment/package-helm/values.yaml around lines 21 - 28, Rename the
configuration key distributed_deployment to distributedDeployment to match the
file's camelCase convention; update the values.yaml entry and every Helm
template reference (e.g., occurrences in templates that read
.Values.distributed_deployment) to use .Values.distributedDeployment, and ensure
any documentation or defaults consuming this value are adjusted accordingly so
behavior remains unchanged.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In @tools/deployment/package-helm/set-up-multi-dedicated-test.sh:
- Around line 36-49: The script labels nodes using the worker_nodes array but
lacks bounds checks and can index past its length; before both loops validate
that ${#worker_nodes[@]} >= total_workers (or at least >= NUM_COMPRESSION_NODES
for the first loop and >= total_workers for the second) and abort or adjust
total_workers to the actual count if not, and within each iteration check that
worker_nodes[i] is non-empty before calling kubectl label node (the symbols to
update are worker_nodes, NUM_COMPRESSION_NODES, total_workers and the kubectl
label node calls).
- Around line 52-53: Replace the fragile sleep 2 after the helm uninstall test
with an explicit wait that polls for release/resource removal: after running
helm uninstall test, loop until helm no longer lists the release (or until
kubectl shows zero resources with the test release label) with a short
sleep+timeout to avoid infinite waits; update the script lines that currently
call "helm uninstall test" and "sleep 2" to perform this polling-based check
instead.

In @tools/deployment/package-helm/set-up-multi-shared-test.sh:
- Line 5: The TODO in the script indicates this setup should be migrated into an
integration test; create a new GitHub issue (e.g., "Migrate
set-up-multi-shared-test.sh into integration test") with acceptance criteria and
link it to the project board, then update the TODO comment in
set-up-multi-shared-test.sh to reference the created issue number/URL and a
short next step (owner + milestone) so the intent is tracked and actionable.
- Around line 1-6: Add bash strict mode to the script by enabling "set -euo
pipefail" (and optionally set IFS=$'\n\t') immediately after the shebang in
set-up-multi-shared-test.sh so the script exits on errors, treats unset
variables as failures, and fails pipelines properly; update any code that relies
on unset variables to define defaults or handle errors to avoid breaking under
strict mode.

In @tools/deployment/package-helm/templates/_helpers.tpl:
- Around line 318-328: The helper "clp.createSchedulingConfigs" unconditionally
appends a control-plane toleration to $tolerations when
.root.Values.distributedDeployment is false, which can produce duplicates if the
user already provided the same toleration; update the logic to check existing
$tolerations for a toleration with key "node-role.kubernetes.io/control-plane"
(and matching operator "Exists" and effect "NoSchedule") before appending so you
only add the control-plane entry when it is not already present.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5a28f9 and 84c951b.

📒 Files selected for processing (6)
  • tools/deployment/package-helm/set-up-multi-dedicated-test.sh
  • tools/deployment/package-helm/set-up-multi-shared-test.sh
  • tools/deployment/package-helm/templates/_helpers.tpl
  • tools/deployment/package-helm/templates/log-ingestor-logs-pv.yaml
  • tools/deployment/package-helm/templates/mcp-server-logs-pv.yaml
  • tools/deployment/package-helm/values.yaml
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-07T17:41:15.655Z
Learnt from: jackluo923
Repo: y-scope/clp PR: 1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.

Applied to files:

  • tools/deployment/package-helm/set-up-multi-dedicated-test.sh
📚 Learning: 2025-12-19T05:03:42.629Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1818
File: tools/deployment/package-helm/templates/configmap.yaml:12-12
Timestamp: 2025-12-19T05:03:42.629Z
Learning: In the y-scope/clp Helm chart, the `api_server.host` configuration field in the ConfigMap (tools/deployment/package-helm/templates/configmap.yaml) is unused because no other k8s internal services need to reach the API server—it's only accessed from outside the cluster via NodePort.

Applied to files:

  • tools/deployment/package-helm/values.yaml
🧬 Code graph analysis (1)
tools/deployment/package-helm/set-up-multi-dedicated-test.sh (1)
tools/deployment/package-helm/.set-up-common.sh (3)
  • prepare_environment (39-48)
  • generate_kind_config (53-97)
  • wait_for_cluster_ready (145-154)
🪛 Shellcheck (0.11.0)
tools/deployment/package-helm/set-up-multi-dedicated-test.sh

[info] 18-18: Not following: .set-up-common.sh was not specified as input (see shellcheck -x).

(SC1091)

tools/deployment/package-helm/set-up-multi-shared-test.sh

[info] 17-17: Not following: .set-up-common.sh was not specified as input (see shellcheck -x).

(SC1091)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: check-generated
🔇 Additional comments (9)
tools/deployment/package-helm/set-up-multi-shared-test.sh (1)

35-41: LGTM!

The Helm installation with configurable replicas via --set flags is clean and aligns well with the multi-node deployment feature. The use of functions from the common module (prepare_environment, generate_kind_config, wait_for_cluster_ready) promotes code reuse across setup scripts.

tools/deployment/package-helm/set-up-multi-dedicated-test.sh (1)

1-62: LGTM overall!

The script is well-structured with clear comments, proper environment variable defaults, and appropriate use of the common setup functions. The TODO comment on line 5 appropriately indicates this is temporary scaffolding to be replaced by integration tests.

tools/deployment/package-helm/templates/_helpers.tpl (2)

143-156: Clear and correct branching for deployment modes.

The conditional logic properly distinguishes between distributed deployment (hostPath for shared storage) and single-node deployment (local volume with control-plane node affinity). This aligns well with the documented behaviour.


329-344: Scheduling configuration helper is well-designed.

The helper properly uses with blocks to conditionally render each scheduling field only when defined, avoiding empty YAML keys. The default handling via | default dict and | default list ensures robustness against nil values.

tools/deployment/package-helm/values.yaml (3)

21-27: Excellent documentation for deployment modes.

The inline comments clearly explain the single-node vs multi-node behaviour and the storage assumptions for each mode. This will help users understand the implications of changing distributedDeployment.


32-78: Well-structured worker configuration blocks.

The commented scheduling examples provide clear guidance for users who want to customize node placement without cluttering the defaults. The consistent structure across compressionWorker, queryWorker, and reducer makes configuration intuitive.


80-84: Storage class configuration is well-documented.

The comment clearly explains that "local-storage" triggers chart-managed StorageClass creation, while other values require pre-existing resources. This prevents confusion during deployment.

tools/deployment/package-helm/templates/log-ingestor-logs-pv.yaml (1)

1-10: LGTM!

Clean migration to clp.createStaticPv. The removal of the nodeRole parameter aligns with the new helper design that derives node affinity from distributedDeployment instead.

tools/deployment/package-helm/templates/mcp-server-logs-pv.yaml (1)

1-10: LGTM!

Consistent with the chart-wide migration to clp.createStaticPv. The conditional gating for mcp_server is preserved correctly.

Comment on lines +36 to +49
echo "Labeling worker nodes..."
mapfile -t worker_nodes < <(kubectl get nodes --selector='!node-role.kubernetes.io/control-plane' -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n')

# Label compression nodes
for ((i = 0; i < NUM_COMPRESSION_NODES; i++)); do
    echo "Labeling ${worker_nodes[$i]} as compression node"
    kubectl label node "${worker_nodes[$i]}" yscope.io/nodeType=compression --overwrite
done

# Label query nodes
for ((i = NUM_COMPRESSION_NODES; i < total_workers; i++)); do
    echo "Labeling ${worker_nodes[$i]} as query node"
    kubectl label node "${worker_nodes[$i]}" yscope.io/nodeType=query --overwrite
done

⚠️ Potential issue | 🟡 Minor

Add bounds checking before accessing worker_nodes array.

If the cluster has fewer worker nodes than expected (e.g., NUM_COMPRESSION_NODES + NUM_QUERY_NODES), accessing worker_nodes[$i] will silently fail or produce empty values. Consider validating that the array contains the expected number of nodes.

🔧 Suggested fix
 echo "Labeling worker nodes..."
 mapfile -t worker_nodes < <(kubectl get nodes --selector='!node-role.kubernetes.io/control-plane' -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n')
 
+if [[ ${#worker_nodes[@]} -lt ${total_workers} ]]; then
+    echo "ERROR: Expected ${total_workers} worker nodes, but found ${#worker_nodes[@]}"
+    exit 1
+fi
+
 # Label compression nodes
 for ((i = 0; i < NUM_COMPRESSION_NODES; i++)); do
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/set-up-multi-dedicated-test.sh around lines 36
- 49, The script labels nodes using the worker_nodes array but lacks bounds
checks and can index past its length; before both loops validate that
${#worker_nodes[@]} >= total_workers (or at least >= NUM_COMPRESSION_NODES for
the first loop and >= total_workers for the second) and abort or adjust
total_workers to the actual count if not, and within each iteration check that
worker_nodes[i] is non-empty before calling kubectl label node (the symbols to
update are worker_nodes, NUM_COMPRESSION_NODES, total_workers and the kubectl
label node calls).
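The bounds check described above can be sketched without a cluster by stubbing the node list; `validate_worker_count` is a hypothetical helper, not part of the actual script:

```shell
# Hypothetical helper: fail fast when fewer worker nodes exist than expected,
# instead of silently indexing past the end of the array.
validate_worker_count() {
    local found="$1" expected="$2"
    if (( found < expected )); then
        echo "ERROR: Expected ${expected} worker nodes, but found ${found}" >&2
        return 1
    fi
    return 0
}

# Stub node list standing in for the output of `kubectl get nodes`.
worker_nodes=(node-a node-b node-c)
NUM_COMPRESSION_NODES=2
NUM_QUERY_NODES=2
total_workers=$((NUM_COMPRESSION_NODES + NUM_QUERY_NODES))

if ! validate_worker_count "${#worker_nodes[@]}" "${total_workers}"; then
    echo "Aborting before any node is labeled."
fi
```

With three stub nodes and total_workers=4, the check fails before either labeling loop would run, which is exactly the failure mode the review wants surfaced.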

Comment on lines +52 to +53
helm uninstall test --ignore-not-found
sleep 2

🧹 Nitpick | 🔵 Trivial

Consider replacing the magic sleep with a more robust wait.

The sleep 2 after helm uninstall is a timing-based workaround. While acceptable for a test script, a more robust approach would wait for resources to be fully removed.

♻️ Alternative approach
 helm uninstall test --ignore-not-found
-sleep 2
+helm uninstall test --ignore-not-found --wait 2>/dev/null || true
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/set-up-multi-dedicated-test.sh around lines 52
- 53, Replace the fragile sleep 2 after the helm uninstall test with an explicit
wait that polls for release/resource removal: after running helm uninstall test,
loop until helm no longer lists the release (or until kubectl shows zero
resources with the test release label) with a short sleep+timeout to avoid
infinite waits; update the script lines that currently call "helm uninstall
test" and "sleep 2" to perform this polling-based check instead.
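The polling-based wait described in the prompt can be sketched as follows. The check command is injected so the loop can be exercised without a cluster; in the real script it would be something like 'helm status test >/dev/null 2>&1', and `wait_for_release_removal` is a hypothetical helper:

```shell
# Hypothetical polling helper: loop until the injected check stops succeeding,
# with a timeout so the wait can never hang forever.
wait_for_release_removal() {
    local check_cmd="$1"                 # succeeds while the release still exists
    local timeout_seconds="${2:-60}"
    local poll_interval_seconds="${3:-2}"
    local elapsed=0
    while eval "$check_cmd"; do
        if (( elapsed >= timeout_seconds )); then
            echo "ERROR: Timed out waiting for release removal" >&2
            return 1
        fi
        sleep "$poll_interval_seconds"
        elapsed=$((elapsed + poll_interval_seconds))
    done
    echo "Release removed."
}

# Stub check: pretend the release disappears after two polls.
polls=0
release_still_exists() { polls=$((polls + 1)); [ "$polls" -le 2 ]; }
wait_for_release_removal release_still_exists 10 0
```

Unlike a fixed `sleep 2`, this returns as soon as the release is gone and fails loudly if cleanup stalls.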

Comment on lines +1 to +6
#!/usr/bin/env bash

# Multi-node cluster setup with shared worker nodes
# Both compression and query workers share the same node pool
# TODO: Migrate into integration test


🧹 Nitpick | 🔵 Trivial

Consider enabling bash strict mode for reliability.

The script lacks set -euo pipefail which would cause the script to fail fast on errors, undefined variables, or pipeline failures. This is particularly important for setup scripts where partial execution can leave the environment in an inconsistent state.

Proposed fix
 #!/usr/bin/env bash
 
+set -euo pipefail
+
 # Multi-node cluster setup with shared worker nodes
 # Both compression and query workers share the same node pool
 # TODO: Migrate into integration test
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/set-up-multi-shared-test.sh around lines 1 -
6, Add bash strict mode to the script by enabling "set -euo pipefail" (and
optionally set IFS=$'\n\t') immediately after the shebang in
set-up-multi-shared-test.sh so the script exits on errors, treats unset
variables as failures, and fails pipelines properly; update any code that relies
on unset variables to define defaults or handle errors to avoid breaking under
strict mode.
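The failure modes that `set -euo pipefail` turns into hard errors can be demonstrated in throwaway subshells (the variable name below is illustrative):

```shell
# -u: referencing an unset variable aborts the script.
bash -c 'set -euo pipefail; echo "$NOT_DEFINED"' 2>/dev/null \
    || echo "unset variable rejected"

# pipefail + -e: a failure anywhere in a pipeline aborts the script.
bash -c 'set -euo pipefail; false | true; echo "unreachable"' 2>/dev/null \
    || echo "pipeline failure propagated"

# Without strict mode, both problems slip through silently.
bash -c 'echo "$NOT_DEFINED" >/dev/null; false | true; echo "no error surfaced"'
```

For a setup script, failing at the first broken step is usually preferable to continuing with a half-configured cluster.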


# Multi-node cluster setup with shared worker nodes
# Both compression and query workers share the same node pool
# TODO: Migrate into integration test

🧹 Nitpick | 🔵 Trivial

Acknowledge TODO for integration test migration.

The TODO comment indicates this is a temporary setup script. If you'd like, I can help create a GitHub issue to track migrating this into a proper integration test.

🤖 Prompt for AI Agents
In @tools/deployment/package-helm/set-up-multi-shared-test.sh at line 5, The
TODO in the script indicates this setup should be migrated into an integration
test; create a new GitHub issue (e.g., "Migrate set-up-multi-shared-test.sh into
integration test") with acceptance criteria and link it to the project board,
then update the TODO comment in set-up-multi-shared-test.sh to reference the
created issue number/URL and a short next step (owner + milestone) so the intent
is tracked and actionable.

Comment on lines +33 to +34
helm uninstall test --ignore-not-found
sleep 2

🧹 Nitpick | 🔵 Trivial

Fixed sleep is fragile; consider a retry loop or event-based wait.

The sleep 2 after helm uninstall assumes 2 seconds is always sufficient for cleanup. While acceptable for a temporary setup script, a more robust approach would wait for the release to be fully removed.

Alternative approach (optional)
 helm uninstall test --ignore-not-found
-sleep 2
+helm uninstall test --ignore-not-found --wait 2>/dev/null || true

The --wait flag makes Helm wait for resources to be deleted before returning.

Comment on lines +318 to +328
{{- define "clp.createSchedulingConfigs" -}}
{{- $componentConfig := index .root.Values .component | default dict -}}
{{- $scheduling := $componentConfig.scheduling | default dict -}}
{{- $tolerations := $scheduling.tolerations | default list -}}
{{- if not .root.Values.distributedDeployment -}}
{{- $tolerations = append $tolerations (dict
    "key" "node-role.kubernetes.io/control-plane"
    "operator" "Exists"
    "effect" "NoSchedule"
) -}}
{{- end -}}

🧹 Nitpick | 🔵 Trivial

Potential duplicate toleration if user already specifies control-plane toleration.

When distributedDeployment=false, the helper unconditionally appends a control-plane toleration. If a user has already specified the same toleration in their values, it will be duplicated. While Kubernetes tolerates duplicates, it's cleaner to avoid them.

♻️ Optional: Deduplicate tolerations

You could check if the toleration already exists before appending, though this adds complexity. Given that duplicates are harmless, this is a minor improvement:

 {{- $tolerations := $scheduling.tolerations | default list -}}
 {{- if not .root.Values.distributedDeployment -}}
+{{- $hasCPToleration := false -}}
+{{- range $tolerations -}}
+{{- if and (eq (index . "key") "node-role.kubernetes.io/control-plane") (eq (index . "operator") "Exists") -}}
+{{- $hasCPToleration = true -}}
+{{- end -}}
+{{- end -}}
+{{- if not $hasCPToleration -}}
 {{- $tolerations = append $tolerations (dict
     "key" "node-role.kubernetes.io/control-plane"
     "operator" "Exists"
     "effect" "NoSchedule"
 ) -}}
+{{- end -}}
 {{- end -}}
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/templates/_helpers.tpl around lines 318 - 328,
The helper "clp.createSchedulingConfigs" unconditionally appends a control-plane
toleration to $tolerations when .root.Values.distributedDeployment is false,
which can produce duplicates if the user already provided the same toleration;
update the logic to check existing $tolerations for a toleration with key
"node-role.kubernetes.io/control-plane" (and matching operator "Exists" and
effect "NoSchedule") before appending so you only add the control-plane entry
when it is not already present.

junhaoliao added a commit to junhaoliao/clp that referenced this pull request Jan 13, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tools/deployment/package-helm/values.yaml`:
- Around line 34-49: The commented scheduling examples for compressionWorker
(and similarly for queryWorker and reducer) are missing the affinity field even
though _helpers.tpl supports it; update each worker's commented scheduling block
(compressionWorker, queryWorker, reducer) to include an example affinity stanza
consistent with the template (e.g., a podAffinity/podAntiAffinity or
nodeAffinity example) so users have a working reference that matches the
helpers.tpl affinity support.
♻️ Duplicate comments (1)
tools/deployment/package-helm/values.yaml (1)

21-29: Validation for single-node replica constraint is still missing.

The documentation correctly states that single-node mode "only work[s] with 1 replica for each worker", but there's no Helm-side enforcement. As discussed in past reviews, users can set replicas > 1 with distributedDeployment: false, causing pods to remain Pending due to local PV node affinity conflicts.

Consider adding validation in a template (e.g., _helpers.tpl or a dedicated validate-values.yaml):

Example validation template
{{- if not .Values.distributedDeployment }}
  {{- if or (gt (int .Values.compressionWorker.replicas) 1) (gt (int .Values.queryWorker.replicas) 1) (gt (int .Values.reducer.replicas) 1) }}
    {{- fail "When distributedDeployment=false, all worker replicas must be 1. Single-node mode does not support multiple replicas due to local PV constraints." }}
  {{- end }}
{{- end }}
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84c951b and 68d7ca4.

📒 Files selected for processing (1)
  • tools/deployment/package-helm/values.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-19T05:03:42.629Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1818
File: tools/deployment/package-helm/templates/configmap.yaml:12-12
Timestamp: 2025-12-19T05:03:42.629Z
Learning: In the y-scope/clp Helm chart, the `api_server.host` configuration field in the ConfigMap (tools/deployment/package-helm/templates/configmap.yaml) is unused because no other k8s internal services need to reach the API server—it's only accessed from outside the cluster via NodePort.

Applied to files:

  • tools/deployment/package-helm/values.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: package-image
  • GitHub Check: check-generated
  • GitHub Check: build (macos-15)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (4)
tools/deployment/package-helm/values.yaml (4)

31-32: LGTM!

Clear documentation for the workerConcurrency setting.


50-65: LGTM!

Consistent structure with compressionWorker. The same note about the missing affinity example applies here as well.


66-81: LGTM!

Consistent structure with other worker configurations.


82-87: LGTM!

Clear documentation explaining the storageClassName behaviour. The default "local-storage" with WaitForFirstConsumer binding mode is appropriate for the single-node deployment scenario.

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tools/deployment/package-helm/values.yaml`:
- Around line 21-29: The values.yaml is missing the distributedDeployment key
referenced by templates (e.g., in _helpers.tpl through
.Values.distributedDeployment), causing undefined values; add a top-level
boolean entry named distributedDeployment with a sensible default (true or false
per project convention) and a brief comment matching the existing description so
templates receive a defined boolean and follow the documented behavior; ensure
placement near the existing PV/control-plane comments so Helm picks it up and
templates using .Values.distributedDeployment (lines ~143,156,322 in
_helpers.tpl) resolve correctly.
♻️ Duplicate comments (1)
tools/deployment/package-helm/values.yaml (1)

33-48: Structure looks good; previously raised concern about missing affinity example still applies.

The compressionWorker block structure is well-organized with sensible defaults. Note that the affinity scheduling option is supported by the templates (per _helpers.tpl) but is not shown in the commented examples here—this was flagged in a previous review iteration.

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 68d7ca4 and 4a87d79.

📒 Files selected for processing (1)
  • tools/deployment/package-helm/values.yaml
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
📚 Learning: 2025-12-19T05:03:42.629Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1818
File: tools/deployment/package-helm/templates/configmap.yaml:12-12
Timestamp: 2025-12-19T05:03:42.629Z
Learning: In the y-scope/clp Helm chart, the `api_server.host` configuration field in the ConfigMap (tools/deployment/package-helm/templates/configmap.yaml) is unused because no other k8s internal services need to reach the API server—it's only accessed from outside the cluster via NodePort.

Applied to files:

  • tools/deployment/package-helm/values.yaml
📚 Learning: 2025-12-04T03:31:55.239Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1690
File: tools/deployment/package/docker-compose-all.yaml:424-427
Timestamp: 2025-12-04T03:31:55.239Z
Learning: In tools/deployment/package/docker-compose-all.yaml, the query-worker service writes to /var/data/streams (CLP_STREAM_OUTPUT_DIR_HOST mount), so this directory must remain read-write and should not be mounted with :ro flag.

Applied to files:

  • tools/deployment/package-helm/values.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: check-generated
🔇 Additional comments (3)
tools/deployment/package-helm/values.yaml (3)

49-64: LGTM!

The queryWorker block follows the same consistent structure as compressionWorker, with appropriate defaults.


65-80: LGTM!

The reducer block maintains consistency with the other worker configurations.


81-87: LGTM!

The storage block with storageClassName is well-documented, clearly explaining the conditional behaviour where the chart creates the StorageClass only when set to the default value.


kirkrodrigues
kirkrodrigues previously approved these changes Jan 14, 2026
Member

@kirkrodrigues kirkrodrigues left a comment


Approving config interface.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@tools/deployment/package-helm/values.yaml`:
- Around line 34-49: The commented scheduling examples under compressionWorker
are incomplete: add a commented affinity example (showing nodeAffinity and
podAntiAffinity usage) under scheduling so users know that affinity is
supported, and update the topologySpreadConstraints example to include a
labelSelector that targets the compression worker pods (use the same matchLabels
used for these pods, e.g., the compression worker app label or
app.kubernetes.io/name for the chart) so the spread constraint only applies to
this pod set rather than all namespace pods.
- Around line 21-29: Add a validation that prevents setting replicas > 1 when
distributedDeployment is false by adding a values schema or template check:
create/update values.schema.json to require replicas to be <=1 when
"distributedDeployment" is false (or add a validate-values.yaml that errors
during rendering), and apply this rule to all worker replica keys (e.g.,
compressionWorker.replicas, ingestionWorker.replicas, inferenceWorker.replicas
or any other worker.*.replicas used by templates). Ensure the validation
triggers a failing render/error if .Values.distributedDeployment == false && any
worker replicas > 1 so users cannot deploy multiple replicas with local PVs.
♻️ Duplicate comments (2)
tools/deployment/package-helm/values.yaml (2)

50-65: Same scheduling example gaps as compressionWorker.

The same improvements apply here: add an affinity example and include labelSelector in topologySpreadConstraints (targeting app.kubernetes.io/component: query-worker).


66-81: Same scheduling example gaps as other workers.

Add affinity example and labelSelector in topologySpreadConstraints (targeting app.kubernetes.io/component: reducer).

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a87d79 and 107bed9.

📒 Files selected for processing (1)
  • tools/deployment/package-helm/values.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-19T05:03:42.629Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1818
File: tools/deployment/package-helm/templates/configmap.yaml:12-12
Timestamp: 2025-12-19T05:03:42.629Z
Learning: In the y-scope/clp Helm chart, the `api_server.host` configuration field in the ConfigMap (tools/deployment/package-helm/templates/configmap.yaml) is unused because no other k8s internal services need to reach the API server—it's only accessed from outside the cluster via NodePort.

Applied to files:

  • tools/deployment/package-helm/values.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: package-image
  • GitHub Check: check-generated
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
tools/deployment/package-helm/values.yaml (1)

82-88: LGTM!

The storage configuration is well-structured with clear documentation explaining the auto-creation behaviour for the default local-storage StorageClass versus requiring a pre-existing one.


Comment on lines +21 to +29
# - If false: Single-node deployment.
# - PVs use local storage bound to a single node.
# - Pods automatically tolerate control-plane taints.
# - Each worker deployment supports only 1 replica.
# - If true: Multi-node deployment.
# - PVs use a path on the host (`hostPath`) without node affinity.
# - `hostPath` must point to externally managed shared storage (e.g., NFS/CephFS) mounted at the
# same path on all nodes.
distributedDeployment: false

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if validation for replicas constraint exists in templates
echo "=== Searching for validation logic in templates ==="
rg -n 'fail.*replicas|replicas.*fail|distributedDeployment.*fail' tools/deployment/package-helm/templates/

echo ""
echo "=== Check for validation in _helpers.tpl ==="
rg -n -A5 'validateDistributedMode|distributed.*replicas' tools/deployment/package-helm/templates/_helpers.tpl 2>/dev/null || echo "No validation found in _helpers.tpl"

echo ""
echo "=== Check for dedicated validation template ==="
cat tools/deployment/package-helm/templates/validate-values.yaml 2>/dev/null || echo "No validate-values.yaml found"

echo ""
echo "=== List all template files to identify validation patterns ==="
fd -e yaml . tools/deployment/package-helm/templates/ | head -20

echo ""
echo "=== Search more broadly for validation or constraints ==="
rg -n 'fail|error|constraint|validate' tools/deployment/package-helm/templates/ | head -20

Repository: y-scope/clp

Length of output: 2303


🏁 Script executed:

#!/bin/bash
# Check for schema validation files
echo "=== Check for values.schema.json ==="
cat tools/deployment/package-helm/values.schema.json 2>/dev/null || echo "No values.schema.json found"

echo ""
echo "=== Check for Chart.yaml constraints ==="
head -30 tools/deployment/package-helm/Chart.yaml

echo ""
echo "=== Search for any JSONSchema or schema definitions ==="
fd -e json . tools/deployment/package-helm/ 

echo ""
echo "=== Check the actual values.yaml to see how replicas are defined ==="
rg -n 'replicas:' tools/deployment/package-helm/values.yaml -A2 -B2

echo ""
echo "=== Check if any template uses range or conditional for replicas based on distributed mode ==="
rg -n 'range.*replicas|distributedDeployment.*replicas|replicas.*distributedDeployment' tools/deployment/package-helm/templates/ -A3 -B3

Repository: y-scope/clp

Length of output: 1227


🏁 Script executed:

#!/bin/bash
# Check one of the deployment templates to see if it handles replicas validation
echo "=== Check compression-worker-deployment.yaml for replica handling ==="
cat tools/deployment/package-helm/templates/compression-worker-deployment.yaml | grep -n 'replicas\|distributedDeployment\|fail\|if' | head -20

echo ""
echo "=== Show full deployment spec for replicas ==="
sed -n '1,50p' tools/deployment/package-helm/templates/compression-worker-deployment.yaml

Repository: y-scope/clp

Length of output: 2785


Add validation to prevent replicas > 1 with distributedDeployment=false.

The distributedDeployment key is now correctly defined, but no validation exists to prevent users from misconfiguring replicas. Deployment templates currently accept the replica count unconditionally: replicas: {{ .Values.compressionWorker.replicas }}.

When distributedDeployment=false with local PVs, pods configured with replicas > 1 will remain Pending due to node affinity constraints. Add validation in a validate-values.yaml template or via values.schema.json to fail rendering when distributedDeployment=false AND replicas > 1 for any worker component.

🤖 Prompt for AI Agents
In `@tools/deployment/package-helm/values.yaml` around lines 21 - 29, Add a
validation that prevents setting replicas > 1 when distributedDeployment is
false by adding a values schema or template check: create/update
values.schema.json to require replicas to be <=1 when "distributedDeployment" is
false (or add a validate-values.yaml that errors during rendering), and apply
this rule to all worker replica keys (e.g., compressionWorker.replicas,
ingestionWorker.replicas, inferenceWorker.replicas or any other
worker.*.replicas used by templates). Ensure the validation triggers a failing
render/error if .Values.distributedDeployment == false && any worker replicas >
1 so users cannot deploy multiple replicas with local PVs.
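A minimal sketch of the requested check, as a render-time validation template (file name `validate-values.yaml` and the worker-block names `compressionWorker`, `queryWorker`, and `reducer` follow this chart's values.yaml; any other worker keys would need to be added to the list):

```yaml
{{- /* templates/validate-values.yaml — fail rendering on invalid replica counts. */ -}}
{{- if not .Values.distributedDeployment }}
{{- range $name := list "compressionWorker" "queryWorker" "reducer" }}
{{- $worker := index $.Values $name }}
{{- if and $worker (gt (int ($worker.replicas | default 1)) 1) }}
{{- fail (printf "%s.replicas must be 1 when distributedDeployment is false (local PVs bind to a single node)" $name) }}
{{- end }}
{{- end }}
{{- end }}
```

With this in place, `helm template`/`helm install` aborts with a clear message instead of leaving extra replicas Pending on node-affinity constraints. A `values.schema.json` could express the same rule declaratively, but JSON Schema cannot easily condition one key's maximum on another key's value without `if`/`then` constructs, so a `fail`-based template is the simpler option here.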

Comment on lines +34 to +49
compressionWorker:
replicas: 1
# Controls which nodes run compression workers
# scheduling:
# nodeSelector:
# yscope.io/nodeType: compute
# tolerations:
# - key: "yscope.io/dedicated"
# operator: "Equal"
# value: "compression"
# effect: "NoSchedule"
# topologySpreadConstraints:
# - maxSkew: 1
# topologyKey: "kubernetes.io/hostname"
# whenUnsatisfiable: "DoNotSchedule"


🧹 Nitpick | 🔵 Trivial

Scheduling examples are incomplete.

Two issues with the commented scheduling examples:

  1. Missing affinity example: The templates support affinity in scheduling configurations, but no example is provided. Users may not realize this option exists.

  2. topologySpreadConstraints missing labelSelector: Without labelSelector, the constraint matches all pods in the namespace, which is rarely the intended behavior. Add a selector targeting the specific worker pods.

♻️ Suggested additions to the commented example
   # scheduling:
   #   nodeSelector:
   #     yscope.io/nodeType: compute
+  #   affinity:
+  #     podAntiAffinity:
+  #       preferredDuringSchedulingIgnoredDuringExecution:
+  #         - weight: 100
+  #           podAffinityTerm:
+  #             labelSelector:
+  #               matchLabels:
+  #                 app.kubernetes.io/component: compression-worker
+  #             topologyKey: kubernetes.io/hostname
   #   tolerations:
   #     - key: "yscope.io/dedicated"
   #       operator: "Equal"
   #       value: "compression"
   #       effect: "NoSchedule"
   #   topologySpreadConstraints:
   #     - maxSkew: 1
   #       topologyKey: "kubernetes.io/hostname"
   #       whenUnsatisfiable: "DoNotSchedule"
+  #       labelSelector:
+  #         matchLabels:
+  #           app.kubernetes.io/component: compression-worker
🤖 Prompt for AI Agents
In `@tools/deployment/package-helm/values.yaml` around lines 34 - 49, The
commented scheduling examples under compressionWorker are incomplete: add a
commented affinity example (showing nodeAffinity and podAntiAffinity usage)
under scheduling so users know that affinity is supported, and update the
topologySpreadConstraints example to include a labelSelector that targets the
compression worker pods (use the same matchLabels used for these pods, e.g., the
compression worker app label or app.kubernetes.io/name for the chart) so the
spread constraint only applies to this pod set rather than all namespace pods.

Contributor

@gibber9809 gibber9809 left a comment


Changes since my last review look good to me.

@junhaoliao junhaoliao merged commit 5901a1f into y-scope:main Jan 14, 2026
21 checks passed
davidlion pushed a commit to davidlion/clp that referenced this pull request Jan 17, 2026
…uling. (y-scope#1829)

Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
Co-authored-by: Kirk Rodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Member Author


We need easier mechanisms to load local images for debugging. Something like this has been working:

#!/usr/bin/env bash

# Single-node cluster setup for testing
# TODO: Migrate into integration test

script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"

CLP_HOME="${CLP_HOME:-/tmp/clp}"
CLUSTER_NAME="${CLUSTER_NAME:-clp-test}"

# shellcheck source=.set-up-common.sh
source "${script_dir}/.set-up-common.sh"

echo "=== Single-node setup ==="
echo "Cluster: ${CLUSTER_NAME}"
echo ""

prepare_environment "${CLUSTER_NAME}"

echo "Creating kind cluster..."
generate_kind_config 0 | kind create cluster --name "${CLUSTER_NAME}" --config=-

# Load the locally-built clp-package image into kind
repo_root="${script_dir}/../../../"
iid_file="${repo_root}/build/clp-package-image.id"
if [[ -f "$iid_file" ]]; then
    image_id=$(<"$iid_file")
    user="${USER:-$(id -un 2>/dev/null || whoami 2>/dev/null || echo unknown)}"
    short_id="${image_id#sha256:}"
    short_id="${short_id:0:4}"
    image_tag="dev-${user}-${short_id}"
    echo "Loading clp-package:${image_tag} into kind cluster..."
    kind load docker-image "clp-package:${image_tag}" --name "${CLUSTER_NAME}"
else
    echo "ERROR: No locally-built image found at ${iid_file}. Run 'task docker-images:package' first."
    exit 1
fi

echo "Installing Helm chart..."
helm uninstall test --ignore-not-found
sleep 2
helm install test "${script_dir}" \
    --set "image.clpPackage.repository=clp-package" \
    --set "image.clpPackage.tag=${image_tag}" \
    --set "image.clpPackage.pullPolicy=Never"

wait_for_cluster_ready

though we should generalize it to all the set-up scripts.

@coderabbitai create a feature request issue to track
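One possible shape for that generalization, as a sketch: factor the image-ID handling out of the script above into helper functions in `.set-up-common.sh` that each set-up script can call (the function names here are hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helpers for .set-up-common.sh: derive a dev tag from an
# image-ID file and load the matching image into a kind cluster.

# Derive a "dev-<user>-<short-id>" tag from a file containing an image ID
# such as "sha256:abcdef...".
derive_image_tag() {
    local iid_file="$1"
    local image_id
    image_id=$(<"$iid_file")
    local user="${USER:-unknown}"
    local short_id="${image_id#sha256:}"
    printf 'dev-%s-%s\n' "$user" "${short_id:0:4}"
}

# Load the locally-built image into the named kind cluster, or fail with a
# hint if no image-ID file exists.
load_local_image() {
    local iid_file="$1" cluster_name="$2"
    if [[ ! -f "$iid_file" ]]; then
        echo "ERROR: No locally-built image found at ${iid_file}." \
             "Run 'task docker-images:package' first." >&2
        return 1
    fi
    local image_tag
    image_tag=$(derive_image_tag "$iid_file")
    kind load docker-image "clp-package:${image_tag}" --name "$cluster_name"
}
```

Each set-up script would then reduce its image-loading block to a single `load_local_image "$iid_file" "$CLUSTER_NAME"` call, keeping the tag-derivation logic in one place.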

Contributor


🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/env bash

# Create a GitHub issue for the feature request
gh issue create \
  --repo y-scope/clp \
  --title "feat(helm): Add local image loading support to set-up scripts" \
  --body "## Description

The set-up scripts for Helm chart testing (e.g., \`set-up-test.sh\`, \`set-up-multi-dedicated-test.sh\`, \`set-up-multi-shared-test.sh\`) currently don't provide an easy mechanism to load locally-built Docker images into the kind cluster for debugging purposes.

## Proposed Solution

Add support for automatically detecting and loading locally-built \`clp-package\` images from \`build/clp-package-image.id\` into the kind cluster. This functionality should:

1. Check if a locally-built image exists at \`build/clp-package-image.id\`
2. Generate an appropriate dev tag (e.g., \`dev-{user}-{short-hash}\`)
3. Load the image into the kind cluster using \`kind load docker-image\`
4. Configure Helm to use the local image with \`pullPolicy=Never\`

## Reference Implementation

A working example for \`set-up-test.sh\` is provided in the PR comment showing the pattern to follow.

## Context

This feature request was raised during PR review to improve developer experience when testing Helm chart changes with locally-built images.

## Related Links

- PR: https://github.com/y-scope/clp/pull/1829
- Comment: https://github.com/y-scope/clp/pull/1829#discussion_r2744085949
- Requested by: `@junhaoliao`" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao I've created a feature request issue to track this enhancement: #1937

The issue captures the need to add local image loading support across all set-up scripts, includes your reference implementation, and links back to this PR discussion.
