Merged

20 commits
fe8365f
feat(helm): Add multi-node deployment support with configurable sched…
junhaoliao Dec 26, 2025
5922b4c
bump chart version
junhaoliao Jan 7, 2026
e79a9b0
Merge branch 'main' into helm-proper-multi-node
junhaoliao Jan 7, 2026
fc80fef
Merge branch 'main' into helm-proper-multi-node
junhaoliao Jan 8, 2026
203811b
Add configurable reducer replicas and scheduling support
junhaoliao Jan 8, 2026
6d2375e
refactor(helm): Simplify log directory definition in test script
junhaoliao Jan 8, 2026
a33f5f3
Merge branch 'main' into helm-proper-multi-node
junhaoliao Jan 10, 2026
3b60031
improve param component docs - Apply suggestions from code review
junhaoliao Jan 10, 2026
553925a
refactor(helm): Rename test scripts to set-up-*
junhaoliao Jan 10, 2026
2b0e863
refactor(helm): Extract common utilities and deduplicate scripts
junhaoliao Jan 10, 2026
e416e19
fix(helm): Handle sample download failure in wait_for_cluster_ready
junhaoliao Jan 10, 2026
bdc0207
bump chart version
junhaoliao Jan 12, 2026
d642a48
refactor(helm): Rename `distributed` to `distributed_deployment` for …
junhaoliao Jan 12, 2026
f5a28f9
Merge branch 'main' into helm-proper-multi-node
junhaoliao Jan 12, 2026
f64d4c0
refactor(helm): Rename `distributed_deployment` to `distributedDeploy…
junhaoliao Jan 12, 2026
84c951b
refactor(helm): Replace `createLocalPv` with `createStaticPv` for log…
junhaoliao Jan 12, 2026
68d7ca4
Minor edits for clarity.
kirkrodrigues Jan 14, 2026
4a87d79
improve values.yaml docstrings - Apply suggestions from code review
junhaoliao Jan 14, 2026
35de162
Merge branch 'main' into helm-proper-multi-node
junhaoliao Jan 14, 2026
107bed9
Add missing `distributedDeployment` option to values.yaml
junhaoliao Jan 14, 2026
154 changes: 154 additions & 0 deletions tools/deployment/package-helm/.set-up-common.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
#!/usr/bin/env bash

# Common utilities for Helm chart set-up scripts
# Source this file from set-up-*.sh scripts

set -o errexit
set -o nounset
set -o pipefail

# Creates required directories for CLP data and logs
create_clp_directories() {
    echo "Creating CLP directories at ${CLP_HOME}..."
    mkdir -p "$CLP_HOME/var/"{data,log}/{database,queue,redis,results_cache} \
        "$CLP_HOME/var/data/"{archives,streams,staged-archives,staged-streams} \
        "$CLP_HOME/var/log/"{compression_scheduler,compression_worker,user} \
        "$CLP_HOME/var/log/"{query_scheduler,query_worker,reducer} \
        "$CLP_HOME/var/log/"{api_server,garbage_collector,log_ingestor,mcp_server} \
        "$CLP_HOME/var/tmp" \
        "$CLP_HOME/samples"
}

# Downloads sample datasets in the background
# Sets SAMPLE_DOWNLOAD_PID global variable
download_samples() {
    echo "Downloading sample datasets..."
    wget -O - https://zenodo.org/records/10516402/files/postgresql.tar.gz?download=1 \
        | tar xz -C "$CLP_HOME/samples" &
    SAMPLE_DOWNLOAD_PID=$!
Comment on lines +24 to +28

🧹 Nitpick | 🔵 Trivial

Consider adding timeout options to wget.

The wget command lacks timeout configuration. If the download stalls (network issues, unresponsive server), the background process could hang indefinitely, causing wait_for_cluster_ready to block.

Consider adding --timeout and optionally --tries for robustness:

♻️ Proposed fix
 download_samples() {
     echo "Downloading sample datasets..."
-    wget -O - https://zenodo.org/records/10516402/files/postgresql.tar.gz?download=1 \
+    wget --timeout=60 --tries=3 -q -O - \
+        https://zenodo.org/records/10516402/files/postgresql.tar.gz?download=1 \
       | tar xz -C "$CLP_HOME/samples" &
     SAMPLE_DOWNLOAD_PID=$!
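Beyond flags on wget itself, the caller can also bound the background process. A minimal sketch of that pattern, using `sleep` as a stand-in for the `wget | tar` pipeline (`download_with_deadline` is a hypothetical helper, not code from this PR):

```shell
#!/usr/bin/env bash

# Hypothetical sketch: bound a background download from the caller's side.
# `sleep 10` stands in for the real `wget ... | tar xz ... &` pipeline.
download_with_deadline() {
    local deadline_seconds=$1
    sleep 10 &                      # stand-in for the download pipeline
    local pid=$!
    local elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if (( elapsed >= deadline_seconds )); then
            kill "$pid" 2>/dev/null
            wait "$pid" 2>/dev/null
            return 1                # timed out
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    wait "$pid"                     # propagate the download's exit status
}

download_with_deadline 2 && echo "completed" || echo "timed out"
```

With a real download the deadline would be far larger than 2 s; the point is only that `wait_for_cluster_ready` can no longer block forever on a stalled transfer.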


    # Generate sample log file for garbage collector testing
    cat <<EOF > "$CLP_HOME/samples/test-gc.jsonl"
{"timestamp": $(date +%s%3N), "level": "INFO", "message": "User login successful"}
{"timestamp": $(date +%s%3N), "level": "ERROR", "message": "Database connection failed"}
EOF
Comment on lines +31 to +34

⚠️ Potential issue | 🟠 Major

Portability issue: date +%s%3N fails on macOS.

The %N format specifier for nanoseconds is GNU-specific and not supported by macOS's BSD date command, which will cause this script to fail on macOS systems.

🔧 Proposed fix for cross-platform compatibility

Use a fallback that works on both GNU and BSD date:

     # Generate sample log file for garbage collector testing
     cat <<EOF > "$CLP_HOME/samples/test-gc.jsonl"
-{"timestamp": $(date +%s%3N), "level": "INFO", "message": "User login successful"}
-{"timestamp": $(date +%s%3N), "level": "ERROR", "message": "Database connection failed"}
+{"timestamp": $(date +%s)000, "level": "INFO", "message": "User login successful"}
+{"timestamp": $(date +%s)000, "level": "ERROR", "message": "Database connection failed"}
 EOF

This uses second precision with three zeros appended, which works on both platforms and is acceptable for test data.

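A runnable sketch of a portable variant (assuming, per the comment, that second precision padded to milliseconds is acceptable for test data): probe whether `date` supports `%N` and fall back to zero-padding.

```shell
#!/usr/bin/env bash

# Sketch: portable millisecond timestamps. GNU date supports %N (nanoseconds);
# BSD/macOS date prints a literal "N", which the probe below detects.
epoch_millis() {
    if [[ "$(date +%N 2>/dev/null)" =~ ^[0-9]+$ ]]; then
        date +%s%3N                 # GNU date: true millisecond precision
    else
        echo "$(date +%s)000"       # BSD/macOS: seconds padded with zeros
    fi
}

epoch_millis    # prints a 13-digit epoch-milliseconds value
```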

}
Comment on lines +22 to +35

🧹 Nitpick | 🔵 Trivial

Consider declaring the global variable at script scope.

SAMPLE_DOWNLOAD_PID is set here and used in wait_for_cluster_ready(). This implicit coupling could be made clearer by declaring the variable at script scope (e.g., near the top of the file with a comment) or by documenting the dependency in the function's docstring.

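A sketch of the script-scope declaration the comment describes (the `download_started` helper is illustrative, not part of the PR); initializing the global to a safe default keeps later checks robust under `set -o nounset`:

```shell
#!/usr/bin/env bash
set -o nounset

# PID of the background sample download; set by download_samples() and
# consumed by wait_for_cluster_ready(). Empty until a download starts.
SAMPLE_DOWNLOAD_PID=""

# Illustrative helper: safe to call even before any download was launched.
download_started() {
    [[ -n "$SAMPLE_DOWNLOAD_PID" ]]
}

download_started || echo "no download in progress"
```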


# Cleans up existing cluster and prepares environment
# @param {string} cluster_name Name of the kind cluster
prepare_environment() {
    local cluster_name=$1

    echo "Deleting existing cluster if present..."
    kind delete cluster --name "${cluster_name}" 2>/dev/null || true

    rm -rf "$CLP_HOME"
    create_clp_directories
Comment on lines +44 to +46

🧹 Nitpick | 🔵 Trivial

Add safeguard against dangerous CLP_HOME values.

While nounset protects against an unset CLP_HOME, it doesn't guard against dangerous values like / or /home. Consider adding a sanity check before rm -rf:

♻️ Proposed fix
+    if [[ -z "${CLP_HOME}" || "${CLP_HOME}" == "/" ]]; then
+        echo "ERROR: CLP_HOME is empty or set to root"
+        return 1
+    fi
+
     rm -rf "$CLP_HOME"
     create_clp_directories

    download_samples
}
Comment on lines +37 to +48

⚠️ Potential issue | 🟠 Major

Validate CLP_HOME before destructive operation.

The rm -rf "$CLP_HOME" command at line 45 could delete unintended data if CLP_HOME is set to an incorrect value (e.g., /, /home, etc.). While the default /tmp/clp is safe and the calling scripts set it explicitly, adding validation would prevent accidents.

🛡️ Suggested safeguard
 # Cleans up existing cluster and prepares environment
 # @param {string} cluster_name Name of the kind cluster
 prepare_environment() {
     local cluster_name=$1
+    
+    # Validate CLP_HOME is set and looks safe
+    if [[ -z "${CLP_HOME:-}" ]] || [[ "$CLP_HOME" == "/" ]] || [[ "$CLP_HOME" == "/home" ]]; then
+        echo "ERROR: Invalid or unsafe CLP_HOME: ${CLP_HOME:-<unset>}"
+        return 1
+    fi
 
     echo "Deleting existing cluster if present..."
     kind delete cluster --name "${cluster_name}" 2>/dev/null || true
 
     rm -rf "$CLP_HOME"
     create_clp_directories
     download_samples
 }
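The two comments above can be combined into one guard. A standalone sketch (the accepted-path rules — under `/tmp`, `/var/tmp`, or containing `clp` — are the reviewer's suggestion, not project policy):

```shell
#!/usr/bin/env bash

# Sketch: refuse to `rm -rf` suspicious CLP_HOME values. Accepts paths under
# /tmp or /var/tmp, or any path containing "clp"; rejects "/", "/home", etc.
validate_clp_home() {
    local home=${1:-}
    [[ -n "$home" && "$home" != "/" ]] || return 1
    case "$home" in
        /tmp/*|/var/tmp/*) return 0 ;;
        *clp*) return 0 ;;
        *) return 1 ;;
    esac
}

validate_clp_home "/tmp/clp" && echo "ok: /tmp/clp"
validate_clp_home "/" || echo "rejected: /"
validate_clp_home "/home" || echo "rejected: /home"
```

In `prepare_environment` this would run before `rm -rf "$CLP_HOME"`, returning non-zero instead of deleting.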


# Generates kind cluster configuration YAML
#
# @param {int} num_workers Number of worker nodes (0 for single-node cluster)
generate_kind_config() {
    local num_workers=${1:-0}

    cat <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /home
    containerPath: /home
  - hostPath: $CLP_HOME
    containerPath: $CLP_HOME
  extraPortMappings:
  - containerPort: 30000
    hostPort: 30000
    protocol: TCP
  - containerPort: 30017
    hostPort: 30017
    protocol: TCP
  - containerPort: 30301
    hostPort: 30301
    protocol: TCP
  - containerPort: 30302
    hostPort: 30302
    protocol: TCP
  - containerPort: 30306
    hostPort: 30306
    protocol: TCP
  - containerPort: 30800
    hostPort: 30800
    protocol: TCP
EOF

    for ((i = 0; i < num_workers; i++)); do
        cat <<EOF
- role: worker
  extraMounts:
  - hostPath: /home
    containerPath: /home
  - hostPath: $CLP_HOME
    containerPath: $CLP_HOME
EOF
    done
}

# Waits for all jobs to complete and all non-job pods to be ready.
#
# @param {int} timeout_seconds Overall timeout in seconds
# @param {int} poll_interval_seconds Interval between status checks
# @param {int} wait_timeout_seconds Timeout for each kubectl wait call
# @return {int} 0 on success, 1 on timeout
wait_for_pods() {
    local timeout_seconds=$1
    local poll_interval_seconds=$2
    local wait_timeout_seconds=$3

    echo "Waiting for all pods to be ready" \
        "(timeout=${timeout_seconds}s, poll=${poll_interval_seconds}s," \
        "wait=${wait_timeout_seconds}s)..."

    # Reset bash built-in SECONDS counter
    SECONDS=0

    while true; do
        sleep "${poll_interval_seconds}"
        kubectl get pods
Comment on lines +117 to +119

🧹 Nitpick | 🔵 Trivial

Sleep before first check causes unnecessary delay.

The sleep at the start of the loop means the first status check is delayed. Consider moving the sleep to the end of the loop or using a do-while pattern:

♻️ Proposed fix
     while true; do
-        sleep "${poll_interval_seconds}"
         kubectl get pods
 
         if kubectl wait job \
@@ ...
             echo "---"
         fi
+
+        sleep "${poll_interval_seconds}"
     done


        if kubectl wait job \
                --all \
                --for=condition=Complete \
                --timeout="${wait_timeout_seconds}s" 2>/dev/null \
            && kubectl wait pods \
                --all \
                --selector='!job-name' \
                --for=condition=Ready \
                --timeout="${wait_timeout_seconds}s" 2>/dev/null
        then
Comment on lines +121 to +130

🧹 Nitpick | 🔵 Trivial

Error suppression may hide legitimate kubectl failures.

The 2>/dev/null on both kubectl wait commands hides all stderr output. While this reduces noise when pods aren't ready, it could also hide legitimate errors (API failures, invalid selectors). Consider logging errors on final timeout or using a more selective approach.

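One way to keep polling quiet without losing diagnostics is to capture stderr and print it only on failure. A sketch (`run_and_capture` is hypothetical; `bash -c` stands in for a `kubectl wait` invocation):

```shell
#!/usr/bin/env bash

# Sketch: run a command, discarding stdout but capturing stderr; surface the
# captured message only when the command fails.
run_and_capture() {
    local err
    if err=$("$@" 2>&1 >/dev/null); then
        return 0
    fi
    echo "command failed: $err" >&2
    return 1
}

# Stand-in for a failing `kubectl wait` call:
run_and_capture bash -c 'echo "timed out waiting for condition" >&2; exit 1' \
    || echo "failure was logged"
```

Note the redirection order: `2>&1 >/dev/null` routes stderr into the command substitution before stdout is discarded.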

            echo "All jobs completed and services are ready."
            return 0
        fi

        if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
            echo "ERROR: Timed out waiting for pods to be ready"
            return 1
        fi

        echo "---"
    done
Comment on lines +117 to +141

🧹 Nitpick | 🔵 Trivial

Consider moving sleep after the status check.

Placing sleep at the start of the loop (Line 118) adds an unnecessary delay before the first status check. Moving it to the end of the loop would allow faster completion detection when pods are already ready.

♻️ Suggested fix
     while true; do
-        sleep "${poll_interval_seconds}"
         kubectl get pods

         if kubectl wait job \
                 --all \
                 --for=condition=Complete \
                 --timeout="${wait_timeout_seconds}s" 2>/dev/null \
             && kubectl wait pods \
                 --all \
                 --selector='!job-name' \
                 --for=condition=Ready \
                 --timeout="${wait_timeout_seconds}s" 2>/dev/null
         then
             echo "All jobs completed and services are ready."
             return 0
         fi

         if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
             echo "ERROR: Timed out waiting for pods to be ready"
             return 1
         fi

         echo "---"
+        sleep "${poll_interval_seconds}"
     done

}
Comment on lines +105 to +142

⚠️ Potential issue | 🟠 Major

Timeout logic may exceed the specified limit.

The timeout check occurs after the kubectl wait calls (lines 34-42), which can each run for up to wait_timeout_seconds. This means the actual timeout could exceed timeout_seconds by up to wait_timeout_seconds before the function returns.

Additionally, suppressing stderr with 2>/dev/null hides all errors, including legitimate issues such as connection failures or missing resources, which complicates debugging.

♻️ Suggested improvements
  1. Check timeout before waiting:
     while true; do
+        if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
+            echo "ERROR: Timed out waiting for pods to be ready"
+            return 1
+        fi
+
         sleep "${poll_interval_seconds}"
         kubectl get pods
 
         if kubectl wait job \
                 --all \
                 --for=condition=Complete \
                 --timeout="${wait_timeout_seconds}s" 2>/dev/null \
             && kubectl wait pods \
                 --all \
                 --selector='!job-name' \
                 --for=condition=Ready \
                 --timeout="${wait_timeout_seconds}s" 2>/dev/null
         then
             echo "All jobs completed and services are ready."
             return 0
         fi
 
-        if [[ ${SECONDS} -ge ${timeout_seconds} ]]; then
-            echo "ERROR: Timed out waiting for pods to be ready"
-            return 1
-        fi
-
         echo "---"
     done
 2. Capture and log kubectl errors instead of suppressing them entirely:
-        if kubectl wait job \
-                --all \
-                --for=condition=Complete \
-                --timeout="${wait_timeout_seconds}s" 2>/dev/null \
-            && kubectl wait pods \
-                --all \
-                --selector='!job-name' \
-                --for=condition=Ready \
-                --timeout="${wait_timeout_seconds}s" 2>/dev/null
+        local job_wait_output
+        local pod_wait_output
+        job_wait_output=$(kubectl wait job \
+                --all \
+                --for=condition=Complete \
+                --timeout="${wait_timeout_seconds}s" 2>&1) \
+            && pod_wait_output=$(kubectl wait pods \
+                --all \
+                --selector='!job-name' \
+                --for=condition=Ready \
+                --timeout="${wait_timeout_seconds}s" 2>&1)

Comment on lines +99 to +142

🧹 Nitpick | 🔵 Trivial

Consider preserving error messages during final wait attempt.

The polling logic is sound, but suppressing stderr with 2>/dev/null on every kubectl wait call hides potentially useful error messages. While this reduces noise during normal polling, it makes debugging harder when pods fail to become ready.

♻️ Optional improvement for better diagnostics
     while true; do
         sleep "${poll_interval_seconds}"
         kubectl get pods
 
+        # On final attempt, show errors for debugging
+        if [[ ${SECONDS} -ge $((timeout_seconds - poll_interval_seconds)) ]]; then
+            redirect=""
+        else
+            redirect="2>/dev/null"
+        fi
+
-        if kubectl wait job \
+        if eval kubectl wait job \
                 --all \
                 --for=condition=Complete \
-                --timeout="${wait_timeout_seconds}s" 2>/dev/null \
-            && kubectl wait pods \
+                --timeout="${wait_timeout_seconds}s" $redirect \
+            && eval kubectl wait pods \
                 --all \
                 --selector='!job-name' \
                 --for=condition=Ready \
-                --timeout="${wait_timeout_seconds}s" 2>/dev/null
+                --timeout="${wait_timeout_seconds}s" $redirect
         then
             echo "All jobs completed and services are ready."
             return 0
         fi

Note: This uses eval to conditionally apply redirection, which requires careful handling.

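The bounded per-call timeout suggested above — min(remaining, `wait_timeout_seconds`) — can be sketched as a pure-bash helper (kubectl is not invoked here; the function name is illustrative):

```shell
#!/usr/bin/env bash

# Sketch: cap each `kubectl wait` at min(remaining, wait_timeout_seconds) so
# the overall deadline cannot be overshot. Returns non-zero when the budget
# is already exhausted.
bounded_wait_timeout() {
    local timeout_seconds=$1 wait_timeout_seconds=$2 elapsed=$3
    local remaining=$((timeout_seconds - elapsed))
    if (( remaining <= 0 )); then
        echo 0
        return 1
    fi
    if (( remaining < wait_timeout_seconds )); then
        echo "$remaining"
    else
        echo "$wait_timeout_seconds"
    fi
}

bounded_wait_timeout 300 5 10    # prints 5 (full per-call timeout fits)
bounded_wait_timeout 300 5 298   # prints 2 (only 2s of budget remain)
```

Inside `wait_for_pods`, `$SECONDS` would supply the `elapsed` argument, and the returned value would replace the fixed `--timeout="${wait_timeout_seconds}s"`.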


# Waits for sample download to complete and all pods to be ready
wait_for_cluster_ready() {
    if wait "$SAMPLE_DOWNLOAD_PID"; then
        echo "Sample download and extraction complete"
    else
        echo "ERROR: Sample download failed"
        return 1
    fi

    wait_for_pods 300 5 5
}
Comment on lines +144 to +154

🧹 Nitpick | 🔵 Trivial

Consider parameterizing timeout values.

The hardcoded values 300 5 5 in wait_for_pods call may not be suitable for all environments (e.g., slower CI runners or larger clusters). Consider accepting these as optional parameters with defaults.

♻️ Suggested fix
 # Waits for sample download to complete and all pods to be ready
-wait_for_cluster_ready() {
+# @param {int} timeout_seconds Overall timeout (default: 300)
+# @param {int} poll_interval_seconds Poll interval (default: 5)
+# @param {int} wait_timeout_seconds kubectl wait timeout (default: 5)
+wait_for_cluster_ready() {
+    local timeout_seconds=${1:-300}
+    local poll_interval_seconds=${2:-5}
+    local wait_timeout_seconds=${3:-5}
+
     if wait "$SAMPLE_DOWNLOAD_PID"; then
         echo "Sample download and extraction complete"
     else
         echo "ERROR: Sample download failed"
         return 1
     fi

-    wait_for_pods 300 5 5
+    wait_for_pods "$timeout_seconds" "$poll_interval_seconds" "$wait_timeout_seconds"
 }
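The defaulted-parameter pattern from the suggestion, shown standalone (the `_sketch` suffix marks this as illustrative rather than the PR's function):

```shell
#!/usr/bin/env bash

# Sketch: optional positional parameters with defaults matching the current
# hardcoded 300/5/5 values; callers may omit any trailing argument.
wait_for_cluster_ready_sketch() {
    local timeout_seconds=${1:-300}
    local poll_interval_seconds=${2:-5}
    local wait_timeout_seconds=${3:-5}
    echo "timeout=${timeout_seconds} poll=${poll_interval_seconds} wait=${wait_timeout_seconds}"
}

wait_for_cluster_ready_sketch          # prints timeout=300 poll=5 wait=5
wait_for_cluster_ready_sketch 600 10   # prints timeout=600 poll=10 wait=5
```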

2 changes: 1 addition & 1 deletion tools/deployment/package-helm/Chart.yaml
@@ -1,6 +1,6 @@
apiVersion: "v2"
name: "clp"
version: "0.1.2-dev.17"
version: "0.1.2-dev.18"
description: "A Helm chart for CLP's (Compressed Log Processor) package deployment"
type: "application"
appVersion: "0.7.1-dev"
62 changes: 62 additions & 0 deletions tools/deployment/package-helm/set-up-multi-dedicated-test.sh
@@ -0,0 +1,62 @@
#!/usr/bin/env bash

# Multi-node cluster setup with dedicated worker nodes for each worker type
# Demonstrates nodeSelector scheduling with separate node pools
# TODO: Migrate into integration test

script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"

CLP_HOME="${CLP_HOME:-/tmp/clp}"
CLUSTER_NAME="${CLUSTER_NAME:-clp-test}"
NUM_COMPRESSION_NODES="${NUM_COMPRESSION_NODES:-2}"
NUM_QUERY_NODES="${NUM_QUERY_NODES:-2}"
COMPRESSION_WORKER_REPLICAS="${COMPRESSION_WORKER_REPLICAS:-2}"
QUERY_WORKER_REPLICAS="${QUERY_WORKER_REPLICAS:-2}"
REDUCER_REPLICAS="${REDUCER_REPLICAS:-2}"

# shellcheck source=.set-up-common.sh
source "${script_dir}/.set-up-common.sh"

echo "=== Multi-node setup with dedicated worker nodes ==="
echo "Cluster: ${CLUSTER_NAME}"
echo "Compression nodes: ${NUM_COMPRESSION_NODES}"
echo "Query nodes: ${NUM_QUERY_NODES}"
echo "Compression workers: ${COMPRESSION_WORKER_REPLICAS}"
echo "Query workers: ${QUERY_WORKER_REPLICAS}"
echo "Reducers: ${REDUCER_REPLICAS}"
echo ""

prepare_environment "${CLUSTER_NAME}"

total_workers=$((NUM_COMPRESSION_NODES + NUM_QUERY_NODES))

echo "Creating kind cluster..."
generate_kind_config "${total_workers}" | kind create cluster --name "${CLUSTER_NAME}" --config=-

echo "Labeling worker nodes..."
mapfile -t worker_nodes < <(kubectl get nodes --selector='!node-role.kubernetes.io/control-plane' -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n')

# Label compression nodes
for ((i = 0; i < NUM_COMPRESSION_NODES; i++)); do
echo "Labeling ${worker_nodes[$i]} as compression node"
kubectl label node "${worker_nodes[$i]}" yscope.io/nodeType=compression --overwrite
done

# Label query nodes
for ((i = NUM_COMPRESSION_NODES; i < total_workers; i++)); do
echo "Labeling ${worker_nodes[$i]} as query node"
kubectl label node "${worker_nodes[$i]}" yscope.io/nodeType=query --overwrite
done
Comment on lines +36 to +49
Contributor
⚠️ Potential issue | 🟡 Minor

Add bounds checking before accessing worker_nodes array.

If the cluster has fewer worker nodes than expected (e.g., NUM_COMPRESSION_NODES + NUM_QUERY_NODES), accessing worker_nodes[$i] will silently fail or produce empty values. Consider validating that the array contains the expected number of nodes.

🔧 Suggested fix
 echo "Labeling worker nodes..."
 mapfile -t worker_nodes < <(kubectl get nodes --selector='!node-role.kubernetes.io/control-plane' -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n')
 
+if [[ ${#worker_nodes[@]} -lt ${total_workers} ]]; then
+    echo "ERROR: Expected ${total_workers} worker nodes, but found ${#worker_nodes[@]}"
+    exit 1
+fi
+
 # Label compression nodes
 for ((i = 0; i < NUM_COMPRESSION_NODES; i++)); do

echo "Installing Helm chart..."
helm uninstall test --ignore-not-found
sleep 2
Comment on lines +52 to +53
Contributor
🧹 Nitpick | 🔵 Trivial

Consider replacing magic sleep with a more robust wait.

The sleep 2 after helm uninstall is a timing-based workaround. While acceptable for a test script, a more robust approach would wait for resources to be fully removed.

♻️ Alternative approach
 helm uninstall test --ignore-not-found
-sleep 2
+helm uninstall test --ignore-not-found --wait 2>/dev/null || true

helm install test "${script_dir}" \
--set "distributedDeployment=true" \
--set "compressionWorker.replicas=${COMPRESSION_WORKER_REPLICAS}" \
--set "compressionWorker.scheduling.nodeSelector.yscope\.io/nodeType=compression" \
--set "queryWorker.replicas=${QUERY_WORKER_REPLICAS}" \
--set "queryWorker.scheduling.nodeSelector.yscope\.io/nodeType=query" \
--set "reducer.replicas=${REDUCER_REPLICAS}"

wait_for_cluster_ready
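Note the backslash in `yscope\.io/nodeType` above: Helm's `--set` parser treats an unescaped dot as a nested-key separator, so dots inside a key (here, the label's domain prefix) must be escaped. An equivalent values-file fragment (a sketch mirroring the flags above; the filename is hypothetical) sidesteps the escaping entirely:

```yaml
# values-override.yaml — passed via `helm install test . -f values-override.yaml`
compressionWorker:
  replicas: 2
  scheduling:
    nodeSelector:
      yscope.io/nodeType: "compression"
queryWorker:
  replicas: 2
  scheduling:
    nodeSelector:
      yscope.io/nodeType: "query"
```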
41 changes: 41 additions & 0 deletions tools/deployment/package-helm/set-up-multi-shared-test.sh
@@ -0,0 +1,41 @@
#!/usr/bin/env bash

# Multi-node cluster setup with shared worker nodes
# Both compression and query workers share the same node pool
# TODO: Migrate into integration test
Contributor
🧹 Nitpick | 🔵 Trivial

Acknowledge TODO for integration test migration.

The TODO comment indicates this is a temporary setup script. If you'd like, I can help create a GitHub issue to track migrating this into a proper integration test.



Comment on lines +1 to +6
Contributor
🧹 Nitpick | 🔵 Trivial

Consider enabling bash strict mode for reliability.

The script lacks `set -euo pipefail`, which would make it fail fast on errors, undefined variables, and pipeline failures. This is particularly important for setup scripts, where partial execution can leave the environment in an inconsistent state.

Proposed fix
 #!/usr/bin/env bash
 
+set -euo pipefail
+
 # Multi-node cluster setup with shared worker nodes
 # Both compression and query workers share the same node pool
 # TODO: Migrate into integration test

script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"

CLP_HOME="${CLP_HOME:-/tmp/clp}"
CLUSTER_NAME="${CLUSTER_NAME:-clp-test}"
NUM_WORKER_NODES="${NUM_WORKER_NODES:-2}"
COMPRESSION_WORKER_REPLICAS="${COMPRESSION_WORKER_REPLICAS:-2}"
QUERY_WORKER_REPLICAS="${QUERY_WORKER_REPLICAS:-2}"
REDUCER_REPLICAS="${REDUCER_REPLICAS:-2}"

# shellcheck source=.set-up-common.sh
source "${script_dir}/.set-up-common.sh"

echo "=== Multi-node setup with shared worker nodes ==="
echo "Cluster: ${CLUSTER_NAME}"
echo "Worker nodes: ${NUM_WORKER_NODES}"
echo "Compression workers: ${COMPRESSION_WORKER_REPLICAS}"
echo "Query workers: ${QUERY_WORKER_REPLICAS}"
echo "Reducers: ${REDUCER_REPLICAS}"
echo ""

prepare_environment "${CLUSTER_NAME}"

echo "Creating kind cluster..."
generate_kind_config "${NUM_WORKER_NODES}" | kind create cluster --name "${CLUSTER_NAME}" --config=-

echo "Installing Helm chart..."
helm uninstall test --ignore-not-found
sleep 2
Comment on lines +33 to +34
Contributor
🧹 Nitpick | 🔵 Trivial

Fixed sleep is fragile; consider a retry loop or event-based wait.

The sleep 2 after helm uninstall assumes 2 seconds is always sufficient for cleanup. While acceptable for a temporary setup script, a more robust approach would wait for the release to be fully removed.

Alternative approach (optional)
 helm uninstall test --ignore-not-found
-sleep 2
+helm uninstall test --ignore-not-found --wait 2>/dev/null || true

The --wait flag makes Helm wait for resources to be deleted before returning.
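Where `helm uninstall --wait` is unavailable or undesirable, a generic polling helper can serve the same purpose. A sketch under stated assumptions — the helper name, timeout, and release name are illustrative, not part of the scripts above:

```shell
# Polls until the given command stops succeeding, or the timeout elapses.
# Usage: wait_until_gone <timeout_seconds> <command...>
wait_until_gone() {
    local timeout_seconds="$1"
    shift
    local deadline=$((SECONDS + timeout_seconds))
    while "$@" &>/dev/null; do
        if ((SECONDS >= deadline)); then
            echo "ERROR: still present after ${timeout_seconds}s: $*" >&2
            return 1
        fi
        sleep 1
    done
}

# Example (hypothetical release name):
#   helm uninstall test --ignore-not-found
#   wait_until_gone 60 helm status test
```

Because `helm status` exits non-zero once the release is gone, the loop terminates as soon as cleanup completes rather than after a fixed delay.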

helm install test "${script_dir}" \
--set "distributedDeployment=true" \
--set "compressionWorker.replicas=${COMPRESSION_WORKER_REPLICAS}" \
--set "queryWorker.replicas=${QUERY_WORKER_REPLICAS}" \
--set "reducer.replicas=${REDUCER_REPLICAS}"

wait_for_cluster_ready
28 changes: 28 additions & 0 deletions tools/deployment/package-helm/set-up-test.sh
Member Author
We need easier mechanisms to load local images for debugging. Something like this has been working:

#!/usr/bin/env bash

# Single-node cluster setup for testing
# TODO: Migrate into integration test

script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"

CLP_HOME="${CLP_HOME:-/tmp/clp}"
CLUSTER_NAME="${CLUSTER_NAME:-clp-test}"

# shellcheck source=.set-up-common.sh
source "${script_dir}/.set-up-common.sh"

echo "=== Single-node setup ==="
echo "Cluster: ${CLUSTER_NAME}"
echo ""

prepare_environment "${CLUSTER_NAME}"

echo "Creating kind cluster..."
generate_kind_config 0 | kind create cluster --name "${CLUSTER_NAME}" --config=-

# Load the locally-built clp-package image into kind
repo_root="${script_dir}/../../../"
iid_file="${repo_root}/build/clp-package-image.id"
if [[ -f "$iid_file" ]]; then
    image_id=$(<"$iid_file")
    user="${USER:-$(id -un 2>/dev/null || whoami 2>/dev/null || echo unknown)}"
    short_id="${image_id#sha256:}"
    short_id="${short_id:0:4}"
    image_tag="dev-${user}-${short_id}"
    echo "Loading clp-package:${image_tag} into kind cluster..."
    kind load docker-image "clp-package:${image_tag}" --name "${CLUSTER_NAME}"
else
    echo "ERROR: No locally-built image found at ${iid_file}. Run 'task docker-images:package' first."
    exit 1
fi

echo "Installing Helm chart..."
helm uninstall test --ignore-not-found
sleep 2
helm install test "${script_dir}" \
    --set "image.clpPackage.repository=clp-package" \
    --set "image.clpPackage.tag=${image_tag}" \
    --set "image.clpPackage.pullPolicy=Never"

wait_for_cluster_ready

though we should generalize it to all the set-up scripts.

@coderabbitai create a feature request issue to track

Contributor
🧩 Analysis chain

🏁 Script executed:

#!/usr/bin/env bash

# Create a GitHub issue for the feature request
gh issue create \
  --repo y-scope/clp \
  --title "feat(helm): Add local image loading support to set-up scripts" \
  --body "## Description

The set-up scripts for Helm chart testing (e.g., \`set-up-test.sh\`, \`set-up-multi-dedicated-test.sh\`, \`set-up-multi-shared-test.sh\`) currently don't provide an easy mechanism to load locally-built Docker images into the kind cluster for debugging purposes.

## Proposed Solution

Add support for automatically detecting and loading locally-built \`clp-package\` images from \`build/clp-package-image.id\` into the kind cluster. This functionality should:

1. Check if a locally-built image exists at \`build/clp-package-image.id\`
2. Generate an appropriate dev tag (e.g., \`dev-{user}-{short-hash}\`)
3. Load the image into the kind cluster using \`kind load docker-image\`
4. Configure Helm to use the local image with \`pullPolicy=Never\`

## Reference Implementation

A working example for \`set-up-test.sh\` is provided in the PR comment showing the pattern to follow.

## Context

This feature request was raised during PR review to improve developer experience when testing Helm chart changes with locally-built images.

## Related Links

- PR: https://github.com/y-scope/clp/pull/1829
- Comment: https://github.com/y-scope/clp/pull/1829#discussion_r2744085949
- Requested by: `@junhaoliao`" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao I've created a feature request issue to track this enhancement: #1937

The issue captures the need to add local image loading support across all set-up scripts, includes your reference implementation, and links back to this PR discussion.

@@ -0,0 +1,28 @@
#!/usr/bin/env bash

# Single-node cluster setup for testing
# TODO: Migrate into integration test

script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"

CLP_HOME="${CLP_HOME:-/tmp/clp}"
CLUSTER_NAME="${CLUSTER_NAME:-clp-test}"

# shellcheck source=.set-up-common.sh
source "${script_dir}/.set-up-common.sh"

echo "=== Single-node setup ==="
echo "Cluster: ${CLUSTER_NAME}"
echo ""

prepare_environment "${CLUSTER_NAME}"

echo "Creating kind cluster..."
generate_kind_config 0 | kind create cluster --name "${CLUSTER_NAME}" --config=-

echo "Installing Helm chart..."
helm uninstall test --ignore-not-found
sleep 2
helm install test "${script_dir}"
Comment on lines +23 to +26
Contributor
🧹 Nitpick | 🔵 Trivial

Consider polling instead of fixed sleep.

The 2-second sleep between helm uninstall and helm install is a workaround to allow resource finalization. For more robust behaviour, consider polling for resource deletion or using helm uninstall --wait, which blocks until resources are deleted.

♻️ Suggested improvement
 echo "Installing Helm chart..."
-helm uninstall test --ignore-not-found
-sleep 2
+helm uninstall test --ignore-not-found --wait
 helm install test "${script_dir}"


wait_for_cluster_ready