
Conversation


@buremba buremba commented Jan 24, 2026

Summary

  • Add slack.enabled: true to community values (was defaulting to false, causing SLACK_BOT_TOKEN env var to not be rendered)
  • Add cleanup step in deploy workflow to remove stale orchestrator/dispatcher deployments from old architecture
  • Remove obsolete orchestrator log reference from verification step

Root Cause

The production deployment has been failing since October 2025 because:

  1. Missing slack.enabled - The values-community.yaml had a slack configuration block but omitted enabled: true, so the value defaulted to false from values.yaml
  2. Stale old deployments - Old orchestrator and dispatcher deployments still existed in the cluster, referencing ConfigMap/Secret keys that no longer exist (demo-repository, database-url)
  3. Obsolete log reference - The verification step was trying to get logs from the orchestrator deployment which no longer exists

Changes

  • charts/peerbot/values-community.yaml: Added slack.enabled: true and status reaction configs
  • .github/workflows/deploy-community.yml:
    • Added cleanup step to delete stale orchestrator/dispatcher resources
    • Removed orchestrator log reference from verification
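The resulting slack block in values-community.yaml might look like the following sketch. Only slack.enabled, enableStatusReactions, and enableProgressUpdates are named in this PR; the comments and layout are assumptions.

```yaml
# charts/peerbot/values-community.yaml (sketch)
slack:
  enabled: true                # was missing; inherited false from values.yaml
  enableStatusReactions: true  # status reaction config added in this PR
  enableProgressUpdates: true  # progress update config added in this PR
```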

Test Plan


Note

Resolves community deployment failures by enabling Slack and cleaning up legacy K8s resources before Helm deploys.

  • Enables slack in values-community.yaml and adds enableStatusReactions/enableProgressUpdates
  • Adds "Cleanup Stale Resources" step in deploy-community.yml to delete old orchestrator/dispatcher deployments and orphaned pods
  • Removes obsolete orchestrator log fetch from the verification step

Written by Cursor Bugbot for commit dcd6eff.

- Add slack.enabled: true to community values (was defaulting to false)
- Add cleanup step in deploy workflow to remove stale orchestrator/dispatcher
  deployments from old architecture
- Remove obsolete orchestrator log reference from verification step
- Add enableStatusReactions and enableProgressUpdates to slack config

claude bot commented Jan 24, 2026

Claude finished @buremba's task.


Pull Request Review

I've completed a comprehensive review of the K8s deployment fixes. Here's my detailed analysis:

Review Tasks:

  • Read and analyze changed files
  • Review workflow deployment cleanup logic
  • Review Helm values configuration changes
  • Assess code quality and best practices
  • Check for potential bugs or security issues
  • Evaluate performance considerations
  • Provide comprehensive feedback

🎯 Overall Assessment: APPROVED

This PR effectively addresses the stated root causes and follows solid DevOps practices. The changes are minimal, targeted, and well-documented.


📋 Code Quality & Best Practices

Strengths

  1. Well-structured cleanup logic (.github/workflows/deploy-community.yml:78-99):

    • Proper error handling with --ignore-not-found=true
    • Clear logging with emoji indicators for better readability
    • Graceful fallback with || true for non-critical operations
  2. Comprehensive deployment verification (lines 177-193):

    • Multi-step verification process (pods, services, logs)
    • Appropriate timeout values (300s for pod readiness)
  3. Good security practices:

    • Secrets properly injected as environment variables
    • No hardcoded credentials in workflow file

🔧 Minor Improvements

  1. Resource cleanup could be more thorough (lines 94-96):
    # Consider adding cleanup for other potential orphaned resources
    kubectl delete pods -n $NAMESPACE -l app.kubernetes.io/component=orchestrator --ignore-not-found=true 2>/dev/null || true
    Suggestion: Consider also cleaning up associated services, configmaps, or secrets from old deployments.
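A broader cleanup step along those lines could be sketched as follows. The $NAMESPACE variable and the exact label values are assumptions; only the orchestrator/dispatcher component names come from this PR.

```yaml
# .github/workflows/deploy-community.yml (sketch of an extended cleanup step)
- name: Cleanup Stale Resources
  run: |
    for component in orchestrator dispatcher; do
      # --ignore-not-found makes each delete a no-op when nothing matches
      kubectl delete deployment,service,configmap,secret \
        -n "$NAMESPACE" \
        -l app.kubernetes.io/component="$component" \
        --ignore-not-found=true || true
    done
```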

🐛 Potential Issues

⚠️ Low-Risk Concerns

  1. Race condition in cleanup (lines 83-92): The kubectl get deployment check followed by deletion could race if another process modifies deployments at the same time. This is unlikely in practice but worth noting.

  2. Missing timeout on cleanup operations: The cleanup steps don't have explicit timeouts, though this is generally not critical for delete operations.
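Both concerns can be sidestepped by deleting unconditionally, since kubectl delete is idempotent with --ignore-not-found, and by bounding the operation with an explicit timeout. A sketch, where the deployment name is a hypothetical example:

```yaml
# Sketch: no get-then-delete pre-check needed; the delete itself is idempotent
- name: Remove legacy orchestrator deployment
  run: |
    kubectl delete deployment peerbot-orchestrator \
      -n "$NAMESPACE" \
      --ignore-not-found=true \
      --timeout=60s
```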

🔍 Validation Issues

  1. Slack configuration validation (charts/peerbot/values-community.yaml:77-85):
    • The fix correctly adds enabled: true which was the root cause
    • However, there's a potential mismatch: socketMode: false while the default values.yaml:99 shows socketMode: true
    • Question: Is HTTP mode vs Socket mode intentional for the community deployment?

🔒 Security Analysis

Security Practices

  1. Proper secret handling: All sensitive values are injected via GitHub secrets
  2. Network isolation maintained: No changes to security contexts or RBAC
  3. Principle of least privilege: Cleanup only targets specific, known resources

🟢 No Security Concerns Identified

The changes don't introduce any new security vulnerabilities or reduce existing security posture.


Performance Considerations

Efficient Operations

  1. Cleanup is fast and targeted: Uses specific label selectors and resource types
  2. Deployment timeout is reasonable: 5-minute Helm timeout balances responsiveness with reliability
  3. Resource quotas unchanged: The values file maintains appropriate resource limits

💡 Performance Notes

  • The added cleanup step will add ~10-30 seconds to deployment time, which is acceptable for the reliability benefit
  • Verification steps are appropriately scoped and won't significantly impact deployment time

🧪 Test Coverage

Missing Test Coverage

  1. No automated tests for cleanup logic: The cleanup steps aren't covered by automated tests, though this is typical for infrastructure automation
  2. Manual test plan provided: The PR includes a reasonable manual test plan in the description

💡 Recommendations

Consider adding integration tests that:

  • Verify cleanup of stale resources in a test environment
  • Validate that deployments succeed after cleanup

🎯 Root Cause Analysis Validation

All Root Causes Addressed

  1. ✅ Missing slack.enabled: Fixed in values-community.yaml:78
  2. ✅ Stale old deployments: Addressed by cleanup steps in workflow
  3. ✅ Obsolete log reference: Removed orchestrator log reference (line 192 now references gateway)

The fixes directly address each identified issue with appropriate solutions.


🚀 Recommendations

  1. Consider adding a dry-run option for the cleanup step to preview what will be deleted
  2. Add monitoring/alerting for deployment failures to catch similar issues faster in the future
  3. Document the architecture change that led to these orphaned resources to prevent recurrence
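The dry-run suggestion could be implemented with kubectl's standard --dry-run flag to preview deletions without touching the cluster. A sketch; the label selector follows the PR's description:

```yaml
# Sketch: preview what the cleanup would delete without changing the cluster
- name: Preview stale-resource cleanup
  run: |
    kubectl delete deployment \
      -n "$NAMESPACE" \
      -l app.kubernetes.io/component=orchestrator \
      --ignore-not-found=true \
      --dry-run=server
```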

Final Verdict

This PR is ready to merge. It's a clean, targeted fix that addresses real production issues with minimal risk and good engineering practices.

Risk Level: Low
Merge Confidence: High



mesa-dot-dev bot commented Jan 24, 2026

Mesa Description

TL;DR

Fixed K8s deployment failures in community environment by enabling Slack integration, cleaning up stale orchestrator/dispatcher deployments from old architecture, and removing obsolete log references.

What changed?

  • charts/peerbot/values-community.yaml: Added slack.enabled: true and status reaction configurations to fix missing SLACK_BOT_TOKEN environment variable
  • .github/workflows/deploy-community.yml: Added cleanup step to remove stale orchestrator/dispatcher deployments that were referencing non-existent ConfigMap/Secret keys; removed obsolete orchestrator log reference from verification step

Description generated by Mesa.


@mesa-dot-dev mesa-dot-dev bot left a comment


Performed full review of c34ab3c...dcd6eff

Analysis

  1. Critical service-selector mismatch in service-loadbalancer.yaml still references "dispatcher" component instead of "gateway", causing the LoadBalancer service to fail matching any pods and becoming non-functional.

  2. Incomplete architecture migration cleanup - while deployments are fixed, legacy "dispatcher" terminology remains throughout the codebase in environment variables, comments, and annotations, creating confusion.

  3. Configuration inheritance pattern issue discovered where platform enablement flags (e.g., slack.enabled: true) must be explicitly set to override defaults, even when complete configurations exist.

  4. Despite the kubectl-based cleanup approach being appropriate for immediate deployment failures, a more comprehensive migration guide or documentation update would help prevent similar issues.



# Slack configuration
slack:
  enabled: true # Enable Slack platform


Severity: Medium

Good fix! This resolves the root cause where SLACK_BOT_TOKEN wasn't being rendered as an environment variable in the gateway deployment. The template at charts/peerbot/templates/gateway-deployment.yaml conditionally includes Slack env vars only when slack.enabled: true.

Note: There's a related issue in charts/peerbot/templates/service-loadbalancer.yaml:20 where the selector still uses app.kubernetes.io/component: dispatcher instead of gateway. This would prevent the LoadBalancer from routing traffic to pods. Consider filing a follow-up issue to fix this.
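The selector fix flagged above would be a one-line change along these lines. A sketch: the surrounding service fields are assumptions, and only the selector label comes from the comment.

```yaml
# charts/peerbot/templates/service-loadbalancer.yaml (sketch)
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/component: gateway  # was: dispatcher, which matched no pods
```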

Agent: 🏛 Architecture


@buremba buremba merged commit 9d48358 into main Jan 24, 2026
14 of 20 checks passed
