
Conversation


@buremba buremba commented Jan 24, 2026

Summary

  • Add slack.enabled: true to community values (was defaulting to false, causing SLACK_BOT_TOKEN env var to not be rendered)
  • Add cleanup step in deploy workflow to remove stale orchestrator/dispatcher deployments from old architecture
  • Remove obsolete orchestrator log reference from verification step

Root Cause

The production deployment has been failing since October 2025 because:

  1. Missing slack.enabled - The values-community.yaml had a slack configuration block but omitted enabled: true, so the value defaulted to false from values.yaml
  2. Stale old deployments - Old orchestrator and dispatcher deployments still existed in the cluster, referencing ConfigMap/Secret keys that no longer exist (demo-repository, database-url)
  3. Obsolete log reference - The verification step was trying to get logs from the orchestrator deployment which no longer exists

Changes

  • charts/peerbot/values-community.yaml: Added slack.enabled: true and status reaction configs
  • .github/workflows/deploy-community.yml:
    • Added cleanup step to delete stale orchestrator/dispatcher resources
    • Removed orchestrator log reference from verification
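The resulting slack block in values-community.yaml might look like the following sketch. Only slack.enabled, enableStatusReactions, and enableProgressUpdates are named in this PR; the comments and layout are assumptions.

```yaml
# charts/peerbot/values-community.yaml (sketch)
slack:
  enabled: true                # was missing; inherited false from values.yaml
  enableStatusReactions: true  # status reaction config added in this PR
  enableProgressUpdates: true  # progress update config added in this PR
```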

Test Plan


Note

Resolves community deployment failures by enabling Slack and cleaning up legacy K8s resources before Helm deploys.

  • Enables slack in values-community.yaml and adds enableStatusReactions/enableProgressUpdates
  • Adds "Cleanup Stale Resources" step in deploy-community.yml to delete old orchestrator/dispatcher deployments and orphaned pods
  • Removes obsolete orchestrator log fetch from the verification step

Written by Cursor Bugbot for commit dcd6eff.

- Add slack.enabled: true to community values (was defaulting to false)
- Add cleanup step in deploy workflow to remove stale orchestrator/dispatcher
  deployments from old architecture
- Remove obsolete orchestrator log reference from verification step
- Add enableStatusReactions and enableProgressUpdates to slack config

claude bot commented Jan 24, 2026

Claude finished @buremba's task.


Pull Request Review

I've completed a comprehensive review of the K8s deployment fixes. Here's my detailed analysis:

Review Tasks:

  • Read and analyze changed files
  • Review workflow deployment cleanup logic
  • Review Helm values configuration changes
  • Assess code quality and best practices
  • Check for potential bugs or security issues
  • Evaluate performance considerations
  • Provide comprehensive feedback

🎯 Overall Assessment: APPROVED

This PR effectively addresses the stated root causes and follows solid DevOps practices. The changes are minimal, targeted, and well-documented.


📋 Code Quality & Best Practices

Strengths

  1. Well-structured cleanup logic (.github/workflows/deploy-community.yml:78-99):

    • Proper error handling with --ignore-not-found=true
    • Clear logging with emoji indicators for better readability
    • Graceful fallback with || true for non-critical operations
  2. Comprehensive deployment verification (lines 177-193):

    • Multi-step verification process (pods, services, logs)
    • Appropriate timeout values (300s for pod readiness)
  3. Good security practices:

    • Secrets properly injected as environment variables
    • No hardcoded credentials in workflow file

🔧 Minor Improvements

  1. Resource cleanup could be more thorough (lines 94-96):
    # Consider adding cleanup for other potential orphaned resources
    kubectl delete pods -n $NAMESPACE -l app.kubernetes.io/component=orchestrator --ignore-not-found=true 2>/dev/null || true
    Suggestion: Consider also cleaning up associated services, configmaps, or secrets from old deployments.
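A broader cleanup step along those lines could be sketched as follows. The $NAMESPACE variable and the exact label values are assumptions; only the orchestrator/dispatcher component names come from this PR.

```yaml
# .github/workflows/deploy-community.yml (sketch of an extended cleanup step)
- name: Cleanup Stale Resources
  run: |
    for component in orchestrator dispatcher; do
      # --ignore-not-found makes each delete a no-op when nothing matches
      kubectl delete deployment,service,configmap,secret \
        -n "$NAMESPACE" \
        -l app.kubernetes.io/component="$component" \
        --ignore-not-found=true || true
    done
```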

🐛 Potential Issues

⚠️ Low-Risk Concerns

  1. Race condition in cleanup (lines 83-92): The kubectl get deployment check followed by deletion could race if another process modifies deployments at the same time. This is unlikely in practice but worth noting.

  2. Missing timeout on cleanup operations: The cleanup steps don't have explicit timeouts, though this is generally not critical for delete operations.
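Both concerns can be sidestepped by deleting unconditionally, since kubectl delete is idempotent with --ignore-not-found, and by bounding the operation with an explicit timeout. A sketch, where the deployment name is a hypothetical example:

```yaml
# Sketch: no get-then-delete pre-check needed; the delete itself is idempotent
- name: Remove legacy orchestrator deployment
  run: |
    kubectl delete deployment peerbot-orchestrator \
      -n "$NAMESPACE" \
      --ignore-not-found=true \
      --timeout=60s
```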

🔍 Validation Issues

  1. Slack configuration validation (charts/peerbot/values-community.yaml:77-85):
    • The fix correctly adds enabled: true which was the root cause
    • However, there's a potential mismatch: socketMode: false while the default values.yaml:99 shows socketMode: true
    • Question: Is HTTP mode vs Socket mode intentional for the community deployment?

🔒 Security Analysis

Security Practices

  1. Proper secret handling: All sensitive values are injected via GitHub secrets
  2. Network isolation maintained: No changes to security contexts or RBAC
  3. Principle of least privilege: Cleanup only targets specific, known resources

🟢 No Security Concerns Identified

The changes don't introduce any new security vulnerabilities or reduce existing security posture.


Performance Considerations

Efficient Operations

  1. Cleanup is fast and targeted: Uses specific label selectors and resource types
  2. Deployment timeout is reasonable: 5-minute Helm timeout balances responsiveness with reliability
  3. Resource quotas unchanged: The values file maintains appropriate resource limits

💡 Performance Notes

  • The added cleanup step will add ~10-30 seconds to deployment time, which is acceptable for the reliability benefit
  • Verification steps are appropriately scoped and won't significantly impact deployment time

🧪 Test Coverage

Missing Test Coverage

  1. No automated tests for cleanup logic: The cleanup steps aren't covered by automated tests, though this is typical for infrastructure automation
  2. Manual test plan provided: The PR includes a reasonable manual test plan in the description

💡 Recommendations

Consider adding integration tests that:

  • Verify cleanup of stale resources in a test environment
  • Validate that deployments succeed after cleanup

🎯 Root Cause Analysis Validation

All Root Causes Addressed

  1. ✅ Missing slack.enabled: Fixed in values-community.yaml:78
  2. ✅ Stale old deployments: Addressed by cleanup steps in workflow
  3. ✅ Obsolete log reference: Removed orchestrator log reference (line 192 now references gateway)

The fixes directly address each identified issue with appropriate solutions.


🚀 Recommendations

  1. Consider adding a dry-run option for the cleanup step to preview what will be deleted
  2. Add monitoring/alerting for deployment failures to catch similar issues faster in the future
  3. Document the architecture change that led to these orphaned resources to prevent recurrence
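The dry-run suggestion could be implemented with kubectl's standard --dry-run flag to preview deletions without touching the cluster. A sketch; the label selector follows the PR's description:

```yaml
# Sketch: preview what the cleanup would delete without changing the cluster
- name: Preview stale-resource cleanup
  run: |
    kubectl delete deployment \
      -n "$NAMESPACE" \
      -l app.kubernetes.io/component=orchestrator \
      --ignore-not-found=true \
      --dry-run=server
```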

Final Verdict

This PR is ready to merge. It's a clean, targeted fix that addresses real production issues with minimal risk and good engineering practices.

Risk Level: Low
Merge Confidence: High



mesa-dot-dev bot commented Jan 24, 2026

Mesa Description

TL;DR

Fixed K8s deployment failures in community environment by enabling Slack integration, cleaning up stale orchestrator/dispatcher deployments from old architecture, and removing obsolete log references.

What changed?

  • charts/peerbot/values-community.yaml: Added slack.enabled: true and status reaction configurations to fix missing SLACK_BOT_TOKEN environment variable
  • .github/workflows/deploy-community.yml: Added cleanup step to remove stale orchestrator/dispatcher deployments that were referencing non-existent ConfigMap/Secret keys; removed obsolete orchestrator log reference from verification step

Description generated by Mesa.


@mesa-dot-dev mesa-dot-dev bot left a comment


Performed full review of c34ab3c...dcd6eff

Analysis

  1. Critical service-selector mismatch in service-loadbalancer.yaml still references "dispatcher" component instead of "gateway", causing the LoadBalancer service to fail matching any pods and becoming non-functional.

  2. Incomplete architecture migration cleanup - while deployments are fixed, legacy "dispatcher" terminology remains throughout the codebase in environment variables, comments, and annotations, creating confusion.

  3. Configuration inheritance pattern issue discovered where platform enablement flags (e.g., slack.enabled: true) must be explicitly set to override defaults, even when complete configurations exist.

  4. Despite the kubectl-based cleanup approach being appropriate for immediate deployment failures, a more comprehensive migration guide or documentation update would help prevent similar issues.



# Slack configuration
slack:
  enabled: true # Enable Slack platform


Severity: Medium

Good fix! This resolves the root cause where SLACK_BOT_TOKEN wasn't being rendered as an environment variable in the gateway deployment. The template at charts/peerbot/templates/gateway-deployment.yaml conditionally includes Slack env vars only when slack.enabled: true.

Note: There's a related issue in charts/peerbot/templates/service-loadbalancer.yaml:20 where the selector still uses app.kubernetes.io/component: dispatcher instead of gateway. This would prevent the LoadBalancer from routing traffic to pods. Consider filing a follow-up issue to fix this.
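The selector fix flagged above would be a one-line change along these lines. A sketch: the surrounding service fields are assumptions, and only the selector label comes from the comment.

```yaml
# charts/peerbot/templates/service-loadbalancer.yaml (sketch)
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/component: gateway  # was: dispatcher, which matched no pods
```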

Agent: 🏛 Architecture


@buremba buremba merged commit 9d48358 into main Jan 24, 2026
14 of 20 checks passed
