feat: migrate gateway to Hono and remove Express from worker #94
Conversation
The previous Redis chart version (20.x) was using image tags that no longer exist on Docker Hub (7.4.3-debian-12-r0). Updated to v24.x which uses Redis 8.4.0 with current image tags.
Add trace ID propagation through the entire message pipeline for
observability in Grafana/Loki:
- Add trace.ts with generateTraceId() and extractTraceId() utilities
- Generate traceId at WhatsApp message ingestion
- Propagate traceId through queue processing and Sentry scope
- Include traceId in worker token and TRACE_ID env var
- Add traceId to K8s deployment/PVC annotations and logs
- Log responses with traceId in WhatsApp renderer
- Use traceId throughout worker SSE client lifecycle logs
- Add JSON log format option (LOG_FORMAT=json) for Loki parsing
- Create Grafana Loki datasource ConfigMap template
Trace format: tr-{messageId8}-{timestamp36}-{random4}
Example queries:
{namespace="peerbot"} |= "tr-abc12345"
{namespace="peerbot"} | json | traceId="tr-abc12345-lx4k-a3b2"
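A minimal sketch of what the `trace.ts` utilities for the `tr-{messageId8}-{timestamp36}-{random4}` format could look like. The actual implementation is not shown in this PR, so the function bodies, the character filtering, and the regex are assumptions derived from the stated format:

```typescript
// Hypothetical sketch of the trace ID scheme described above.
import { randomBytes } from "node:crypto";

export function generateTraceId(messageId: string): string {
  // First 8 alphanumeric characters of the message ID.
  const idPart = messageId.replace(/[^a-zA-Z0-9]/g, "").slice(0, 8);
  // Current time encoded in base 36 (compact, lowercase).
  const timePart = Date.now().toString(36);
  // 4 random hex characters for uniqueness within the same millisecond.
  const randPart = randomBytes(2).toString("hex");
  return `tr-${idPart}-${timePart}-${randPart}`;
}

export function extractTraceId(text: string): string | null {
  // Matches the assumed tr-{id}-{base36 time}-{4 chars} shape in a log line.
  const match = text.match(/tr-[a-zA-Z0-9]{1,8}-[a-z0-9]+-[a-z0-9]{4}/);
  return match ? match[0] : null;
}
```

With IDs in this shape, the Loki queries above can match either on the raw substring (`|= "tr-abc12345"`) or on the parsed `traceId` label after `| json`.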
- Replace Express with Hono and @hono/zod-openapi for type-safe APIs
- Rename sessions API to agents (/api/v1/agents/*)
- Add settings page with magic link authentication
- Add scheduled wakeup service for agent reminders
- Add channel binding service for multi-platform support
- Remove deprecated packages/github (merged into git-filesystem module)
- Add K8s observability: ServiceMonitor, Grafana dashboard
- Add gVisor installer and SealedSecrets for production security
Replace Express with native Node.js http module for MCP SSE server. The MCP SDK's SSEServerTransport only needs standard http.ServerResponse, so Express was unnecessary overhead.
- Remove express and cors dependencies from worker
- Rewrite startHTTPServer to use native http.createServer
- Handle CORS headers and JSON parsing manually
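A sketch of what such an Express-free server can look like. This is not the actual worker code: the route path, handler names, and header set are assumptions; only the general approach (manual CORS headers plus `http.createServer`) comes from the commit message:

```typescript
// Illustrative sketch: serving an SSE endpoint with node:http instead of Express.
import http from "node:http";

export function applyCorsHeaders(res: http.ServerResponse): void {
  // Replaces the cors() middleware by setting the headers manually.
  res.setHeader("Access-Control-Allow-Origin", "*");
  res.setHeader("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
  res.setHeader("Access-Control-Allow-Headers", "Content-Type, Authorization");
}

export function startHTTPServer(port: number): http.Server {
  const server = http.createServer((req, res) => {
    applyCorsHeaders(res);
    if (req.method === "OPTIONS") {
      // Preflight: no body needed.
      res.writeHead(204).end();
      return;
    }
    if (req.method === "GET" && req.url === "/sse") {
      // SSEServerTransport only needs a plain http.ServerResponse with
      // the event-stream headers set; no framework required.
      res.writeHead(200, {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache",
        Connection: "keep-alive",
      });
      return;
    }
    res.writeHead(404, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ error: "not found" }));
  });
  server.listen(port);
  return server;
}
```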
Claude finished @buremba's task

Code Review: Express to Hono Migration
Comprehensive review of this major architectural migration from Express to Hono framework.

Review Checklist

Overview
This is an exceptionally well-executed migration that demonstrates thorough architectural planning and attention to detail. The Express to Hono migration brings significant benefits while maintaining backward compatibility through clever adapter patterns.

✅ Strengths
1. Framework Migration Quality
2. API Design Improvements
3. Security Enhancements
4. Observability Excellence
5. Deployment Flexibility
Cursor Bugbot has reviewed your changes and found 1 potential issue.
{{- else }}
emptyDir: {}
{{- end }}
{{- end }}
Missing PVC template for git-cache volume
High Severity
The gateway deployment references a PersistentVolumeClaim named {{ include "peerbot.fullname" . }}-git-cache when gitCache.persistence.enabled is true, but no corresponding PVC template exists in the Helm chart to create this resource. When someone enables gitCache.enabled: true with gitCache.persistence.enabled: true in their values, the deployment will fail with a "persistentvolumeclaim not found" error because the PVC is never created.
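A minimal sketch of the missing PVC template. The resource name and the `gitCache.*` values keys come from the finding above; the `peerbot.labels` helper, access mode, and size default are assumptions about the chart's conventions:

```yaml
# templates/gateway/git-cache-pvc.yaml (sketch; names/values assumed)
{{- if and .Values.gitCache.enabled .Values.gitCache.persistence.enabled }}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ include "peerbot.fullname" . }}-git-cache
  labels:
    {{- include "peerbot.labels" . | nindent 4 }}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: {{ .Values.gitCache.persistence.size | default "10Gi" }}
{{- end }}
```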
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 21fee093a5
RESPONSE=$(curl -s -X POST http://localhost:8080/api/messaging/send \
  -H "Authorization: Bearer $AUTH_TOKEN" \
Update test script to use new messaging endpoint
The gateway now registers the messaging API at /api/v1/messaging/send (see packages/gateway/src/routes/public/messaging.ts), but this script still posts to /api/messaging/send, which will 404 after the migration. Even if it were routed, the new handler’s response only returns messageId/eventsUrl/queued, while the script later expects .channel and .threadId, so it won’t be able to track thread replies. This makes ./scripts/test-bot.sh fail in the normal post‑migration workflow.
Mesa Description

TL;DR
Migrated gateway from Express to Hono with OpenAPI/Zod validation, renamed sessions API to agents (/api/v1/agents/*).

What changed?

Gateway Framework Migration
API Updates
New Features
Worker Package
Kubernetes & Infrastructure
Package Cleanup
Description generated by Mesa.
Performed full review of 9d48358...21fee09
Analysis
- Zod Version Mismatch: Critical version conflict between gateway (Zod v4.1.12) and worker (Zod v3.24.1) may cause runtime type mismatches for shared types and schemas, particularly through @peerbot/core.
- Breaking API Changes: Sessions endpoints have been renamed to agents (/api/v1/agents/*) without a backward compatibility layer or deprecation warnings, potentially breaking existing integrations.
- Express Framework Removal: Migration from Express to Hono in the gateway and removal of Express from the worker (replaced with native Node.js http) affects critical request handling paths and requires thorough testing, particularly for the MCP SSE transport.
- Environment Variable Rename: The change from MODEL to AGENT_DEFAULT_MODEL in the K8s deployment may break existing configurations without proper migration.
- New Services Integration: Newly added scheduled wakeup service (using Redis+BullMQ) and channel binding service require production validation, especially the cron-based task scheduler.
  "form-data": "^4.0.4",
- "zod": "^4.1.12"
+ "zod": "^3.24.1"
Critical version mismatch: Worker uses Zod v3.24.1 while Gateway uses v4.1.12. This creates a major version incompatibility that could cause runtime type errors, especially since:
- Worker imports Zod in multiple files (sse-client.ts, custom-tools.ts, mcp-server.ts)
- Packages share types through @peerbot/core which may use Zod schemas
- Gateway-worker communication may serialize/deserialize Zod-validated objects
Zod v3 to v4 includes breaking changes in schema definitions and validation behavior. Recommend aligning both packages to use the same Zod version (preferably v4.1.12 to match gateway) or carefully audit all shared type boundaries.
File: packages/worker/package.json#L42
const intervalMs = second.getTime() - first.getTime();
const intervalMinutes = intervalMs / (60 * 1000);
if (intervalMinutes < MIN_CRON_INTERVAL_MINUTES) {
  return {
The cron validation logic calculates the interval between first and second occurrences, but this doesn't catch edge cases where intervals vary (e.g., "0 0 1 * *" runs monthly with different day counts). Consider:
- Checking multiple intervals (first 3-4 occurrences) to detect varying intervals
- Warning users about variable-interval cron expressions
- Using the minimum observed interval for validation rather than just the first interval
Example problematic cron: "0 0 1,15 * *" has 14-16 day intervals that alternate.
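The suggested fix can be sketched as a pure check over the first few occurrence times. This is an illustrative sketch, not the project's code: the occurrence list would come from a cron library (e.g. cron-parser, assumed), and the threshold constant is taken from the snippet above:

```typescript
// Sketch: validate the MINIMUM gap across several occurrences, so that
// variable-interval expressions like "0 0 1,15 * *" are judged by their
// shortest gap, not just the first one.
const MIN_CRON_INTERVAL_MINUTES = 5; // assumed threshold

export function validateCronIntervals(
  occurrences: Date[] // next N trigger times, ascending (N = 3-4 suggested)
): { valid: boolean; minIntervalMinutes: number } {
  let minIntervalMinutes = Number.POSITIVE_INFINITY;
  for (let i = 1; i < occurrences.length; i++) {
    const gapMinutes =
      (occurrences[i].getTime() - occurrences[i - 1].getTime()) / 60_000;
    minIntervalMinutes = Math.min(minIntervalMinutes, gapMinutes);
  }
  return {
    valid: minIntervalMinutes >= MIN_CRON_INTERVAL_MINUTES,
    minIntervalMinutes,
  };
}
```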
File: packages/gateway/src/orchestration/scheduled-wakeup.ts#L293
  schedule.cron
) {
  try {
    // Calculate next trigger from cron
The cron next trigger calculation happens AFTER the current iteration is processed, but doesn't account for the processing time. If the job processing takes longer than the cron interval (e.g., processing takes 6 min but interval is 5 min), the calculated delay could be negative or very short, causing immediate re-triggering or skipped iterations. Consider:
1. Using `Date.now()` instead of relying on schedule's triggerAt for the next calculation
2. Adding validation that `delayMs > 0` and skipping to the following trigger if behind schedule
3. Warning when execution duration exceeds the cron interval
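Points 1 and 2 of that suggestion can be sketched as a small helper. This is illustrative, not the project's code; the `nextTriggers` input and function name are assumptions:

```typescript
// Sketch: compute the next delay from the real clock, skipping any triggers
// that have already passed while the job was running.
export function nextDelayMs(
  nextTriggers: Date[], // upcoming trigger times, ascending
  nowMs: number = Date.now()
): number | null {
  for (const trigger of nextTriggers) {
    const delayMs = trigger.getTime() - nowMs;
    if (delayMs > 0) {
      return delayMs; // first trigger still in the future
    }
    // delayMs <= 0: behind schedule, skip to the following trigger.
  }
  return null; // every known trigger is already in the past
}
```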
File: packages/gateway/src/orchestration/scheduled-wakeup.ts#L573
| } | ||
|
|
||
| function resolveExecCwd(baseDir: string, requested?: string): string | null { | ||
| try { |
The exec cwd validation uses fs.realpathSync() which will throw if the path doesn't exist yet. This means users cannot specify a cwd that they intend to create during command execution. Consider:
1. Using `fs.existsSync()` + path normalization for non-existent paths
2. Still resolving symlinks for existing paths to prevent escapes
3. Performing pure path-based containment checks for non-existent directories
Example: User wants to run `mkdir newdir && cd newdir && npm init` with cwd="newdir"
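A sketch of the suggested behavior, assuming POSIX paths. This is not the actual `resolveExecCwd`; it only illustrates the existing-path/non-existing-path split described above:

```typescript
// Sketch: symlink-resolve paths that exist, fall back to a pure path-based
// containment check for paths that will be created later.
import fs from "node:fs";
import path from "node:path";

export function resolveExecCwd(
  baseDir: string,
  requested?: string
): string | null {
  const base = path.resolve(baseDir);
  const candidate = path.resolve(base, requested ?? ".");
  // Existing paths: resolve symlinks so a link cannot escape the base dir.
  // Non-existent paths: realpathSync would throw, so use the normalized path.
  const real = fs.existsSync(candidate) ? fs.realpathSync(candidate) : candidate;
  const rel = path.relative(base, real);
  const contained =
    rel === "" || (!rel.startsWith("..") && !path.isAbsolute(rel));
  return contained ? candidate : null;
}
```

A stricter version would also realpath-resolve `base` itself before comparing; that is omitted here for brevity.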
File: packages/gateway/src/routes/public/agent.ts#L279
logger.info(`SSE connection closed for agent ${agentId}`);
});

while (true) {
The SSE stream has an infinite loop with while (true) that only exits on abort. This keeps the stream alive but the heartbeat at line 860 could fail silently and the connection would appear alive but be broken. Consider:
- Tracking heartbeat failures and closing the connection after N consecutive failures
- Adding connection timeout if no client acknowledgment is received
- Implementing proper keepalive with client-side ping/pong
This prevents zombie connections that consume resources but can't actually deliver events.
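The first suggestion (counting consecutive heartbeat failures) can be sketched as follows. Everything here is illustrative: the callback shapes, the failure threshold, and the 15-second period are assumptions, not the gateway's actual code:

```typescript
// Sketch: replace the unconditional `while (true)` with a loop that tracks
// consecutive heartbeat failures and closes the stream after a threshold.
const MAX_HEARTBEAT_FAILURES = 3; // assumed threshold

export async function runHeartbeatLoop(
  writeHeartbeat: () => Promise<boolean>, // false = the SSE write failed
  isAborted: () => boolean, // true once the client disconnects
  sleepMs: (ms: number) => Promise<void>
): Promise<"aborted" | "dead"> {
  let consecutiveFailures = 0;
  while (!isAborted()) {
    const ok = await writeHeartbeat();
    consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
    if (consecutiveFailures >= MAX_HEARTBEAT_FAILURES) {
      return "dead"; // close instead of keeping a zombie connection
    }
    await sleepMs(15_000); // assumed heartbeat period
  }
  return "aborted";
}
```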
File: packages/gateway/src/routes/public/agent.ts#L880
| } | ||
| if (config.command) { | ||
| const dangerousCommands = [ | ||
| "rm", |
The dangerous commands list is too restrictive and incomplete. Issues:
1. `sh`, `bash`, `zsh` are blocked but users legitimately need shells for MCP servers (e.g., Python scripts with shell wrappers)
2. Missing dangerous patterns like `eval`, `exec`, command substitution `$(...)`, pipe to shell `| sh`
3. Only checks the basename, so `/usr/bin/curl` bypasses the block
4. `curl` and `wget` are legitimate for HTTP-based MCP servers
Recommend:
- Remove shell and curl/wget from the blocklist
- Rely on sandboxing (gVisor, namespaces) for security
- OR implement more sophisticated command parsing to detect injection patterns
File: packages/gateway/src/routes/public/agent.ts#L234
  req: import("node:http").IncomingMessage
): Promise<unknown> {
  return new Promise((resolve, reject) => {
    let body = "";
The JSON parsing does not enforce a maximum body size, making this endpoint vulnerable to memory exhaustion attacks. Consider adding:
1. A size limit check using the `content-length` header
2. A streaming body size counter that rejects oversized requests
3. A reasonable limit like 10MB for MCP messages
Example attack: POST with gigabytes of JSON data could crash the worker process.
File: packages/worker/src/mcp/mcp-server.ts#L436
| </div> | ||
|
|
||
| <script> | ||
| const token = ${JSON.stringify(token)}; |
The settings token is being embedded directly into the page JavaScript without sanitization. While JSON.stringify() provides basic escaping, this could be vulnerable to XSS if the token contains malicious content that breaks out of the string context. Consider:
1. Using a more robust escaping function specifically designed for embedding in `<script>` tags
2. Passing the token via a `data-` attribute on a DOM element and reading it with JavaScript
3. Setting the token via a separate `<script>` tag that executes before the main script
Example attack vector: If token generation ever includes unescaped user input, an attacker could inject `"</script><script>alert(1)</script>`
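The first suggestion is a few lines: escape the characters that can terminate a `<script>` block on top of `JSON.stringify`'s normal string escaping. A sketch (the function name is an assumption):

```typescript
// Sketch: make a JSON value safe to inline inside a <script> tag by escaping
// angle brackets and the JS line separators, so "</script>" can never appear
// literally in the emitted markup.
export function toScriptSafeJson(value: unknown): string {
  return JSON.stringify(value)
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}
```

The `\uXXXX` escapes are valid inside JSON string literals, so the embedded value still parses back to the original string while the page source contains no `</script>` sequence.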
File: packages/gateway/src/routes/public/settings-page.ts#L379
Summary
- Sessions API renamed to agents (/api/v1/agents/*)
- packages/github removed (merged into git-filesystem module)

Test plan
- Exercise the renamed agents endpoints (/api/v1/agents/*)

🤖 Generated with Claude Code
Note
High Risk
High risk due to Kubernetes/Helm changes (new `tempo` dependency, security policy templates, gVisor installer, and gateway env/volume wiring) plus significant Dockerfile and dependency graph adjustments (removing `packages/github`, adding Nix to the worker image), which can break builds or deployments if misconfigured.

Overview
Adds production-oriented observability and security tooling to the Helm chart: upgrades the Redis subchart, introduces optional `tempo` tracing plus Grafana datasource/dashboard ConfigMaps, and adds optional OPA Gatekeeper constraint templates/constraints and an experimental gVisor installer/RuntimeClass.

Updates the gateway deployment spec to wire in optional `TEMPO_ENDPOINT`, introduce git cache/GitHub App credentials env vars + volume mounts, and renames the model env var from `MODEL` to `AGENT_DEFAULT_MODEL`.

Refactors build/runtime packaging: removes `packages/github` from Docker builds/workspaces and `bun.lock`, drops Express-related deps from the worker, and enhances the worker base image with `gh` and a single-user Nix install. Documentation/Makefile are updated to reflect the new dev workflow and secret-handling guidance.

Written by Cursor Bugbot for commit 21fee09. This will update automatically on new commits.