Fix bot silent failures: broken logging and missing error handling#17
Fix bot silent failures: broken logging and missing error handling#17manana2520 merged 2 commits intomainfrom
Conversation
Root cause: Settings() triggers logging.warning() during import, which
implicitly calls basicConfig() with default level=WARNING. Our explicit
basicConfig(level=INFO) in main() was a no-op, silently dropping all
INFO-level logs. This made debugging impossible.
Fixes:
- Use force=True in basicConfig() to ensure INFO-level logging works
- Add try/except around say("Thinking...") with fallback to chat.postMessage
- Add exc_info=True to all error logs for stack traces
- Add INFO logs at every step in _handle_question for observability
There was a problem hiding this comment.
LGTM! This Pull Request delivers crucial improvements to the bot's reliability and observability, addressing a significant production issue. The fix for logging.basicConfig using force=True correctly resolves the silent dropping of INFO-level logs, which was a critical root cause. The added try/except blocks with fallbacks around say("Thinking...") and the final answer posting ensure the bot is much more resilient to Slack API failures. Furthermore, the strategic addition of exc_info=True to error logs and INFO-level logs at each step of _handle_question will be invaluable for diagnosing future issues. The changes align well with the project's architecture and introduce no new security concerns. Excellent work!
There was a problem hiding this comment.
🔒 Security Review
Security Score: D | 2 HIGH, 1 INFO, 1 LOW, 2 MEDIUM
Security Review Summary
Decision: REQUEST_CHANGES
This PR introduces valuable error handling and observability improvements to the Slack bot, but introduces two HIGH-severity PII logging violations that conflict with the repository's security context classification of Slack user_id and channel information as HIGH-sensitivity data.
Critical Issues
-
HIGH - PII Logging (Lines 572, 754): The PR logs Slack
user_idandchannelidentifiers at INFO level in production environments. Per the security context, Slack user IDs and channel information are classified as HIGH sensitivity. Logging these in production is visible to log aggregation systems and violates the principle of least privilege for sensitive data. -
MEDIUM - Unvalidated Error Messages: Exception messages from Slack API are logged without sanitization, potentially exposing API error details, rate-limiting information, or internal error codes.
-
MEDIUM - Inconsistent Null Safety: The code uses
.get('ts')in some places but bracket notation['ts']in others (line 651), creating KeyError risks if message structure is unexpected. -
LOW - Logging Format Regression: Timestamp was removed from the logging format, reducing observability in local and aggregated logging scenarios.
Positive Security Aspects
✓ The force=True parameter correctly addresses the basicConfig() override issue
✓ exc_info=True is appropriately added for stack trace visibility
✓ Error handling with try/except blocks prevents silent failures
✓ No credentials, API keys, or SLACK_SIGNING_SECRET are logged
✓ The SecretRedactingFilter() is still properly registered
Recommendations
- Required: Remove
user_idandchannelfrom INFO-level logs; use DEBUG level if needed for development - Required: Change line 754 logging to remove
user_id - Recommended: Sanitize exception messages before logging to avoid exposing API internals
- Recommended: Add null-safety validation for
thinking_msg['ts']access - Recommended: Restore timestamp to logging format or document the rationale
After addressing the HIGH-severity PII logging findings, this PR will be security-approved.
Findings
🟠 [HIGH] PII logging: Slack user_id and channel logged at INFO level in production
File: src/knowledge_base/slack/bot.py:line 572
Category: A09_logging_monitoring
Impact: Slack user identifiers and channel information are sensitive internal metadata that could be used to correlate user activity, identify specific individuals, or map organizational communication patterns if logs are compromised or exposed
Recommendation: Remove user_id and channel from the INFO-level log. Replace with: logger.info(f"Processing question from user thread") If debugging needs user context, use DEBUG level instead: logger.debug(f"Processing question from user={user_id} channel={channel}...") This respects the security context that these are HIGH-classification PII
🟠 [HIGH] PII logging: Slack user_id logged at INFO level in feedback modal operation
File: src/knowledge_base/slack/bot.py:line 754
Category: A09_logging_monitoring
Impact: Slack user_id is classified as HIGH sensitivity (per security context: 'Slack user IDs and usernames...stored in Neo4j episode metadata'). Logging at INFO level in production exposes this PII to log aggregation systems, audit trails, and potential security incidents
Recommendation: Remove user_id from INFO-level logging. Change to: logger.info(f"Opened {feedback_type} feedback modal") or use DEBUG level for PII: logger.debug(f"Opened {feedback_type} feedback modal for user {user_id}")
🟡 [MEDIUM] Potential information disclosure: Slack API error messages logged without sanitization
File: src/knowledge_base/slack/bot.py:line 603
Category: A09_logging_monitoring
Impact: Exception messages from Slack API calls (e.g., logger.error(f"Failed to post 'Thinking...' message: {e}", exc_info=True)) may contain API responses with internal Slack error codes, rate-limit information, or other operational details useful to attackers
Recommendation: Sanitize exception messages before logging: Create a helper function that strips API response bodies and logs only error type/code. Example: logger.error("Failed to post thinking message", exc_info=True) and let the stack trace provide debugging context without exposing API internals
🟡 [MEDIUM] Log message timestamp field access is unvalidated
File: src/knowledge_base/slack/bot.py:line 584
Category: A09_logging_monitoring
Impact: Code assumes thinking_msg.get('ts') will always return a valid value. If Slack API returns unexpected response format, unknown is logged, but in fallback paths the code accesses thinking_msg['ts'] with bracket notation (line 651) without null-safety checks, risking KeyError exceptions and incomplete error handling
Recommendation: Add explicit validation: ts = thinking_msg.get('ts'); if not ts: logger.error(...); return before using ts in subsequent message updates. Ensure consistent use of .get() with defaults throughout
🟢 [LOW] Logging format changed from descriptive to abbreviated format
File: src/knowledge_base/slack/bot.py:line 1015
Category: A05_misconfiguration
Impact: New format "%(levelname)s:%(name)s:%(message)s" removes timestamp and makes logs less useful for incident response and forensics. Changed from "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
Recommendation: Restore original timestamp-inclusive format or justify the change in commit message. Cloud Run logs have automatic timestamps, but local development and log aggregation systems benefit from inline timestamps
🔵 [INFO] Startup logging at INFO level may be visible in logs before authentication context is available
File: src/knowledge_base/slack/bot.py:line 1019
Category: A09_logging_monitoring
Impact: Log message 'Slack bot starting' appears early in startup, which is acceptable, but good practice suggests INFO-level startup messages may be verbose if logs are aggregated across many instances
Recommendation: Consider using DEBUG level for startup messages, or ensure log level is configurable via environment variable for different deployment environments
OWASP Top 10 Checklist
| Category | Status |
|---|---|
| A01 Access Control | ✅ PASS |
| A02 Crypto Failures | ✅ PASS |
| A03 Injection | ✅ PASS |
| A04 Insecure Design | ✅ PASS |
| A05 Misconfiguration | ❌ FAIL |
| A06 Vulnerable Components | ➖ N/A |
| A07 Auth Failures | ✅ PASS |
| A08 Integrity Failures | ✅ PASS |
| A09 Logging Monitoring | ❌ FAIL |
| A10 Ssrf | ➖ N/A |
🤖 Security review powered by Claude
Root cause: Settings() triggers logging.warning() during import, which
implicitly calls basicConfig() with default level=WARNING. Our explicit
basicConfig(level=INFO) in main() was a no-op, silently dropping all
INFO-level logs. This made debugging impossible.
Fixes:
- Use force=True in basicConfig() to ensure INFO-level logging works
- Add try/except around say("Thinking...") with fallback to chat.postMessage
- Add exc_info=True to all error logs for stack traces
- Add INFO logs at every step in _handle_question for observability
- Extract thinking_ts with .get() and validate before use
- Remove PII (user_id, channel) from INFO-level logs
There was a problem hiding this comment.
LGTM! This PR effectively addresses critical issues around silent failures and lack of observability in the bot.
Key improvements:
- Logging Fix: The root cause of
basicConfig()being silently overridden is correctly identified and fixed by addingforce=Trueto the explicitbasicConfig()call. This ensures INFO-level logs are no longer dropped. - Robust Error Handling: The
say("Thinking...")call now has comprehensivetry/exceptblocks, including a fallback mechanism toclient.chat_postMessageif the initialsayfails. All error logs now correctly includeexc_info=Truefor stack trace visibility. - Improved Observability: The addition of INFO-level logs at various stages of
_handle_questionprovides much-needed visibility into the bot's processing flow, which will greatly aid debugging. - Edge Case Handling: The explicit check for
thinking_msg.get("ts")before attempting to update or use the timestamp for feedback is a great touch, preventing potentialKeyErrorif the initial message post fails to return a timestamp.
This is a well-thought-out and executed PR that significantly improves the stability and diagnosability of the Slack bot.
There was a problem hiding this comment.
🔒 Security Review
Security Score: A | 2 INFO
Security Review Summary
Decision: APPROVE
Security Score: A (No findings or only INFO level)
Key Findings
✅ No Critical or High Severity Issues Detected
This PR successfully improves security posture while fixing logging infrastructure:
Positive Security Changes
-
Improved Error Handling (A09 - Logging & Monitoring)
- Adds
exc_info=Trueto all error logs, enabling stack trace visibility without logging credentials - Adds try/except wrapper around
say("Thinking...")with fallback toclient.chat_postMessage() - Prevents silent failures that were masked by logging configuration bug
- Adds
-
Enhanced Observability Without PII Leakage
- Line 572: Logs truncated question text (80 chars) - acceptable balance between debugging and privacy
- Line 759: Removes user_id from feedback modal log - positive improvement, prevents Slack user ID (PII) leakage
- Line 584, 602, 616, etc.: Strategic INFO logs at key decision points without exposing sensitive content
-
Correct Logging Configuration Fix
- Line 1019: Adds
force=Truetologging.basicConfig()with clear inline explanation - Properly resolves the root cause where
Settings()import triggered implicitbasicConfig()withlevel=WARNING - This is the correct fix for Python's logging idempotency behavior
- Line 1019: Adds
-
Safe Fallback Patterns
- Line 581-595: Two-level fallback for posting "Thinking..." message handles both
say()and directclient.chat_postMessage()failure modes - Both paths log errors with stack traces for troubleshooting
- Line 581-595: Two-level fallback for posting "Thinking..." message handles both
OWASP Compliance
- A01 (Access Control): PASS - No privilege escalation or unauthorized operations
- A02 (Cryptographic Failures): PASS - No credential exposure; Slack token remains protected
- A03 (Injection): PASS - No new injection vectors; error messages don't eval user input
- A04 (Insecure Design): PASS - Adds missing error handling; improves system resilience
- A05 (Misconfiguration): PASS - Logging configuration is now correct;
force=Trueproperly justified - A06 (Vulnerable Components): N/A - No dependency changes
- A07 (Authentication Failures): PASS - No changes to auth mechanism; Slack HMAC verification untouched
- A08 (Integrity Failures): PASS - No data corruption vectors; validation logic unchanged
- A09 (Logging & Monitoring): PASS - Adds strategic INFO logs without PII; removes user_id from one log line (security improvement); error logs include stack traces
- A10 (SSRF): PASS - No new URL handling or redirects
Repository-Specific Security Checks
✅ SLACK_BOT_TOKEN not logged at any level
✅ SLACK_SIGNING_SECRET verification not modified
✅ Slack user PII (user_id) removed from one log, truncated in another
✅ No LLM prompt contents or internal documentation logged
✅ No new credential hardcoding
✅ No Neo4j credentials exposed
✅ No ANTHROPIC_API_KEY exposure
Recommendation
Approve this PR. The changes fix a legitimate logging infrastructure bug that was masking production issues, improve error handling resilience, and actually enhance security by removing user ID from feedback logs. All changes follow secure logging principles per the repository's security context.
Findings
🔵 [INFO] Logging truncated user question text at INFO level
File: src/knowledge_base/slack/bot.py:line 572
Category: A09_logging_monitoring
Impact: Question content is truncated to 80 chars, limiting context exposure while preserving observability for debugging
Recommendation: Current implementation is acceptable. The truncation ([:80]) provides visibility without exposing full user input. No change needed.
🔵 [INFO] Removed user_id from log message in _handle_feedback_action
File: src/knowledge_base/slack/bot.py:line 759
Category: A09_logging_monitoring
Impact: Prevents Slack user ID (PII) from being logged at INFO level, improving privacy posture
Recommendation: This is a positive security improvement. No action required.
OWASP Top 10 Checklist
| Category | Status |
|---|---|
| A01 Access Control | ✅ PASS |
| A02 Crypto Failures | ✅ PASS |
| A03 Injection | ✅ PASS |
| A04 Insecure Design | ✅ PASS |
| A05 Misconfiguration | ✅ PASS |
| A06 Vulnerable Components | ➖ N/A |
| A07 Auth Failures | ✅ PASS |
| A08 Integrity Failures | ✅ PASS |
| A09 Logging Monitoring | ✅ PASS |
| A10 Ssrf | ✅ PASS |
🤖 Security review powered by Claude
Summary
logging.basicConfig()being silently overridden by implicit call duringSettings()import (usesforce=True)say("Thinking...")call which previously had no error handlingexc_info=Trueto all error logs for stack trace visibility_handle_questionfor full observabilityRoot Cause
Settings()inconfig.pytriggerslogging.warning("SECURITY WARNING: ADMIN_PASSWORD...")during module import. This implicitly callsbasicConfig()with defaultlevel=WARNING. Our explicitbasicConfig(level=INFO)inmain()was then a no-op (Python skips it if root logger already has handlers), silently dropping all INFO-level logs from the bot code.This made it impossible to diagnose why the production bot was processing queries (search + LLM) but not posting responses back to Slack -- no logs were visible.
Test plan