Skip to content

Fix bot silent failures: broken logging and missing error handling#17

Merged
manana2520 merged 2 commits intomainfrom
fix/bot-logging-and-reliability
Feb 22, 2026
Merged

Fix bot silent failures: broken logging and missing error handling#17
manana2520 merged 2 commits intomainfrom
fix/bot-logging-and-reliability

Conversation

@manana2520
Copy link
Contributor

Summary

  • Fix logging.basicConfig() being silently overridden by implicit call during Settings() import (uses force=True)
  • Add try/except with fallback around say("Thinking...") call which previously had no error handling
  • Add exc_info=True to all error logs for stack trace visibility
  • Add INFO-level logging at every step in _handle_question for full observability

Root Cause

Settings() in config.py triggers logging.warning("SECURITY WARNING: ADMIN_PASSWORD...") during module import. This implicitly calls basicConfig() with default level=WARNING. Our explicit basicConfig(level=INFO) in main() was then a no-op (Python skips it if root logger already has handlers), silently dropping all INFO-level logs from the bot code.

This made it impossible to diagnose why the production bot was processing queries (search + LLM) but not posting responses back to Slack -- no logs were visible.

Test plan

  • CI passes (unit tests, AI reviewer, security reviewer)
  • After deploy to staging: send message to staging bot, verify INFO logs appear in Cloud Run logs
  • Verify "Thinking..." message appears, answer replaces it, feedback buttons posted
  • If any Slack API call fails, error + stack trace should be visible in logs

Root cause: Settings() triggers logging.warning() during import, which
implicitly calls basicConfig() with default level=WARNING. Our explicit
basicConfig(level=INFO) in main() was a no-op, silently dropping all
INFO-level logs. This made debugging impossible.

Fixes:
- Use force=True in basicConfig() to ensure INFO-level logging works
- Add try/except around say("Thinking...") with fallback to chat.postMessage
- Add exc_info=True to all error logs for stack traces
- Add INFO logs at every step in _handle_question for observability
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This Pull Request delivers crucial improvements to the bot's reliability and observability, addressing a significant production issue. The fix for logging.basicConfig using force=True correctly resolves the silent dropping of INFO-level logs, which was a critical root cause. The added try/except blocks with fallbacks around say("Thinking...") and the final answer posting ensure the bot is much more resilient to Slack API failures. Furthermore, the strategic addition of exc_info=True to error logs and INFO-level logs at each step of _handle_question will be invaluable for diagnosing future issues. The changes align well with the project's architecture and introduce no new security concerns. Excellent work!

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security Review

Security Score: D | 2 HIGH, 1 INFO, 1 LOW, 2 MEDIUM

Security Review Summary

Decision: REQUEST_CHANGES

This PR introduces valuable error handling and observability improvements to the Slack bot, but introduces two HIGH-severity PII logging violations that conflict with the repository's security context classification of Slack user_id and channel information as HIGH-sensitivity data.

Critical Issues

  1. HIGH - PII Logging (Lines 572, 754): The PR logs Slack user_id and channel identifiers at INFO level in production environments. Per the security context, Slack user IDs and channel information are classified as HIGH sensitivity. Logging these in production is visible to log aggregation systems and violates the principle of least privilege for sensitive data.

  2. MEDIUM - Unvalidated Error Messages: Exception messages from Slack API are logged without sanitization, potentially exposing API error details, rate-limiting information, or internal error codes.

  3. MEDIUM - Inconsistent Null Safety: The code uses .get('ts') in some places but bracket notation ['ts'] in others (line 651), creating KeyError risks if message structure is unexpected.

  4. LOW - Logging Format Regression: Timestamp was removed from the logging format, reducing observability in local and aggregated logging scenarios.

Positive Security Aspects

✓ The force=True parameter correctly addresses the basicConfig() override issue
exc_info=True is appropriately added for stack trace visibility
✓ Error handling with try/except blocks prevents silent failures
✓ No credentials, API keys, or SLACK_SIGNING_SECRET are logged
✓ The SecretRedactingFilter() is still properly registered

Recommendations

  1. Required: Remove user_id and channel from INFO-level logs; use DEBUG level if needed for development
  2. Required: Change line 754 logging to remove user_id
  3. Recommended: Sanitize exception messages before logging to avoid exposing API internals
  4. Recommended: Add null-safety validation for thinking_msg['ts'] access
  5. Recommended: Restore timestamp to logging format or document the rationale

After addressing the HIGH-severity PII logging findings, this PR will be security-approved.

Findings

🟠 [HIGH] PII logging: Slack user_id and channel logged at INFO level in production

File: src/knowledge_base/slack/bot.py:line 572
Category: A09_logging_monitoring
Impact: Slack user identifiers and channel information are sensitive internal metadata that could be used to correlate user activity, identify specific individuals, or map organizational communication patterns if logs are compromised or exposed
Recommendation: Remove user_id and channel from the INFO-level log. Replace with: logger.info(f"Processing question from user thread") If debugging needs user context, use DEBUG level instead: logger.debug(f"Processing question from user={user_id} channel={channel}...") This respects the security context that these are HIGH-classification PII

🟠 [HIGH] PII logging: Slack user_id logged at INFO level in feedback modal operation

File: src/knowledge_base/slack/bot.py:line 754
Category: A09_logging_monitoring
Impact: Slack user_id is classified as HIGH sensitivity (per security context: 'Slack user IDs and usernames...stored in Neo4j episode metadata'). Logging at INFO level in production exposes this PII to log aggregation systems, audit trails, and potential security incidents
Recommendation: Remove user_id from INFO-level logging. Change to: logger.info(f"Opened {feedback_type} feedback modal") or use DEBUG level for PII: logger.debug(f"Opened {feedback_type} feedback modal for user {user_id}")

🟡 [MEDIUM] Potential information disclosure: Slack API error messages logged without sanitization

File: src/knowledge_base/slack/bot.py:line 603
Category: A09_logging_monitoring
Impact: Exception messages from Slack API calls (e.g., logger.error(f"Failed to post 'Thinking...' message: {e}", exc_info=True)) may contain API responses with internal Slack error codes, rate-limit information, or other operational details useful to attackers
Recommendation: Sanitize exception messages before logging: Create a helper function that strips API response bodies and logs only error type/code. Example: logger.error("Failed to post thinking message", exc_info=True) and let the stack trace provide debugging context without exposing API internals

🟡 [MEDIUM] Log message timestamp field access is unvalidated

File: src/knowledge_base/slack/bot.py:line 584
Category: A09_logging_monitoring
Impact: Code assumes thinking_msg.get('ts') will always return a valid value. If Slack API returns unexpected response format, unknown is logged, but in fallback paths the code accesses thinking_msg['ts'] with bracket notation (line 651) without null-safety checks, risking KeyError exceptions and incomplete error handling
Recommendation: Add explicit validation: ts = thinking_msg.get('ts'); if not ts: logger.error(...); return before using ts in subsequent message updates. Ensure consistent use of .get() with defaults throughout

🟢 [LOW] Logging format changed from descriptive to abbreviated format

File: src/knowledge_base/slack/bot.py:line 1015
Category: A05_misconfiguration
Impact: New format "%(levelname)s:%(name)s:%(message)s" removes timestamp and makes logs less useful for incident response and forensics. Changed from "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
Recommendation: Restore original timestamp-inclusive format or justify the change in commit message. Cloud Run logs have automatic timestamps, but local development and log aggregation systems benefit from inline timestamps

🔵 [INFO] Startup logging at INFO level may be visible in logs before authentication context is available

File: src/knowledge_base/slack/bot.py:line 1019
Category: A09_logging_monitoring
Impact: Log message 'Slack bot starting' appears early in startup, which is acceptable, but good practice suggests INFO-level startup messages may be verbose if logs are aggregated across many instances
Recommendation: Consider using DEBUG level for startup messages, or ensure log level is configurable via environment variable for different deployment environments

OWASP Top 10 Checklist

Category Status
A01 Access Control ✅ PASS
A02 Crypto Failures ✅ PASS
A03 Injection ✅ PASS
A04 Insecure Design ✅ PASS
A05 Misconfiguration ❌ FAIL
A06 Vulnerable Components ➖ N/A
A07 Auth Failures ✅ PASS
A08 Integrity Failures ✅ PASS
A09 Logging Monitoring ❌ FAIL
A10 Ssrf ➖ N/A

🤖 Security review powered by Claude

Root cause: Settings() triggers logging.warning() during import, which
implicitly calls basicConfig() with default level=WARNING. Our explicit
basicConfig(level=INFO) in main() was a no-op, silently dropping all
INFO-level logs. This made debugging impossible.

Fixes:
- Use force=True in basicConfig() to ensure INFO-level logging works
- Add try/except around say("Thinking...") with fallback to chat.postMessage
- Add exc_info=True to all error logs for stack traces
- Add INFO logs at every step in _handle_question for observability
- Extract thinking_ts with .get() and validate before use
- Remove PII (user_id, channel) from INFO-level logs
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This PR effectively addresses critical issues around silent failures and lack of observability in the bot.

Key improvements:

  1. Logging Fix: The root cause of basicConfig() being silently overridden is correctly identified and fixed by adding force=True to the explicit basicConfig() call. This ensures INFO-level logs are no longer dropped.
  2. Robust Error Handling: The say("Thinking...") call now has comprehensive try/except blocks, including a fallback mechanism to client.chat_postMessage if the initial say fails. All error logs now correctly include exc_info=True for stack trace visibility.
  3. Improved Observability: The addition of INFO-level logs at various stages of _handle_question provides much-needed visibility into the bot's processing flow, which will greatly aid debugging.
  4. Edge Case Handling: The explicit check for thinking_msg.get("ts") before attempting to update or use the timestamp for feedback is a great touch, preventing potential KeyError if the initial message post fails to return a timestamp.

This is a well-thought-out and executed PR that significantly improves the stability and diagnosability of the Slack bot.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security Review

Security Score: A | 2 INFO

Security Review Summary

Decision: APPROVE

Security Score: A (No findings or only INFO level)

Key Findings

No Critical or High Severity Issues Detected

This PR successfully improves security posture while fixing logging infrastructure:

Positive Security Changes

  1. Improved Error Handling (A09 - Logging & Monitoring)

    • Adds exc_info=True to all error logs, enabling stack trace visibility without logging credentials
    • Adds try/except wrapper around say("Thinking...") with fallback to client.chat_postMessage()
    • Prevents silent failures that were masked by logging configuration bug
  2. Enhanced Observability Without PII Leakage

    • Line 572: Logs truncated question text (80 chars) - acceptable balance between debugging and privacy
    • Line 759: Removes user_id from feedback modal log - positive improvement, prevents Slack user ID (PII) leakage
    • Line 584, 602, 616, etc.: Strategic INFO logs at key decision points without exposing sensitive content
  3. Correct Logging Configuration Fix

    • Line 1019: Adds force=True to logging.basicConfig() with clear inline explanation
    • Properly resolves the root cause where Settings() import triggered implicit basicConfig() with level=WARNING
    • This is the correct fix for Python's logging idempotency behavior
  4. Safe Fallback Patterns

    • Line 581-595: Two-level fallback for posting "Thinking..." message handles both say() and direct client.chat_postMessage() failure modes
    • Both paths log errors with stack traces for troubleshooting

OWASP Compliance

  • A01 (Access Control): PASS - No privilege escalation or unauthorized operations
  • A02 (Cryptographic Failures): PASS - No credential exposure; Slack token remains protected
  • A03 (Injection): PASS - No new injection vectors; error messages don't eval user input
  • A04 (Insecure Design): PASS - Adds missing error handling; improves system resilience
  • A05 (Misconfiguration): PASS - Logging configuration is now correct; force=True properly justified
  • A06 (Vulnerable Components): N/A - No dependency changes
  • A07 (Authentication Failures): PASS - No changes to auth mechanism; Slack HMAC verification untouched
  • A08 (Integrity Failures): PASS - No data corruption vectors; validation logic unchanged
  • A09 (Logging & Monitoring): PASS - Adds strategic INFO logs without PII; removes user_id from one log line (security improvement); error logs include stack traces
  • A10 (SSRF): PASS - No new URL handling or redirects

Repository-Specific Security Checks

✅ SLACK_BOT_TOKEN not logged at any level
✅ SLACK_SIGNING_SECRET verification not modified
✅ Slack user PII (user_id) removed from one log, truncated in another
✅ No LLM prompt contents or internal documentation logged
✅ No new credential hardcoding
✅ No Neo4j credentials exposed
✅ No ANTHROPIC_API_KEY exposure

Recommendation

Approve this PR. The changes fix a legitimate logging infrastructure bug that was masking production issues, improve error handling resilience, and actually enhance security by removing user ID from feedback logs. All changes follow secure logging principles per the repository's security context.

Findings

🔵 [INFO] Logging truncated user question text at INFO level

File: src/knowledge_base/slack/bot.py:line 572
Category: A09_logging_monitoring
Impact: Question content is truncated to 80 chars, limiting context exposure while preserving observability for debugging
Recommendation: Current implementation is acceptable. The truncation ([:80]) provides visibility without exposing full user input. No change needed.

🔵 [INFO] Removed user_id from log message in _handle_feedback_action

File: src/knowledge_base/slack/bot.py:line 759
Category: A09_logging_monitoring
Impact: Prevents Slack user ID (PII) from being logged at INFO level, improving privacy posture
Recommendation: This is a positive security improvement. No action required.

OWASP Top 10 Checklist

Category Status
A01 Access Control ✅ PASS
A02 Crypto Failures ✅ PASS
A03 Injection ✅ PASS
A04 Insecure Design ✅ PASS
A05 Misconfiguration ✅ PASS
A06 Vulnerable Components ➖ N/A
A07 Auth Failures ✅ PASS
A08 Integrity Failures ✅ PASS
A09 Logging Monitoring ✅ PASS
A10 Ssrf ✅ PASS

🤖 Security review powered by Claude

@manana2520 manana2520 merged commit 213aa04 into main Feb 22, 2026
9 checks passed
@manana2520 manana2520 deleted the fix/bot-logging-and-reliability branch February 22, 2026 08:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant