Fix Neo4j stale connection recovery in search path#18
Conversation
The Slack bot returns "I couldn't find relevant information" when the Neo4j TCP connection goes stale after idle periods. The Neo4j driver's liveness check catches OSError/ServiceUnavailable/SessionExpired but RuntimeError from asyncio's transport layer escapes, silently returning zero search results. Add retry-with-reset at the GraphitiRetriever layer: - _is_connection_error() detects RuntimeError(TCPTransport), OSError, ServiceUnavailable, and SessionExpired - search_chunks() and _lookup_episodes() retry once on connection error after resetting the GraphitiClient singleton (fresh Neo4j connection) - NEO4J_SEARCH_MAX_RETRIES config setting (default: 1) - GraphitiClient.reset_and_reconnect() closes stale connection and clears singleton state
There was a problem hiding this comment.
LGTM! This Pull Request effectively addresses the critical issue of stale Neo4j connections, which was causing the Slack bot to fail with "I couldn't find relevant information" after idle periods. The solution is well-thought-out and robust.
Key strengths of this PR:
- Clear Intent: The PR directly tackles the problem described in the title and description by implementing a retry-with-reset mechanism for connection errors.
- Comprehensive Error Detection: The
_is_connection_errorhelper correctly identifies various connection-related exceptions, including the specificRuntimeErrorrelated toTCPTransport closedthat was observed in production. - Robust Retry Logic: The retry loops in both
_lookup_episodesandsearch_chunksare correctly implemented, leveraging a configurableNEO4J_SEARCH_MAX_RETRIESsetting. It correctly distinguishes between connection errors (which trigger a retry) and other application errors (which do not). - Effective Connection Reset: The
reset_and_reconnectmethod inGraphitiClientcorrectly forces a re-initialization of the Graphiti instance and its underlying Neo4j driver, ensuring a fresh connection. Crucially,self._graphiti = Nonein the retriever's retry block ensures the cached Graphiti instance is cleared, forcing_get_graphitito obtain a new one. - Thorough Unit Tests: The new unit tests cover various scenarios, including successful retries, exhausted retries, no retries for non-connection errors, and successful first attempts. This provides high confidence in the fix.
- Architectural Alignment: The changes integrate seamlessly with the existing Graphiti client singleton pattern, configuration management, and async structure.
No significant issues or security concerns were found. The solution is performant, resilient, and well-tested.
There was a problem hiding this comment.
🔒 Security Review
Security Score: A | 2 INFO
Security Review Summary: Fix Neo4j Stale Connection Recovery
Decision: APPROVE ✅
This PR implements a robust retry mechanism for Neo4j connection failures in the search path. The implementation is security-sound with no credential exposure, authentication bypass risks, or injection vulnerabilities.
Key Security Strengths
-
No Credential Exposure: The
_is_connection_error()helper does not log, expose, or handle any credentials. NEO4J_PASSWORD remains protected. -
Proper Error Classification: The helper correctly distinguishes connection errors (which warrant retry) from application errors (which should not retry). This prevents masking real bugs.
-
Slack Token & PII Protection: No changes to Slack bot integration, no new logging of SLACK_BOT_TOKEN or user PII (reporter_id, reporter_name).
-
No Injection Risks: The Cypher query in
_lookup_episodes()uses parameterized queries (WHERE ep.uuid IN $uuids) — no SQL/Cypher injection risk. -
Configuration Management: New
NEO4J_SEARCH_MAX_RETRIESsetting properly defined in config.py as a pydantic-settings field with reasonable default (1 retry). -
Connection Reset Handled: The
reset_and_reconnect()method in GraphitiClient properly closes stale connections and clears the singleton, enabling clean reconnection.
Detailed Findings
OWASP A02 (Cryptographic Failures): No credentials are hardcoded, logged, or exposed in error paths. NEO4J_PASSWORD handling is unchanged.
OWASP A09 (Logging & Monitoring): Warning logs include exception messages but do NOT include credentials. The specific error cases (TCPTransport, ServiceUnavailable, SessionExpired) are logged at WARNING level, which is appropriate for infrastructure issues.
Pre-Scan Finding Analysis:
- Flagged:
api_key="vertex-ai-mode"— This is a placeholder string, not a real credential. It's explicitly documented as unused with custom Vertex AI client. This is ACCEPTED per ADR-003. NOT A FINDING. - Flagged:
NEO4J_PASSWORD is required— This is an error message during initialization, not logging a credential value. PASS. - Flagged: Injection risks in graphiti_retriever.py — The code uses Graphiti's search() API, not raw Cypher injection. User queries are passed to Graphiti's safe search interface, not interpolated into Cypher. PASS.
Test Coverage
16 unit tests cover:
- RuntimeError detection with TCPTransport
- OSError and connection errors
- Neo4j-specific exceptions (ServiceUnavailable, SessionExpired)
- Retry behavior on connection errors
- No retry on non-connection errors
- Exhausted retries returning empty results
- Successful first-attempt execution
All tests pass, no regressions in existing 296 tests.
Minor Notes
- The
_is_connection_error()helper uses string matching ("TCPTransport" in str(exc)). This is pragmatic given asyncio's generic RuntimeError. If asyncio changes error message format in future versions, this could fail silently. Document as a known limitation. - Exception messages in logs are appropriate for operational debugging without exposing secrets.
Acceptance Criteria Met
✅ No credential exposure in logs or error handling
✅ No auth bypass or privilege escalation
✅ No injection vulnerabilities
✅ Configuration properly externalized
✅ Comprehensive unit tests with retry logic verification
✅ No changes to Slack security or Neo4j access controls
✅ Follows established security patterns (pydantic-settings, error handling)
Recommendation: APPROVE for merge. Implementation is security-sound and addresses a real production issue (stale Neo4j connections after idle periods).
Findings
🔵 [INFO] Connection error detection uses string matching on exception message ('TCPTransport' in str(exc)). While this approach works, it's fragile if error message format changes.
File: src/knowledge_base/graph/graphiti_retriever.py:line 37
Category: Code Quality
Impact: If Neo4j driver or asyncio changes error message format, TCPTransport errors may not be detected and retries won't trigger.
Recommendation: Consider checking exception type hierarchy more robustly, though current approach is reasonable given asyncio's RuntimeError is generic. Document this as a known limitation.
🔵 [INFO] Warning log message includes exception object which may contain connection details. Generally safe but worth noting.
File: src/knowledge_base/graph/graphiti_retriever.py:line 246
Category: Logging & Monitoring
Impact: Exception details logged could theoretically reveal internal network topology in connection error cases.
Recommendation: Current implementation is acceptable. The warning is specific to connection retry scenarios and doesn't include credentials.
OWASP Top 10 Checklist
| Category | Status |
|---|---|
| A01 Access Control | ✅ PASS |
| A02 Crypto Failures | ✅ PASS |
| A03 Injection | ✅ PASS |
| A04 Insecure Design | ✅ PASS |
| A05 Misconfiguration | ✅ PASS |
| A06 Vulnerable Components | ✅ PASS |
| A07 Auth Failures | ✅ PASS |
| A08 Integrity Failures | ✅ PASS |
| A09 Logging Monitoring | ✅ PASS |
| A10 Ssrf | ✅ PASS |
🤖 Security review powered by Claude
Summary
OSError/ServiceUnavailable/SessionExpiredbutRuntimeErrorfrom asyncio's transport layer (TCPTransport closed) escapes, silently returning zero search resultsGraphitiRetrieverlayer: on connection error, reset the singletonGraphitiClientand retry once with a fresh Neo4j connectionChanges
graphiti_retriever.py: Added_is_connection_error()helper + retry loop insearch_chunks()and_lookup_episodes()graphiti_client.py: Addedreset_and_reconnect()method to close stale connection and clear singletonconfig.py: AddedNEO4J_SEARCH_MAX_RETRIESsetting (default: 1)tests/test_connection_retry.py: 16 unit tests covering error detection, retry behavior, and edge casesTest plan