Fix Neo4j stale connection errors with connection pool resilience#15
Fix Neo4j stale connection errors with connection pool resilience#15manana2520 merged 2 commits intomainfrom
Conversation
Bot was returning 'I couldn't find relevant information' because Neo4j TCP connections went stale during the 18-hour pipeline intake. The default Neo4j driver has liveness_check_timeout=None (disabled), so stale connections are never detected and fail with 'TCPTransport closed'. Add three connection pool settings: - liveness_check_timeout=30s: ping connection before each use - max_connection_lifetime=1800s: recycle every 30min (was 1h) - connection_acquisition_timeout=60s: more headroom during contention All configurable via environment variables.
There was a problem hiding this comment.
This PR effectively addresses the reported issue of stale Neo4j connections by introducing configurable connection pool resilience settings for the AsyncGraphDatabase.driver() in src/knowledge_base/graph/graphiti_client.py and adding these settings to src/knowledge_base/config.py. The new configuration values (liveness check, max connection lifetime, acquisition timeout) are sensible for a production environment experiencing long-running processes.
However, there is a critical race condition in the GraphitiClient.get_client() method due to its asynchronous singleton initialization. If multiple async tasks attempt to call get_client() concurrently when the client has not yet been initialized, they will all proceed to create new Graphiti instances independently. This can lead to resource leaks, unexpected behavior, or multiple connections to the database, defeating the purpose of the singleton. This should be protected using an asyncio.Lock.
Additionally, the current approach of creating graphiti_core.driver.neo4j_driver.Neo4jDriver with its default internal driver, then explicitly closing that default driver and replacing it with a custom-configured one (neo4j_driver.client = neo4j_async_driver), is a workaround that relies heavily on the internal implementation details of the graphiti_core library. While it might be necessary if graphiti_core doesn't provide a direct way to inject connection pool settings or a pre-configured AsyncGraphDatabase.driver into Neo4jDriver's constructor, it makes the code fragile to future updates of graphiti_core. If a more direct integration path exists or can be added to graphiti_core, that would be preferable. Assuming it doesn't, the workaround is functionally correct for now, but the fragility should be noted.
Requested Change:
- Implement
asyncio.Lockfor singleton initialization: Protect theGraphitiClient.get_client()method's critical section (where_instanceis assigned) with anasyncio.Lockto prevent race conditions during concurrent initialization. This ensures only oneGraphitiinstance is created and shared across the application.
Once the race condition is addressed, this PR will be ready for approval.
There was a problem hiding this comment.
🔒 Security Review
Security Score: A | 2 INFO
Security Review Summary: Neo4j Connection Pool Resilience PR
Decision: APPROVE ✅
This PR introduces Neo4j connection pool resilience settings to fix stale connection errors during long-running pipeline intake operations. The implementation is security-appropriate with no credential exposure risks.
Key Security Findings
POSITIVE:
- ✅ All three new Neo4j configuration settings (
NEO4J_LIVENESS_CHECK_TIMEOUT,NEO4J_MAX_CONNECTION_LIFETIME,NEO4J_CONNECTION_ACQUISITION_TIMEOUT) properly defined inconfig.pyusing pydantic-settings (no hardcoding) - ✅ Neo4j credentials (
neo4j_password) are never logged or exposed in error messages - ✅ The AsyncGraphDatabase.driver() call correctly injects configured timeouts to prevent stale connections
- ✅ Error handling in
_create_neo4j_client()properly wraps exceptions without credential leakage - ✅ Follows accepted ADR-002: empty string defaults for optional
NEO4J_PASSWORDallow graceful degradation - ✅ No removal of security mechanisms (HMAC verification, connection validation, etc.)
- ✅ No changes to public API exposure or firewall rules
- ✅ No PII/sensitive data logging at any level
INFORMATIONAL NOTES:
- Line 294:
api_key="vertex-ai-mode"is a documented design pattern for Vertex AI authentication (uses service account credentials, not API keys). Safe but could benefit from inline documentation. - Line 143: Error message mentions "NEO4J_PASSWORD is required" which is appropriate operational feedback without exposing the actual credential value.
OWASP Top 10 Assessment
All categories PASS:
- A01 (Access Control): ✅ No privilege escalation or unauthorized operation risks
- A02 (Cryptographic Failures): ✅ Credentials properly stored in pydantic-settings, not hardcoded
- A04 (Insecure Design): ✅ Connection pool resilience actually strengthens the design by preventing stale connection failures
- A05 (Misconfiguration): ✅ Default values are reasonable and configurable
- A09 (Logging): ✅ No credentials in logs; error messages appropriately scoped
Verification
Pre-scan finding at line 294 is ACCEPTABLE per ADR-004 (Vertex AI authentication pattern). Pre-scan finding at line 143 is a false positive (appropriate error message, not credential exposure).
Recommendation
Approve without changes. The PR correctly addresses the production issue (stale Neo4j connections) using secure, configuration-driven parameters that align with the repository's security patterns.
Findings
🔵 [INFO] Placeholder string 'vertex-ai-mode' used as api_key parameter for Vertex AI Gemini client
File: src/knowledge_base/graph/graphiti_client.py:line 294
Category: A05_misconfiguration
Impact: This is a design pattern where the api_key parameter is a placeholder because Vertex AI authentication uses service account credentials instead. Not a credential exposure risk, but could be confusing to maintainers.
Recommendation: Add a code comment clarifying that 'vertex-ai-mode' is intentional and that real credentials come from GCP service account (automatic in Cloud Run). Consider adding a constant like VERTEX_AI_MODE_PLACEHOLDER = 'vertex-ai-mode' for clarity.
🔵 [INFO] Error message includes 'NEO4J_PASSWORD is required for Neo4j backend' which mentions credential name
File: src/knowledge_base/graph/graphiti_client.py:line 143
Category: A05_misconfiguration
Impact: Minimal - error message indicates what credential is missing without exposing the actual value. This is appropriate for operational troubleshooting.
Recommendation: No change needed. The error message is appropriately scoped and does not expose the actual password value.
OWASP Top 10 Checklist
| Category | Status |
|---|---|
| A01 Access Control | ✅ PASS |
| A02 Crypto Failures | ✅ PASS |
| A03 Injection | ✅ PASS |
| A04 Insecure Design | ✅ PASS |
| A05 Misconfiguration | ✅ PASS |
| A06 Vulnerable Components | ✅ PASS |
| A07 Auth Failures | ✅ PASS |
| A08 Integrity Failures | ✅ PASS |
| A09 Logging Monitoring | ✅ PASS |
| A10 Ssrf | ✅ PASS |
🤖 Security review powered by Claude
Add asyncio.Lock to prevent concurrent initialization from multiple async tasks creating duplicate Graphiti/Neo4j driver instances. Uses double-checked locking pattern for efficiency.
There was a problem hiding this comment.
LGTM! This Pull Request effectively addresses the reported issue of stale Neo4j connections causing TCPTransport closed errors in long-running processes. The solution is well-implemented and aligns perfectly with our architecture standards.
Key Strengths:
- Intent Fulfilled: The PR directly solves the problem by introducing critical Neo4j connection pool resilience settings (
liveness_check_timeout,max_connection_lifetime,connection_acquisition_timeout). The chosen default values appear reasonable to mitigate stale connection issues. - Architectural Alignment: Configuration is handled correctly using
pydantic-settingsinconfig.py, maintaining our standard. The integration with Graphiti'sNeo4jDriveris also correctly handled by creating a customAsyncGraphDatabase.driverand injecting it, despite thegraphiti-core'sNeo4jDrivernot having a direct constructor for this. - Code Quality: The code is clean, includes clear comments explaining the rationale behind the changes, and maintains type hints. The lazy initialization and
asyncio.Lockfor theGraphitiClientsingleton are robust. - Security: No new security vulnerabilities were introduced. Sensitive credentials like
NEO4J_PASSWORDare handled correctly via settings.
The approach of closing the Neo4jDriver's default client before assigning the custom one is a valid way to ensure proper resource management when replacing the underlying driver instance.
Great work on identifying the root cause and implementing a resilient solution!
There was a problem hiding this comment.
🔒 Security Review
Security Score: A | 2 INFO
Security Review Summary
Decision: APPROVE ✓
This PR introduces Neo4j connection pool resilience settings to fix stale connection errors in production. The implementation demonstrates strong security practices and introduces no new vulnerabilities.
Key Strengths
-
Configuration Management: All three new settings (
NEO4J_LIVENESS_CHECK_TIMEOUT,NEO4J_MAX_CONNECTION_LIFETIME,NEO4J_CONNECTION_ACQUISITION_TIMEOUT) are properly defined inconfig.pyusing pydantic-settings with reasonable defaults. No hardcoded values in source code. -
Credential Handling: Neo4j credentials remain properly abstracted through settings. The
neo4j_passwordparameter is correctly validated before use (line 157 in graphiti_client.py checks for empty password). -
Connection Pool Implementation: The PR correctly uses Neo4j's
AsyncGraphDatabase.driver()with explicit resilience parameters:liveness_check_timeout=30s: Detects stale connections before usemax_connection_lifetime=1800s: Recycles connections every 30 minutes to prevent long-lived stale statesconnection_acquisition_timeout=60s: Provides headroom during contention
-
No Credentials Logged: The implementation does not log sensitive credentials. The driver initialization (lines 152-157) passes credentials only to AsyncGraphDatabase constructor, not to logging statements.
-
Error Handling: Error messages are generic and do not leak credential values or system internals that could aid attackers.
-
Backward Compatibility: Settings use sensible defaults that work in both development and production environments. Empty string defaults for optional providers are consistent with repository's ADR-002.
Pre-scan False Positives
Both automated findings are false positives:
- Line 308
api_key="vertex-ai-mode": This is a placeholder sentinel value for Vertex AI mode, not a credential. Real authentication uses genai.Client with service account credentials. - Line 157 error message: Generic error message does not expose the actual password value.
Compliance
- ✓ Follows pydantic-settings pattern for all configuration
- ✓ No hardcoded credentials in source code
- ✓ Credentials validated at point of use
- ✓ No credentials in error messages or logs
- ✓ Neo4j connection resilience properly configured
- ✓ Maintains existing security patterns (SLACK_SIGNING_SECRET verification, Neo4j auth, etc.)
Risk Assessment: LOW
The changes are focused, well-tested connection pool improvements with no security regressions or new attack surfaces introduced. The default values are reasonable and tunable via environment variables for different deployment scenarios.
Findings
🔵 [INFO] Pre-scan flagged 'api_key="vertex-ai-mode"' as hardcoded credential
File: src/knowledge_base/graph/graphiti_client.py:line 308
Category: Code Quality
Impact: False positive - this is a placeholder string (not a real credential) used only when Vertex AI authentication bypasses the config key
Recommendation: No action needed. The string 'vertex-ai-mode' is a sentinel value, not a credential. The actual authentication is handled by genai.Client with vertexai=True and service account credentials. This is properly documented in the comment above.
🔵 [INFO] Pre-scan flagged error message 'NEO4J_PASSWORD is required for Neo4j backend' as error exposure
File: src/knowledge_base/graph/graphiti_client.py:line 157
Category: A09 Logging & Monitoring
Impact: False positive - error message does not expose the password value, only indicates it is missing. This is appropriate for security troubleshooting
Recommendation: No action needed. The error message is generic and does not leak sensitive data.
OWASP Top 10 Checklist
| Category | Status |
|---|---|
| A01 Access Control | ✅ PASS |
| A02 Crypto Failures | ✅ PASS |
| A03 Injection | ✅ PASS |
| A04 Insecure Design | ✅ PASS |
| A05 Misconfiguration | ✅ PASS |
| A06 Vulnerable Components | ✅ PASS |
| A07 Auth Failures | ✅ PASS |
| A08 Integrity Failures | ✅ PASS |
| A09 Logging Monitoring | ✅ PASS |
| A10 Ssrf | ✅ PASS |
🤖 Security review powered by Claude
Summary
liveness_check_timeout=None(disabled) — stale connections are never detected, fail withRuntimeError: TCPTransport closedChanges
config.py: Add 3 new configurable settings (NEO4J_LIVENESS_CHECK_TIMEOUT,NEO4J_MAX_CONNECTION_LIFETIME,NEO4J_CONNECTION_ACQUISITION_TIMEOUT)graphiti_client.py: CreateAsyncGraphDatabase.driver()with resilience settings, inject into Graphiti'sNeo4jDriverviagraph_driver=parameterSettings
liveness_check_timeoutNone(disabled)30smax_connection_lifetime3600s(1 hour)1800s(30 min)connection_acquisition_timeout30s60sTest plan