Skip to content

Comments

Fix Neo4j stale connection errors with connection pool resilience#15

Merged
manana2520 merged 2 commits intomainfrom
fix/neo4j-connection-resilience
Feb 20, 2026
Merged

Fix Neo4j stale connection errors with connection pool resilience#15
manana2520 merged 2 commits intomainfrom
fix/neo4j-connection-resilience

Conversation

@manana2520
Copy link
Contributor

Summary

  • Production Slack bot was returning "I couldn't find relevant information" because Neo4j TCP connections went stale during the 18-hour pipeline intake run
  • Root cause: Neo4j async driver default liveness_check_timeout=None (disabled) — stale connections are never detected, fail with RuntimeError: TCPTransport closed
  • Fix: Create Neo4j driver with explicit connection pool resilience settings

Changes

  • config.py: Add 3 new configurable settings (NEO4J_LIVENESS_CHECK_TIMEOUT, NEO4J_MAX_CONNECTION_LIFETIME, NEO4J_CONNECTION_ACQUISITION_TIMEOUT)
  • graphiti_client.py: Create AsyncGraphDatabase.driver() with resilience settings, inject into Graphiti's Neo4jDriver via graph_driver= parameter

Settings

Setting Default (before) New Value Purpose
liveness_check_timeout None (disabled) 30s Ping connection before each use
max_connection_lifetime 3600s (1 hour) 1800s (30 min) Recycle connections more often
connection_acquisition_timeout 30s 60s More headroom during contention

Test plan

  • Unit tests pass
  • Docker images build
  • E2E tests pass on staging (bot can answer questions)
  • Verify bot responds correctly after deploy

Bot was returning 'I couldn't find relevant information' because Neo4j
TCP connections went stale during the 18-hour pipeline intake. The default
Neo4j driver has liveness_check_timeout=None (disabled), so stale
connections are never detected and fail with 'TCPTransport closed'.

Add three connection pool settings:
- liveness_check_timeout=30s: ping connection before each use
- max_connection_lifetime=1800s: recycle every 30min (was 1h)
- connection_acquisition_timeout=60s: more headroom during contention

All configurable via environment variables.
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR effectively addresses the reported issue of stale Neo4j connections by introducing configurable connection pool resilience settings for the AsyncGraphDatabase.driver() in src/knowledge_base/graph/graphiti_client.py and adding these settings to src/knowledge_base/config.py. The new configuration values (liveness check, max connection lifetime, acquisition timeout) are sensible for a production environment experiencing long-running processes.

However, there is a critical race condition in the GraphitiClient.get_client() method due to its asynchronous singleton initialization. If multiple async tasks attempt to call get_client() concurrently when the client has not yet been initialized, they will all proceed to create new Graphiti instances independently. This can lead to resource leaks, unexpected behavior, or multiple connections to the database, defeating the purpose of the singleton. This should be protected using an asyncio.Lock.

Additionally, the current approach of creating graphiti_core.driver.neo4j_driver.Neo4jDriver with its default internal driver, then explicitly closing that default driver and replacing it with a custom-configured one (neo4j_driver.client = neo4j_async_driver), is a workaround that relies heavily on the internal implementation details of the graphiti_core library. While it might be necessary if graphiti_core doesn't provide a direct way to inject connection pool settings or a pre-configured AsyncGraphDatabase.driver into Neo4jDriver's constructor, it makes the code fragile to future updates of graphiti_core. If a more direct integration path exists or can be added to graphiti_core, that would be preferable. Assuming it doesn't, the workaround is functionally correct for now, but the fragility should be noted.

Requested Change:

  • Implement asyncio.Lock for singleton initialization: Protect the GraphitiClient.get_client() method's critical section (where _instance is assigned) with an asyncio.Lock to prevent race conditions during concurrent initialization. This ensures only one Graphiti instance is created and shared across the application.

Once the race condition is addressed, this PR will be ready for approval.

github-actions[bot]
github-actions bot previously approved these changes Feb 20, 2026
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security Review

Security Score: A | 2 INFO

Security Review Summary: Neo4j Connection Pool Resilience PR

Decision: APPROVE ✅

This PR introduces Neo4j connection pool resilience settings to fix stale connection errors during long-running pipeline intake operations. The implementation is security-appropriate with no credential exposure risks.

Key Security Findings

POSITIVE:

  • ✅ All three new Neo4j configuration settings (NEO4J_LIVENESS_CHECK_TIMEOUT, NEO4J_MAX_CONNECTION_LIFETIME, NEO4J_CONNECTION_ACQUISITION_TIMEOUT) properly defined in config.py using pydantic-settings (no hardcoding)
  • ✅ Neo4j credentials (neo4j_password) are never logged or exposed in error messages
  • ✅ The AsyncGraphDatabase.driver() call correctly injects configured timeouts to prevent stale connections
  • ✅ Error handling in _create_neo4j_client() properly wraps exceptions without credential leakage
  • ✅ Follows accepted ADR-002: empty string defaults for optional NEO4J_PASSWORD allow graceful degradation
  • ✅ No removal of security mechanisms (HMAC verification, connection validation, etc.)
  • ✅ No changes to public API exposure or firewall rules
  • ✅ No PII/sensitive data logging at any level

INFORMATIONAL NOTES:

  • Line 294: api_key="vertex-ai-mode" is a documented design pattern for Vertex AI authentication (uses service account credentials, not API keys). Safe but could benefit from inline documentation.
  • Line 143: Error message mentions "NEO4J_PASSWORD is required" which is appropriate operational feedback without exposing the actual credential value.

OWASP Top 10 Assessment

All categories PASS:

  • A01 (Access Control): ✅ No privilege escalation or unauthorized operation risks
  • A02 (Cryptographic Failures): ✅ Credentials properly stored in pydantic-settings, not hardcoded
  • A04 (Insecure Design): ✅ Connection pool resilience actually strengthens the design by preventing stale connection failures
  • A05 (Misconfiguration): ✅ Default values are reasonable and configurable
  • A09 (Logging): ✅ No credentials in logs; error messages appropriately scoped

Verification

Pre-scan finding at line 294 is ACCEPTABLE per ADR-004 (Vertex AI authentication pattern). Pre-scan finding at line 143 is a false positive (appropriate error message, not credential exposure).

Recommendation

Approve without changes. The PR correctly addresses the production issue (stale Neo4j connections) using secure, configuration-driven parameters that align with the repository's security patterns.

Findings

🔵 [INFO] Placeholder string 'vertex-ai-mode' used as api_key parameter for Vertex AI Gemini client

File: src/knowledge_base/graph/graphiti_client.py:line 294
Category: A05_misconfiguration
Impact: This is a design pattern where the api_key parameter is a placeholder because Vertex AI authentication uses service account credentials instead. Not a credential exposure risk, but could be confusing to maintainers.
Recommendation: Add a code comment clarifying that 'vertex-ai-mode' is intentional and that real credentials come from GCP service account (automatic in Cloud Run). Consider adding a constant like VERTEX_AI_MODE_PLACEHOLDER = 'vertex-ai-mode' for clarity.

🔵 [INFO] Error message includes 'NEO4J_PASSWORD is required for Neo4j backend' which mentions credential name

File: src/knowledge_base/graph/graphiti_client.py:line 143
Category: A05_misconfiguration
Impact: Minimal - error message indicates what credential is missing without exposing the actual value. This is appropriate for operational troubleshooting.
Recommendation: No change needed. The error message is appropriately scoped and does not expose the actual password value.

OWASP Top 10 Checklist

Category Status
A01 Access Control ✅ PASS
A02 Crypto Failures ✅ PASS
A03 Injection ✅ PASS
A04 Insecure Design ✅ PASS
A05 Misconfiguration ✅ PASS
A06 Vulnerable Components ✅ PASS
A07 Auth Failures ✅ PASS
A08 Integrity Failures ✅ PASS
A09 Logging Monitoring ✅ PASS
A10 Ssrf ✅ PASS

🤖 Security review powered by Claude

Add asyncio.Lock to prevent concurrent initialization from multiple
async tasks creating duplicate Graphiti/Neo4j driver instances.
Uses double-checked locking pattern for efficiency.
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This Pull Request effectively addresses the reported issue of stale Neo4j connections causing TCPTransport closed errors in long-running processes. The solution is well-implemented and aligns perfectly with our architecture standards.

Key Strengths:

  1. Intent Fulfilled: The PR directly solves the problem by introducing critical Neo4j connection pool resilience settings (liveness_check_timeout, max_connection_lifetime, connection_acquisition_timeout). The chosen default values appear reasonable to mitigate stale connection issues.
  2. Architectural Alignment: Configuration is handled correctly using pydantic-settings in config.py, maintaining our standard. The integration with Graphiti's Neo4jDriver is also correctly handled by creating a custom AsyncGraphDatabase.driver and injecting it, despite the graphiti-core's Neo4jDriver not having a direct constructor for this.
  3. Code Quality: The code is clean, includes clear comments explaining the rationale behind the changes, and maintains type hints. The lazy initialization and asyncio.Lock for the GraphitiClient singleton are robust.
  4. Security: No new security vulnerabilities were introduced. Sensitive credentials like NEO4J_PASSWORD are handled correctly via settings.

The approach of closing the Neo4jDriver's default client before assigning the custom one is a valid way to ensure proper resource management when replacing the underlying driver instance.

Great work on identifying the root cause and implementing a resilient solution!

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security Review

Security Score: A | 2 INFO

Security Review Summary

Decision: APPROVE ✓

This PR introduces Neo4j connection pool resilience settings to fix stale connection errors in production. The implementation demonstrates strong security practices and introduces no new vulnerabilities.

Key Strengths

  1. Configuration Management: All three new settings (NEO4J_LIVENESS_CHECK_TIMEOUT, NEO4J_MAX_CONNECTION_LIFETIME, NEO4J_CONNECTION_ACQUISITION_TIMEOUT) are properly defined in config.py using pydantic-settings with reasonable defaults. No hardcoded values in source code.

  2. Credential Handling: Neo4j credentials remain properly abstracted through settings. The neo4j_password parameter is correctly validated before use (line 157 in graphiti_client.py checks for empty password).

  3. Connection Pool Implementation: The PR correctly uses Neo4j's AsyncGraphDatabase.driver() with explicit resilience parameters:

    • liveness_check_timeout=30s: Detects stale connections before use
    • max_connection_lifetime=1800s: Recycles connections every 30 minutes to prevent long-lived stale states
    • connection_acquisition_timeout=60s: Provides headroom during contention
  4. No Credentials Logged: The implementation does not log sensitive credentials. The driver initialization (lines 152-157) passes credentials only to AsyncGraphDatabase constructor, not to logging statements.

  5. Error Handling: Error messages are generic and do not leak credential values or system internals that could aid attackers.

  6. Backward Compatibility: Settings use sensible defaults that work in both development and production environments. Empty string defaults for optional providers are consistent with repository's ADR-002.

Pre-scan False Positives

Both automated findings are false positives:

  • Line 308 api_key="vertex-ai-mode": This is a placeholder sentinel value for Vertex AI mode, not a credential. Real authentication uses genai.Client with service account credentials.
  • Line 157 error message: Generic error message does not expose the actual password value.

Compliance

  • ✓ Follows pydantic-settings pattern for all configuration
  • ✓ No hardcoded credentials in source code
  • ✓ Credentials validated at point of use
  • ✓ No credentials in error messages or logs
  • ✓ Neo4j connection resilience properly configured
  • ✓ Maintains existing security patterns (SLACK_SIGNING_SECRET verification, Neo4j auth, etc.)

Risk Assessment: LOW

The changes are focused, well-tested connection pool improvements with no security regressions or new attack surfaces introduced. The default values are reasonable and tunable via environment variables for different deployment scenarios.

Findings

🔵 [INFO] Pre-scan flagged 'api_key="vertex-ai-mode"' as hardcoded credential

File: src/knowledge_base/graph/graphiti_client.py:line 308
Category: Code Quality
Impact: False positive - this is a placeholder string (not a real credential) used only when Vertex AI authentication bypasses the config key
Recommendation: No action needed. The string 'vertex-ai-mode' is a sentinel value, not a credential. The actual authentication is handled by genai.Client with vertexai=True and service account credentials. This is properly documented in the comment above.

🔵 [INFO] Pre-scan flagged error message 'NEO4J_PASSWORD is required for Neo4j backend' as error exposure

File: src/knowledge_base/graph/graphiti_client.py:line 157
Category: A09 Logging & Monitoring
Impact: False positive - error message does not expose the password value, only indicates it is missing. This is appropriate for security troubleshooting
Recommendation: No action needed. The error message is generic and does not leak sensitive data.

OWASP Top 10 Checklist

Category Status
A01 Access Control ✅ PASS
A02 Crypto Failures ✅ PASS
A03 Injection ✅ PASS
A04 Insecure Design ✅ PASS
A05 Misconfiguration ✅ PASS
A06 Vulnerable Components ✅ PASS
A07 Auth Failures ✅ PASS
A08 Integrity Failures ✅ PASS
A09 Logging Monitoring ✅ PASS
A10 Ssrf ✅ PASS

🤖 Security review powered by Claude

@manana2520 manana2520 merged commit ff410f5 into main Feb 20, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant