Update HA page: unresponsive endpoint detection and node failure fixes #2878

Open
wants to merge 18 commits into
base: main
from


Collaborator

@bgrenon bgrenon commented Feb 4, 2025

This page was updated mostly to add "compute failover" sections for:

  • Node failure
  • Unresponsive endpoint
  • AZ failure

Preview: https://neon-next-git-bgrenon-ha-update-neondatabase.vercel.app/docs/introduction/high-availability#compute-failover

@bgrenon bgrenon requested a review from danieltprice as a code owner February 4, 2025 19:39

vercel bot commented Feb 4, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| neon-next | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Feb 12, 2025 2:45pm |

@bgrenon bgrenon changed the title Update HA page for improvements to crashlooping endpoint detection and node failure Update HA page cycling endpoint detection and node failure fixes Feb 4, 2025

If a compute endpoint is in a degraded state (repeatedly crashing and restarting rather than failing outright), we will detect and reattach it automatically, typically within 5 minutes. During this time, your application may experience intermittent connectivity.

#### Node failures
Collaborator

same comment as above.

| VM failure | Brief interruption | VM recreation and endpoint reattachment | Seconds |
| Degraded endpoint | Possible intermittent connectivity | Automatic detection and reattachment | Up to 5 minutes |
| Node failure | Compute unavailable | Rescheduling to healthy nodes | ~2 minutes |
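
To make the recovery times in this table concrete for client code, here is a minimal sketch of a retry schedule sized to the longest window listed (up to 5 minutes for a degraded endpoint). The function name `backoff_delays` and its default values are illustrative assumptions, not anything defined by Neon or by this PR.

```python
from typing import Iterator


def backoff_delays(
    budget_s: float = 300.0,  # assumed budget: the "up to 5 minutes" row above
    base_s: float = 0.5,      # assumed first delay
    cap_s: float = 30.0,      # assumed maximum delay between attempts
) -> Iterator[float]:
    """Yield capped exponential delays whose sum never exceeds budget_s."""
    elapsed = 0.0
    delay = base_s
    while elapsed < budget_s:
        # Never wait longer than the cap or than the time left in the budget.
        step = min(delay, cap_s, budget_s - elapsed)
        yield step
        elapsed += step
        delay *= 2


if __name__ == "__main__":
    for wait in backoff_delays():
        print(f"retry after {wait:.1f}s")
```

Short early delays suit the "Seconds" case of a VM failure, while the capped tail keeps retrying through the longer node-failure and degraded-endpoint windows. A production client would likely add random jitter to each delay.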

### Impact on session data after a failure?

While your application should handle reconnections automatically, session-specific data like temporary tables, prepared statements, and the Local File Cache ([LFC](/docs/reference/glossary#local-file-cache)), which stores frequently accessed data, will not persist across a failover. As a result, queries may initially run more slowly until the Postgres memory buffers and cache are rebuilt.
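
Because session-scoped state does not survive a failover, a reconnecting client has to rebuild it before resuming work. The following is a minimal sketch of that pattern, assuming hypothetical `connect`, `init_session`, and `work` callables supplied by the application; it is not tied to any particular driver, and the exception types to treat as retryable depend on the driver in use.

```python
import time
from typing import Any, Callable, Tuple, Type


def run_with_reconnect(
    connect: Callable[[], Any],             # hypothetical: opens a new connection
    init_session: Callable[[Any], None],    # hypothetical: recreates temp tables, re-prepares statements
    work: Callable[[Any], Any],             # hypothetical: the queries to run
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 5,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Run work on a fresh connection, rebuilding session state on every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: BaseException | None = None
    for attempt in range(attempts):
        try:
            conn = connect()
            # Session state from before the failure is gone, so set it up again.
            init_session(conn)
            return work(conn)
        except retryable as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay_s)
    assert last_error is not None
    raise last_error
```

The key design choice is that `init_session` runs on every new connection rather than once at startup, so temporary tables and prepared statements are always present when `work` runs. The cache warm-up cost mentioned above still applies to the first queries after recovery.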
Collaborator

We use "failover" above. Should we use "recovery"? For the sake of argument, why can't we call this section "Compute failover", as some have suggested, and explain that failover in Neon's serverless architecture is a little different from traditional failover?

Collaborator Author

let's try that!

@bgrenon bgrenon changed the title Update HA page cycling endpoint detection and node failure fixes Update HA page: unresponsive endpoint detection and node failure fixes Feb 7, 2025
Co-authored-by: Daniel <10074684+danieltprice@users.noreply.github.com>
Collaborator

@danieltprice danieltprice left a comment

Let's get approval from someone in Development before posting. Maybe Vadim or Alexey.


3 participants