Connection Reset and Timeout Fixes

Problem Summary

The backend was experiencing:

  1. SSE connection failures - Streams dying after exactly 60 seconds with "context deadline exceeded"
  2. Connection reset errors - Intermittent "connection reset by peer" when calling Restate
  3. Heartbeat failures - SSE heartbeats failing due to timeout

Root Cause

The global timeout middleware in main.go applied a 60-second deadline to every request, including long-lived SSE streams:

r.Use(middleware.Timeout(60 * time.Second)) // ❌ Kills SSE streams!

SSE connections are meant to be long-lived, but the middleware cancelled each stream's request context after 60 seconds, which ended the stream and caused cascading failures.
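The failure mode can be reproduced in isolation. The following is a minimal sketch using only the standard library, not the project's actual handler: `streamHeartbeats` and `heartbeatsBeforeDeadline` are hypothetical names, and the timeout is shortened to milliseconds so the effect is visible immediately. It shows a heartbeat loop ending with "context deadline exceeded" as soon as the deadline attached to its context expires.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"time"
)

// streamHeartbeats is a hypothetical stand-in for an SSE handler loop: it
// writes an SSE comment line on every tick until the context ends.
func streamHeartbeats(ctx context.Context, w io.Writer, interval time.Duration) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	sent := 0
	for {
		select {
		case <-ctx.Done():
			// A deadline set by an outer middleware surfaces here.
			return sent, ctx.Err()
		case <-ticker.C:
			if _, err := fmt.Fprint(w, ": heartbeat\n\n"); err != nil {
				return sent, err
			}
			sent++
		}
	}
}

// heartbeatsBeforeDeadline wraps the loop in a deadline, similar to how a
// timeout middleware wraps the request context of every handler behind it.
func heartbeatsBeforeDeadline(timeoutMs, intervalMs int) (int, string) {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	n, err := streamHeartbeats(ctx, io.Discard, time.Duration(intervalMs)*time.Millisecond)
	return n, err.Error()
}

func main() {
	n, msg := heartbeatsBeforeDeadline(100, 20)
	fmt.Printf("sent %d heartbeats, then: %s\n", n, msg)
}
```

The loop itself has no bug; it stops only because the context it was handed carries a deadline, which is why the fix belongs in the router configuration rather than in the stream handlers.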

Fixes Applied

1. Selective Timeout Middleware

Changed: Removed global timeout and applied it only to API routes

// ✅ NO global timeout
r.Use(middleware.Logger)
r.Use(middleware.Recoverer)
r.Use(middleware.RealIP)

// ✅ Timeout ONLY for API routes
r.Route("/api", func(r chi.Router) {
    r.Use(middleware.Timeout(60 * time.Second))
    // ... API routes
})

// ✅ SSE routes have NO timeout
r.Get("/stream/notifications", handlers.StreamNotifications)
r.Get("/stream/workflow/{orderID}", handlers.StreamWorkflowStatus)
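To check that the scoping behaves as intended, the sketch below rebuilds the same layout with only `net/http`, so it runs without chi. Everything in it is an assumption for illustration: `withTimeout` is a simplified stand-in for `middleware.Timeout` (it sets the deadline but does not write a 504), and `reportDeadline`, `newRouter`, and `routeHasDeadline` are hypothetical helpers. It shows that a handler under `/api/` sees a context deadline while a stream route does not.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// withTimeout is a simplified, stdlib-only stand-in for a timeout
// middleware: it attaches a deadline to the request context.
func withTimeout(d time.Duration, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), d)
		defer cancel()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// reportDeadline is a hypothetical handler that writes "true" or "false"
// depending on whether its request context carries a deadline.
func reportDeadline(w http.ResponseWriter, r *http.Request) {
	_, ok := r.Context().Deadline()
	fmt.Fprintf(w, "%t", ok)
}

// newRouter wraps only the /api/ subtree in the timeout; stream routes are
// registered on the root mux and stay unwrapped.
func newRouter() http.Handler {
	api := http.NewServeMux()
	api.HandleFunc("/api/cart", reportDeadline)

	root := http.NewServeMux()
	root.Handle("/api/", withTimeout(60*time.Second, api))
	root.HandleFunc("/stream/notifications", reportDeadline)
	return root
}

// routeHasDeadline sends an in-memory GET to the router and reports
// whether the matched handler saw a deadline.
func routeHasDeadline(path string) bool {
	rec := httptest.NewRecorder()
	newRouter().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Body.String() == "true"
}

func main() {
	fmt.Println("/api/cart has deadline:", routeHasDeadline("/api/cart"))
	fmt.Println("/stream/notifications has deadline:", routeHasDeadline("/stream/notifications"))
}
```

A check like this could be turned into a regression test so that a future global `r.Use` of a timeout does not silently reintroduce the 60-second cutoff on streams.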

2. Restate Health Check Endpoint

Added: /health/restate endpoint to monitor Restate connectivity

Features:

  • 2-second timeout for health check requests
  • Distinguishes between connection errors and business logic errors
  • Returns detailed status information

Usage:

curl http://localhost:8081/health/restate
# Response: {"status":"healthy","url":"http://localhost:9089","note":"Restate SDK is reachable"}
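One way such a reachability check could be structured is sketched below. This is not the code in ingress.go: `probe` and `probeLocalTestServer` are hypothetical names, the probe uses plain `net/http` rather than the Restate SDK, and it assumes that any HTTP response, including an error status, means the peer is reachable, while a transport error or timeout means it is not.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probe issues a GET with a 2-second deadline. Any HTTP response, even an
// error status, is reported as "healthy" because the peer answered; a
// transport failure or timeout is reported as "unreachable".
func probe(url string) string {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "unreachable"
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Connection refused, reset, DNS failure, or deadline exceeded.
		return "unreachable"
	}
	resp.Body.Close()
	return "healthy"
}

// probeLocalTestServer starts a throwaway server that always answers with
// the given status code and probes it.
func probeLocalTestServer(status int) string {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
	defer srv.Close()
	return probe(srv.URL)
}

func main() {
	// A 404 still proves the peer is up.
	fmt.Println(probeLocalTestServer(http.StatusNotFound))
	// Nothing is expected to listen on port 1, so this should fail to connect.
	fmt.Println(probe("http://127.0.0.1:1"))
}
```

Treating error statuses as "reachable" keeps the health check focused on connectivity, which is the failure this document is about; application-level errors are better surfaced by the endpoints that produce them.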

3. Enhanced Error Handling

Added defensive error handling to cart operations:

  • ✅ Per-request 5-second timeout contexts
  • ✅ Detailed logging at each step
  • ✅ Timeout detection with specific error messages
  • ✅ Success logging for debugging

Example - GetCart with timeout handling:

// Create timeout context for Restate call
ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()

basket, err := restateingress.Object[restate.Void, []models.CartItem](
    h.client, "UserSession", userID, "GetBasket",
).Request(ctx, restate.Void{})

if err != nil {
    if ctx.Err() == context.DeadlineExceeded {
        log.Printf("Timeout fetching cart for user %s: %v", userID, err)
        http.Error(w, "Request timeout - Restate may be unavailable", http.StatusGatewayTimeout)
        return
    }
    // ... other error handling
}

4. Improved Health Check

Enhanced: Main /health endpoint with structured response

{
  "status": "healthy",
  "services": {
    "http": "ok",
    "database": "ok",
    "restate": "check /health/restate"
  }
}
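A response of this shape could be produced as in the sketch below. The type and function names (`services`, `healthResponse`, `currentHealth`, `healthJSON`, `healthHandler`) are hypothetical, and the status values are hard-coded for illustration; a real handler would presumably derive them from actual checks.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// services uses a struct rather than a map so the JSON field order is fixed.
type services struct {
	HTTP     string `json:"http"`
	Database string `json:"database"`
	Restate  string `json:"restate"`
}

type healthResponse struct {
	Status   string   `json:"status"`
	Services services `json:"services"`
}

// currentHealth returns hard-coded values for illustration only.
func currentHealth() healthResponse {
	return healthResponse{
		Status: "healthy",
		Services: services{
			HTTP:     "ok",
			Database: "ok",
			Restate:  "check /health/restate",
		},
	}
}

// healthJSON encodes the current health as compact JSON.
func healthJSON() string {
	b, err := json.Marshal(currentHealth())
	if err != nil {
		return ""
	}
	return string(b)
}

// healthHandler serves the structured response over HTTP.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	fmt.Fprint(w, healthJSON())
}

func main() {
	fmt.Println(healthJSON())
}
```

Pointing the `restate` field at `/health/restate` instead of probing inline keeps the main health endpoint fast even when Restate is slow to answer.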

Testing

1. Verify SSE Streams Work

# This should stay connected indefinitely (beyond 60s).
# -N disables curl's output buffering so events print as they arrive.
curl -N http://localhost:8081/stream/notifications

Expected: No "context deadline exceeded" errors after 60 seconds

2. Test Cart Operations

# Add item to cart
curl -X POST http://localhost:8081/api/cart/add \
  -H "Content-Type: application/json" \
  -d '{"product_id": 1, "quantity": 2}'

# Get cart
curl http://localhost:8081/api/cart

Expected:

  • Detailed logs in backend console
  • No connection reset errors
  • Successful responses

3. Check Health Endpoints

# General health
curl http://localhost:8081/health

# Restate connectivity
curl http://localhost:8081/health/restate

Files Modified

  • main.go - Selective timeout middleware
  • ingress.go - Health checks and error handling

Next Steps

  1. Restart the backend with the new build:

    cd /home/chaschel/Documents/ibm/go/apps/zeroapp/prototype/backend
    ./bin/zeroapp
  2. Monitor logs for the improved logging output

  3. Test SSE stability by keeping a browser tab open for > 60 seconds

  4. Verify cart operations no longer experience connection resets