Connection Reset and Timeout Fixes

Problem Summary

The backend was experiencing:

SSE connection failures - Streams dying after exactly 60 seconds with "context deadline exceeded"
Connection reset errors - Intermittent "connection reset by peer" when calling Restate
Heartbeat failures - SSE heartbeats failing due to timeout

Root Cause

The global timeout middleware in main.go applied a 60-second deadline to every request, including SSE streams:

r.Use(middleware.Timeout(60 * time.Second)) // ❌ Kills SSE streams!

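SSE connections are meant to be long-lived, but the middleware cancelled each request's context once the 60-second deadline passed. That ended every stream at the same mark with "context deadline exceeded" and caused cascading failures.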
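The effect of a request deadline on a streaming loop can be reproduced without any HTTP machinery. The following is a minimal sketch, not the backend's actual handler: `stream` and `demo` are hypothetical names, and a 100 ms deadline stands in for the 60-second middleware timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stream emits one event per tick until ctx ends, mimicking the select
// loop of a typical SSE handler. It returns the number of events sent
// and the reason the context ended.
func stream(ctx context.Context, tick time.Duration) (int, error) {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()

	sent := 0
	for {
		select {
		case <-ctx.Done():
			return sent, ctx.Err()
		case <-ticker.C:
			sent++
		}
	}
}

// demo runs the stream under a short deadline (a stand-in for the 60 s
// middleware timeout) and reports whether the deadline is what ended it.
func demo() (int, bool) {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	sent, err := stream(ctx, 10*time.Millisecond)
	return sent, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	sent, deadlineHit := demo()
	fmt.Printf("events sent: %d, ended by deadline: %v\n", sent, deadlineHit)
}
```

The loop itself has no bug; it stops only because the context it was handed carries a deadline, which is why the fix belongs in the router configuration rather than in the stream handlers.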

Fixes Applied

1. Selective Timeout Middleware

Changed: Removed the global timeout and applied it only to API routes

// ✅ NO global timeout
r.Use(middleware.Logger)
r.Use(middleware.Recoverer)
r.Use(middleware.RealIP)

// ✅ Timeout ONLY for API routes
r.Route("/api", func(r chi.Router) {
    r.Use(middleware.Timeout(60 * time.Second))
    // ... API routes
})

// ✅ SSE routes have NO timeout
r.Get("/stream/notifications", handlers.StreamNotifications)
r.Get("/stream/workflow/{orderID}", handlers.StreamWorkflowStatus)
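With no timeout on the stream routes, the handler itself decides when a stream ends. The sketch below shows one common shape for such a handler, assuming it sends SSE comment lines as heartbeats and stops when the client disconnects. `sseHandler` and `demo` are hypothetical names, and the real `handlers.StreamNotifications` may be structured differently.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// sseHandler writes a heartbeat comment on every interval until the
// request context ends (normally because the client disconnected).
func sseHandler(interval time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")

		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-r.Context().Done():
				return
			case <-ticker.C:
				// Lines starting with ':' are SSE comments; clients ignore them.
				if _, err := fmt.Fprint(w, ": heartbeat\n\n"); err != nil {
					return
				}
				flusher.Flush()
			}
		}
	}
}

// demo reads n heartbeats from a throwaway test server, then disconnects.
func demo(n int) int {
	srv := httptest.NewServer(sseHandler(10 * time.Millisecond))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return 0
	}
	defer resp.Body.Close()

	got := 0
	scanner := bufio.NewScanner(resp.Body)
	for got < n && scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), ": heartbeat") {
			got++
		}
	}
	return got
}

func main() {
	fmt.Println("heartbeats received:", demo(3))
}
```

Because the loop exits on `r.Context().Done()`, a disconnected client still frees the goroutine even though no middleware deadline applies.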

2. Restate Health Check Endpoint

Added: /health/restate endpoint to monitor Restate connectivity

Implemented in handlers/ingress.go.

Features:

2-second timeout for health check requests
Distinguishes between connection errors and business logic errors
Returns detailed status information

Usage:

curl http://localhost:8081/health/restate
# Response: {"status":"healthy","url":"http://localhost:9089","note":"Restate SDK is reachable"}
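One way to build such a check is sketched below. It is not the code in ingress.go: it assumes a plain HTTP probe, treats a transport-level failure as "unreachable", and treats any HTTP response (even an error status) as "reachable". The names `restateHealth`, `probe`, `healthRestateHandler` and `demo` are hypothetical; the JSON fields follow the sample response above.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// restateHealth mirrors the fields of the sample response.
type restateHealth struct {
	Status string `json:"status"`
	URL    string `json:"url"`
	Note   string `json:"note"`
}

// probe contacts url with a 2-second budget. Assumption: only a
// transport-level failure counts as a connectivity problem.
func probe(ctx context.Context, url string) restateHealth {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return restateHealth{Status: "unhealthy", URL: url, Note: "bad URL: " + err.Error()}
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return restateHealth{Status: "unhealthy", URL: url, Note: "connection error: " + err.Error()}
	}
	defer resp.Body.Close()

	if resp.StatusCode >= 400 {
		note := fmt.Sprintf("reachable, but returned HTTP %d", resp.StatusCode)
		return restateHealth{Status: "healthy", URL: url, Note: note}
	}
	return restateHealth{Status: "healthy", URL: url, Note: "Restate SDK is reachable"}
}

// healthRestateHandler serves the probe result as JSON, with 503 when unreachable.
func healthRestateHandler(restateURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		h := probe(r.Context(), restateURL)
		w.Header().Set("Content-Type", "application/json")
		if h.Status != "healthy" {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		json.NewEncoder(w).Encode(h)
	}
}

// demo returns the handler's status code against an upstream that answers
// with an application error, and again after that upstream is shut down.
func demo() (int, int) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "business logic error", http.StatusBadRequest)
	}))
	handler := healthRestateHandler(upstream.URL)

	call := func() int {
		rec := httptest.NewRecorder()
		handler(rec, httptest.NewRequest(http.MethodGet, "/health/restate", nil))
		return rec.Code
	}

	reachable := call()
	upstream.Close() // from here on, connections to the upstream are refused
	unreachable := call()
	return reachable, unreachable
}

func main() {
	reachable, unreachable := demo()
	fmt.Println("upstream up:", reachable, "upstream down:", unreachable)
}
```

Returning 503 for the unreachable case is a design choice of this sketch so that load balancers and monitors can act on the status code alone.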

3. Enhanced Error Handling

Added defensive error handling to cart operations:

✅ Per-request 5-second timeout contexts
✅ Detailed logging at each step
✅ Timeout detection with specific error messages
✅ Success logging for debugging

Example - GetCart with timeout handling:

// Create timeout context for Restate call
ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()

basket, err := restateingress.Object[restate.Void, []models.CartItem](
    h.client, "UserSession", userID, "GetBasket",
).Request(ctx, restate.Void{})

if err != nil {
    if ctx.Err() == context.DeadlineExceeded {
        log.Printf("Timeout fetching cart for user %s: %v", userID, err)
        http.Error(w, "Request timeout - Restate may be unavailable", http.StatusGatewayTimeout)
        return
    }
    // ... other error handling
}
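The elided "other error handling" branch could separate connection-level failures from everything else. The following is a sketch under assumptions: `statusFor` and `demo` are hypothetical, and whether the Restate client wraps the underlying errors so that `errors.Is` can see them is not confirmed by this walkthrough.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"syscall"
)

// statusFor maps an error from an upstream call to an HTTP status code.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, context.DeadlineExceeded):
		// The per-request timeout fired before the upstream answered.
		return http.StatusGatewayTimeout
	case errors.Is(err, syscall.ECONNRESET), errors.Is(err, syscall.ECONNREFUSED):
		// The upstream dropped or refused the connection.
		return http.StatusBadGateway
	default:
		// Anything else is treated as an application-level failure.
		return http.StatusInternalServerError
	}
}

// demo classifies three representative errors, each wrapped the way a
// client library might wrap them.
func demo() (int, int, int) {
	timeout := statusFor(fmt.Errorf("GetBasket: %w", context.DeadlineExceeded))
	reset := statusFor(fmt.Errorf("read tcp: %w", syscall.ECONNRESET))
	other := statusFor(errors.New("item out of stock"))
	return timeout, reset, other
}

func main() {
	timeout, reset, other := demo()
	fmt.Println(timeout, reset, other)
}
```

Using `errors.Is` rather than `==` keeps the classification working when an error is wrapped with additional context on its way up the call stack.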

4. Improved Health Check

Enhanced: Main /health endpoint with structured response

{
  "status": "healthy",
  "services": {
    "http": "ok",
    "database": "ok",
    "restate": "check /health/restate"
  }
}
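A response of this shape can be produced from a small struct. The sketch below is illustrative only: `healthResponse`, `healthHandler`, the `dbPing` hook, and the "unhealthy" status with a 503 code are assumptions, since the walkthrough does not show how the database status is determined.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthResponse matches the JSON structure shown above.
type healthResponse struct {
	Status   string            `json:"status"`
	Services map[string]string `json:"services"`
}

// healthHandler builds the /health response. dbPing is a hypothetical
// hook for whatever database check the backend actually uses.
func healthHandler(dbPing func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		resp := healthResponse{
			Status: "healthy",
			Services: map[string]string{
				"http":     "ok",
				"database": "ok",
				"restate":  "check /health/restate",
			},
		}
		code := http.StatusOK
		if err := dbPing(); err != nil {
			// Assumption: a failed database check marks the service unhealthy.
			resp.Status = "unhealthy"
			resp.Services["database"] = "error: " + err.Error()
			code = http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(resp)
	}
}

// okPing and failingPing are stand-ins for a real database check.
func okPing() error      { return nil }
func failingPing() error { return errors.New("connection refused") }

// demo calls the handler in-process and returns the status code and the
// decoded "status" field.
func demo(dbPing func() error) (int, string) {
	rec := httptest.NewRecorder()
	healthHandler(dbPing)(rec, httptest.NewRequest(http.MethodGet, "/health", nil))

	var got healthResponse
	if err := json.NewDecoder(rec.Body).Decode(&got); err != nil {
		return rec.Code, ""
	}
	return rec.Code, got.Status
}

func main() {
	fmt.Println(demo(okPing))
	fmt.Println(demo(failingPing))
}
```

The Restate entry stays a pointer to /health/restate, as in the original response, so the main health check never blocks on a Restate call.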

Testing

1. Verify SSE Streams Work

# This should stay connected indefinitely (beyond 60s)
curl http://localhost:8081/stream/notifications

Expected: No "context deadline exceeded" errors after 60 seconds

2. Test Cart Operations

# Add item to cart
curl -X POST http://localhost:8081/api/cart/add \
  -H "Content-Type: application/json" \
  -d '{"product_id": 1, "quantity": 2}'

# Get cart
curl http://localhost:8081/api/cart

Expected:

Detailed logs in backend console
No connection reset errors
Successful responses

3. Check Health Endpoints

# General health
curl http://localhost:8081/health

# Restate connectivity
curl http://localhost:8081/health/restate

Files Modified

main.go - Selective timeout middleware
ingress.go - Health checks and error handling

Next Steps

Restart the backend with the new build:

cd /home/chaschel/Documents/ibm/go/apps/zeroapp/prototype/backend
./bin/zeroapp

Monitor logs for the improved logging output
Test SSE stability by keeping a browser tab open for > 60 seconds
Verify cart operations no longer experience connection resets
