-
Notifications
You must be signed in to change notification settings - Fork 43
Description
Description:
Currently, the ProcessSearch function in the Event Producer (workers/event_producer/pkg/producer/producer.go) only wraps errors from AcquireLock with event.ErrTransientFailure. All other errors (Database reads/writes, GCS operations, external API calls) are returned as standard errors.
The Problem:
The Pub/Sub subscriber adapter is configured to ACK (drop) any message where the handler returns a non-transient error. This means that a temporary network glitch (e.g., Spanner read timeout, GCS 503) will cause the notification job to be permanently lost instead of retried.
webstatus.dev/lib/gcppubsub/client.go
Lines 68 to 82 in 04a9443
| err := sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) { | |
| // Execute the worker's handler | |
| workErr := handler(ctx, msg.Data) | |
| if workErr == nil { | |
| // ACK: Success | |
| msg.Ack() | |
| } else if errors.Is(workErr, event.ErrTransientFailure) { | |
| // NACK: Retry later | |
| msg.Nack() | |
| } else { | |
| // ACK: Permanent failure or unknown error, do not retry | |
| slog.ErrorContext(ctx, "permanent failure", "error", workErr) | |
| msg.Ack() | |
| } | |
| }) |
Conversely, simply wrapping all errors as transient would cause "poison pills" (e.g., invalid Saved Search IDs) to loop infinitely.
Objective:
We need to implement a robust error handling strategy that retries by default (to catch network issues) but fails fast on known invalid states.
Proposed Changes:
- Define Standard Errors: Create
lib/workertypes/errors.goto define sentinel errors for adapters:
var ErrEntityNotFound = errors.New("entity not found")var ErrInvalidData = errors.New("invalid data format")
- Update Adapters: Modify Spanner and GCS adapters to return these specific errors where appropriate (e.g.,
GetLatestEventreturningErrEntityNotFoundif the search ID is invalid). - Update ProcessSearch Logic: Refactor error handling in
ProcessSearchto follow this pattern:
- Permanent Failure Check: If the error matches
workertypes.ErrEntityNotFoundorworkertypes.ErrInvalidData, return it directly (triggering an ACK/Drop). - Transient Default: For all other errors (unknown or network-related), wrap them with
event.ErrTransientFailure(triggering a NACK/Retry).
Acceptance Criteria:
- Transient failures (network blips) in DB, Storage, or Differ now trigger a retry (NACK).
- Permanent failures (missing search ID, bad data) trigger an immediate failure (ACK) and are not retried.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status