Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do delivery specs in broker configs apply to event processing errors as well as delivery failures? #3746

Open
gilesrapkinb opened this issue Mar 8, 2024 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@gilesrapkinb
Copy link

Describe the bug

This is more of a question than a bug report.

I've recently set up Knative eventing with Kafka brokers; one main broker, with another Kafka broker configured as a dead letter sink. Also a delivery spec with retries, exponential backoff etc. configured.

I was trying to test the dead letter sink setup with a test function that always returns a 500 in response to any events. The retry behaviour is not quite what I expected. Events seem to get retried forever instead of the retry count specified in the delivery spec.

I am able to get events to get sent to the dead letter sink under other scenarios, e.g. setting the test function to timeout, or respond in a way that doesn't conform to the delivery contract (like responding with a 200 without a cloudevent in the response body).

Is a function responding with a 500 error not considered a delivery failure? If not, what is the expected behaviour in this scenario? Is there another way to configure retry/backoff/dead letter behaviour for event processing failures?

Expected behavior

I initially expected the delivery spec to also apply to events that weren't processed successfully by a function that received the event (responded with a 500).

To Reproduce

  1. Create 2 kafka based brokers, and configure a delivery policy on the main broker.
    apiVersion: eventing.knative.dev/v1
    kind: Broker
    metadata:
      annotations:
        eventing.knative.dev/broker.class: Kafka
      labels:
        app: knative-broker
      name: knative-broker
      namespace: default
    spec:
      config:
        apiVersion: v1
        kind: ConfigMap
        name: knative-broker
        namespace: default
      delivery:
        backoffDelay: PT0.5S
        backoffPolicy: exponential
        deadLetterSink:
          ref:
            apiVersion: eventing.knative.dev/v1
            kind: Broker
            name: knative-broker-deadletter
            namespace: default
        retry: 12
    ---
    apiVersion: eventing.knative.dev/v1
    kind: Broker
    metadata:
      annotations:
        eventing.knative.dev/broker.class: Kafka
      labels:
        app: knative-broker
      name: knative-broker-deadletter
      namespace: default
    spec:
      config:
        apiVersion: v1
        kind: ConfigMap
        name: knative-broker
        namespace: default
      delivery:
        backoffDelay: PT0.2S
        backoffPolicy: exponential
        retry: 10
  2. Create a function that always returns a 500 in response to an event, deploy, with a trigger
  3. Send a test event to the broker, see it get repeatedly retried and not sent to the dead letter sink

Knative release version

1.13.1

Additional context

N/A

@gilesrapkinb gilesrapkinb added the kind/bug Categorizes issue or PR as related to a bug. label Mar 8, 2024
@pierDipi
Copy link
Member

pierDipi commented Mar 12, 2024

Thanks for reporting @gilesrapkinb, this is not expected, would you mind also sharing the step 2 and 3, so that we could just run them and debug the issue?

@gilesrapkinb
Copy link
Author

For step 2 you could use something like nicholasjackson/fake-service.

Deploy the function:

kn service create test500 \
--image nicholasjackson/fake-service:v0.26.2 \
--env 'LISTEN_ADDR=0.0.0.0:8080' \
--env 'ERROR_RATE=1'

Set up a trigger:

kn trigger create test500 \
--sink test500  \
--broker YOUR_BROKER

Then for step 3 push a test event to your broker.

In the kafka-broker-dispatcher logs you'll see repeated attempts to deliver the message (more than the retry setting in the broker delivery config above), with errors logged for the response from the function that look something like this:

{
  "@timestamp": "2024-03-12T19:40:36.969Z",
  "@version": "1",
  "message": "Received a failure status code that is not 2xx. failed to send event to subscriber context={topics=[knative-broker-dev-knative-broker], consumerGroup='knative-trigger-dev-test500', reference=uuid: \"f4666f13-7438-4a31-bb05-e248383d3466\"\nnamespace: \"dev\"\nname: \"test500\"\n} target=http://test500.dev.svc.cluster.local statusCode=500",
  "logger_name": "dev.knative.eventing.kafka.broker.dispatcher.impl.http.WebClientCloudEventSender",
  "thread_name": "vert.x-worker-thread-3",
  "level": "ERROR",
  "level_value": 40000,
  "context": {},
  "target": "http://test500.dev.svc.cluster.local",
  "statusCode": 500
}

Copy link
Contributor

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2024
@gilesrapkinb
Copy link
Author

/remove-lifecycle stale

@knative-prow knative-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2024
@gilesrapkinb
Copy link
Author

@pierDipi has anybody been able to reproduce this yet?

Copy link
Contributor

github-actions bot commented Oct 4, 2024

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 4, 2024
@gilesrapkinb
Copy link
Author

/remove-lifecycle stale

@knative-prow knative-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

2 participants