
feat: add SSE streaming via /events #791

Merged
merged 20 commits into master from sse-events-consume on Jun 22, 2024

Conversation

@teleivo (Contributor) commented Jun 13, 2024

This PR lets users stream instance manager events via SSE. An instance manager event can be anything from "Your DB has been saved" to changes to a pod. This PR only implements the event-consuming side; another PR producing the first instance manager event will follow.

Use cases

These use cases are supported:

[image: diagram of the supported use cases]

Note that in case C, resuming only works if the EventSource received an event (with an id) before it lost the connection. Only then can it send the Last-Event-ID HTTP header, which allows us to resume. Otherwise, the user will only get new messages, as in case A.

Architecture

Clients can stream events via HTTP server-sent events (SSE). Our web UI will rely on EventSource to establish and maintain the connection and to deliver the SSE events to callbacks.

The instance manager opens a consumer on a RabbitMQ stream for every user connecting to the HTTP /events endpoint. By default (if no Last-Event-ID HTTP header is sent), new events are relayed from RabbitMQ to the user via SSE. If the Last-Event-ID header is sent, messages from Last-Event-ID+1 onward are sent.
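As a sketch of this logic (helper names are made up for illustration, not the PR's actual code), the offset selection and SSE framing could look like:

```go
package main

import (
	"fmt"
	"strconv"
)

// formatSSE renders one server-sent event frame. Including an id: field lets
// EventSource send it back as the Last-Event-ID header on reconnect.
func formatSSE(id uint64, data string) string {
	return fmt.Sprintf("id: %d\ndata: %s\n\n", id, data)
}

// startOffset picks where a new RabbitMQ stream consumer starts: resume at
// lastEventID+1 if the client sent a Last-Event-ID header, otherwise only
// relay new events.
func startOffset(lastEventID string) (offset uint64, resume bool) {
	if lastEventID == "" {
		return 0, false // no header: relay only new events
	}
	id, err := strconv.ParseUint(lastEventID, 10, 64)
	if err != nil {
		return 0, false // malformed header: fall back to new events
	}
	return id + 1, true
}

func main() {
	fmt.Print(formatSSE(7, "your DB has been saved"))
	off, resume := startOffset("7")
	fmt.Println(off, resume)
}
```

A real handler would read the header via `r.Header.Get("Last-Event-ID")` and attach the consumer at the chosen offset.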

[image: architecture diagram of the SSE event flow]

Infrastructure Changes

We needed to do a couple of things to use RabbitMQ streaming with filtering:

Message Retention

Some napkin math 🔢 first:

At some point we want to push k8s status updates via SSE. I watched k8s pod events for a day across all namespaces and observed the following rates:

  • 10585.5 per day
  • ~441 per hour
  • ~7.4 per minute
  • ~0.12 per second

Note: the numbers were captured during a single day. Activity on other days might be higher and/or grow over time as more users become active.
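The hourly, per-minute, and per-second figures follow from dividing the observed daily count; a quick check:

```go
package main

import "fmt"

// perDay is the k8s pod event count observed across all namespaces in one day.
const perDay = 10585.5

func perHour() float64   { return perDay / 24 }
func perMinute() float64 { return perDay / 24 / 60 }
func perSecond() float64 { return perDay / 86400 }

func main() {
	fmt.Printf("~%.0f per hour, ~%.1f per minute, ~%.2f per second\n",
		perHour(), perMinute(), perSecond())
}
```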

I sent 20974 messages, each with the following RabbitMQ message data (plus the application headers group and kind):

{
  Instance:   "my-instance",
  Status:     "Up",
  Deployment: "my-deployment",
  Message:    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
}

This resulted in 12M on disk in total:

du -sh /var/lib/rabbitmq/mnesia/node/stream/__events_1718855104927203033
12M     /var/lib/rabbitmq/mnesia/rabbit@im-rabbitmq-rabbitmq-update-0.im-rabbitmq-rabbitmq-update-headless.instance-manager-feature.svc.cluster.local/stream/__events_1718855104927203033

RabbitMQ allows us to set retention policies for max segment size, max stream size and max age:

  • The retention policies are mainly evaluated when a new segment is created. Segments should thus not be too big, as otherwise the policies would rarely be evaluated, but also not too small, as that causes overhead for the broker.
  • Both max age and max stream length need to be reached for deletion to occur.
  • Unlike with queues, we can change these via policies later on without having to delete the stream; policies take precedence over stream arguments (rabbitmq/rabbitmq-server#3087).

https://www.cloudamqp.com/blog/rabbitmq-streams-and-replay-features-part-3-limits-and-configurations-for-streams-in-rabbitmq.html
https://groups.google.com/g/rabbitmq-users/c/TQG_nE2m4GQ

We decided we want to keep messages for 1h. Our types of messages take up little space (20974 messages ≈ 12 MB). Even storing 12000 MB worth of messages would not cause us harm on disk; that would amount to ~21 million messages, and reaching that volume within 1h would indicate a much bigger problem. This, and the fact that max age and max stream size are combined with a logical AND, is why we only pick the max age retention.

We picked

  1. a max segment size of 1MB (which holds ~20974÷12 ≈ 1748 messages)
  2. a max retention time of 1h
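The napkin math behind these picks can be sanity-checked directly from the measured 20974 messages ≈ 12 MB:

```go
package main

import "fmt"

const (
	messages = 20974 // messages sent in the experiment
	totalMB  = 12.0  // resulting on-disk stream size in MB
)

func bytesPerMessage() float64 { return totalMB * 1024 * 1024 / messages }

func messagesPerMB() float64 { return messages / totalMB }

// messagesIn extrapolates how many messages fit in mb megabytes.
func messagesIn(mb float64) float64 { return mb / totalMB * messages }

func main() {
	fmt.Printf("~%.0f bytes/message\n", bytesPerMessage())
	fmt.Printf("~%.0f messages per 1 MB segment\n", messagesPerMB())
	fmt.Printf("~%.1fM messages in 12000 MB\n", messagesIn(12000)/1e6)
}
```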

Testing

We have an integration test with 2 users streaming events. The users are in a shared group, and one of them is additionally in an exclusive group. We then test the routing logic: users only get events they should see. We also test resuming after a connection is cancelled.
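The routing rule under test boils down to a group-membership check; as an illustrative sketch (names are made up, not the PR's actual code):

```go
package main

import "fmt"

// shouldDeliver mirrors the routing rule: an event carries a group, and a
// user only receives events for groups they are a member of.
func shouldDeliver(eventGroup string, userGroups []string) bool {
	for _, g := range userGroups {
		if g == eventGroup {
			return true
		}
	}
	return false
}

func main() {
	sharedOnly := []string{"shared"}
	both := []string{"shared", "exclusive"}
	fmt.Println(shouldDeliver("shared", sharedOnly))  // both users see shared-group events
	fmt.Println(shouldDeliver("exclusive", sharedOnly)) // not a member: filtered out
	fmt.Println(shouldDeliver("exclusive", both))
}
```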

Docker issues

We switched our tests to the Bitnami RabbitMQ Docker image (https://github.com/bitnami/containers/tree/main/bitnami/rabbitmq), as this is what we use in the cluster.

When connecting to a stream we pass a URI, but the clients then ask the RabbitMQ nodes for their host/port and use that to stream (see https://www.rabbitmq.com/blog/2021/07/23/connecting-to-streams). This is configured via advertised_host/advertised_port in RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS or rabbitmq.conf. This config cannot be changed at runtime.

Our tests run outside of Docker while RabbitMQ runs in a container. The advertised_port is 5552 by default. If we relied on Docker picking a random port, our Go tests would not be able to connect, so we expose the fixed port 5552. We set advertised_host to localhost, as our host is not able to resolve the Docker container name or IP (at least not without more ⛑️).

The Bitnami image does not allow setting RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS as an env var, so we need to mount a config file.
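The mounted file could look roughly like this (a sketch assuming the rabbitmq_stream plugin's standard stream.advertised_host and stream.advertised_port config keys):

```
# rabbitmq.conf mounted into the test container: fixed port 5552 and
# localhost so Go tests running outside Docker can connect to the stream
stream.advertised_host = localhost
stream.advertised_port = 5552
```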

Gnomock does not allow mounting a file 😓 and also uses the official RabbitMQ image. We thus use testcontainers, which gives us more control over the Docker container. We only use testcontainers for RabbitMQ.


@teleivo teleivo force-pushed the sse-events-consume branch from bc1b14a to 894fb7f Compare June 13, 2024 09:01
@teleivo teleivo changed the title feat: sse events feat: add SSE streaming via /events Jun 13, 2024
@teleivo teleivo force-pushed the sse-events-consume branch 4 times, most recently from 2f45396 to 81905c8 Compare June 13, 2024 10:16
@teleivo teleivo force-pushed the sse-events-consume branch from 84a6baa to c0dba9f Compare June 15, 2024 04:55
@teleivo teleivo added the "deploy" label (used to toggle deploying PR branches to the "feature" env) Jun 17, 2024
@teleivo teleivo force-pushed the sse-events-consume branch from 9eb98f8 to 6a4d641 Compare June 17, 2024 08:17
@teleivo teleivo removed the "deploy" label Jun 17, 2024
@teleivo teleivo force-pushed the sse-events-consume branch 3 times, most recently from 5f64987 to 687bf7b Compare June 18, 2024 12:55
@teleivo teleivo added and removed the "deploy" label Jun 18, 2024
@teleivo teleivo force-pushed the sse-events-consume branch 8 times, most recently from adfa48e to 659bfcb Compare June 19, 2024 06:54
@teleivo teleivo requested a review from tonsV2 June 19, 2024 07:21
@teleivo teleivo force-pushed the sse-events-consume branch 3 times, most recently from 07a6a0e to da30ecd Compare June 19, 2024 08:38
teleivo and others added 4 commits June 20, 2024 09:54
We originally used slogin but switched away from it due to some limitations.
We forgot to get the request id from the context via our own function; it
was thus not found in the context.
@teleivo teleivo marked this pull request as ready for review June 21, 2024 03:27
@teleivo teleivo requested a review from radnov June 21, 2024 09:36
teleivo added 4 commits June 21, 2024 11:42
it's not needed
use matchUnfiltered, as the consumer should not get messages that have no group
the predicates do return false on error, but it looks odd that we do not return
@teleivo teleivo requested a review from tonsV2 June 21, 2024 11:27
@radnov (Contributor) left a comment:
🚀

@teleivo teleivo enabled auto-merge (squash) June 22, 2024 03:50
@teleivo teleivo disabled auto-merge June 22, 2024 03:50
@teleivo teleivo merged commit e941b0a into master Jun 22, 2024
7 checks passed
@teleivo teleivo deleted the sse-events-consume branch June 22, 2024 03:50