
feat: add gzip compression to SQS replay events generated from CloudWatch #887


Merged

Conversation

Kavindu-Dodan
Contributor

@Kavindu-Dodan Kavindu-Dodan commented May 9, 2025

What does this PR do?

AWS recently increased the CloudWatch Logs maximum event size to 1MB [1]. However, SQS message size is still capped at 256KB [2].

This means replay messages generated from the CloudWatch input can exceed the SQS message size limit (see the screenshot below).

[screenshot: SQS error for exceeding the maximum message size]

This PR attempts to avoid this edge case by adding gzip compression to the SQS message's event payload. While this will not completely eliminate the limitation, it allows handling CloudWatch messages up to the 1MB maximum as long as the gzip compression ratio of the message reaches at least 4:1 (1MB : 256KB).
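For illustration, a minimal sketch of the compression described above (the helper names are hypothetical, not the forwarder's actual functions; the approach is gzip + base64 applied to the event payload before it is placed in the SQS message body):

import base64
import gzip
import json


def compress_event_payload(event_payload: dict) -> str:
    # Gzip-compress the JSON payload and base64-encode it so it can be stored
    # as a string field inside the SQS message body.
    raw = json.dumps(event_payload).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("utf-8")


def decompress_event_payload(encoded: str) -> dict:
    # Reverse the encoding above when the message is read back for replay.
    raw = gzip.decompress(base64.b64decode(encoded))
    return json.loads(raw.decode("utf-8"))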

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.md and updated share/version.py, if my change requires a new release.

How to test this PR locally

  • Build the lambda zip (make package)
  • Deploy using the terraform helper
  • Introduce a configuration error (e.g., an invalid API key) so that events get pushed to SQS
  • Retrieve the message and inspect its content to verify gzip compression

I have also validated replaying the same message through the Lambda to verify correct parsing of gzipped payloads.
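A hedged sketch of the verification step (the queue URL is a placeholder, and the event_payload field name follows the example later in this thread; actual field names may differ):

import base64
import gzip
import json

import boto3

# Placeholder: use your deployment's replay queue URL.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/esf-replay-queue"

sqs = boto3.client("sqs")
response = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=1,
    MessageAttributeNames=["All"],
)

for message in response.get("Messages", []):
    body = json.loads(message["Body"])
    payload = body["event_payload"]
    if isinstance(payload, str):
        # Compressed form: base64(gzip(json)) -- decode it to inspect the event.
        payload = json.loads(gzip.decompress(base64.b64decode(payload)))
    print(json.dumps(payload, indent=2))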

Screenshots

Compressed vs. uncompressed message size comparison in the SQS queue:

[screenshot: compressed vs. uncompressed message sizes]

Footnotes

  1. https://aws.amazon.com/about-aws/whats-new/2025/04/amazon-cloudwatch-logs-increases-log-event-size-1-mb

  2. https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html


github-actions bot commented May 9, 2025

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)


github-actions bot commented May 9, 2025

Minimum allowed coverage is 100%

Generated by 🐒 cobertura-action against 9630a12

@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch 6 times, most recently from cb864db to 92eb7a6 Compare May 9, 2025 21:33
@elastic elastic deleted a comment from bturquet May 12, 2025
@elastic elastic deleted a comment from constanca-m May 12, 2025
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch from 92eb7a6 to dcd5637 Compare May 12, 2025 21:09
@Kavindu-Dodan Kavindu-Dodan marked this pull request as ready for review May 12, 2025 21:56
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch from dcd5637 to 2c12cc2 Compare May 12, 2025 22:03
constanca-m
constanca-m previously approved these changes May 12, 2025
Contributor

@constanca-m constanca-m left a comment

Apart from the tiny detail of the replay queue, it's looking good.

Contributor

@zmoog zmoog left a comment

I see we're only compressing messages from the cloudwatch logs trigger. Should we consider compressing messages from all triggers? I'm not 100% sure this is the right call, but I want to evaluate pros/cons with you.

Having all the messages encoded using the same format may be appropriate for consistency.

@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch 3 times, most recently from 900f8c9 to 09b6784 Compare May 15, 2025 19:57
@Kavindu-Dodan
Contributor Author

Kavindu-Dodan commented May 15, 2025

I see we're only compressing messages from the cloudwatch logs trigger. Should we consider compressing messages from all triggers? I'm not 100% sure this is the right call, but I want to evaluate pros/cons with you.

Having all the messages encoded using the same format may be appropriate for consistency.

The change applies to the CloudWatch trigger as well as to shipper failures for parsed events. Shipping failures apply to all inputs and are handled in a common path in utils.py.

Regarding triggers such as sqs, s3 & kinesis: they all use SQS Move, which internally uses a sqs_client.send_message call. But these are spread across dedicated implementations, and I do not want to touch them due to the following concerns:

  • The code is already distributed and not unified
  • Unifying requires a bigger refactoring for each trigger
  • Some logic used in the triggers is convoluted, and a refactoring could introduce unwanted bugs
  • The original concern of message size only applied to the CloudWatch trigger

That said, the pros I see are:

  • Like you said, a unified SQS shipper
  • Reduced complexity of trigger implementations by extracting the SQS shipper logic

@Kavindu-Dodan Kavindu-Dodan requested a review from zmoog May 15, 2025 20:20
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch from 09b6784 to 7f21f96 Compare May 15, 2025 20:32
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch 2 times, most recently from 97c28c4 to 018ed57 Compare May 15, 2025 21:11
@zmoog
Contributor

zmoog commented May 20, 2025

@Kavindu-Dodan, I see you decided to only compress the event_payload, while keeping all other fields intact.

So, given the following non-compressed event:

{
  "output_destination": "https://idonotexist.zmoog.dev:443",
  "output_args": {
    "es_datastream_name": "logs-generic-release120script"
  },
  "event_payload": {
    "@timestamp": "2025-05-20T16:59:59.268898Z",
    "tags": [
      "forwarded",
      "generic"
    ],
    "data_stream": {
      "type": "logs",
      "dataset": "generic",
      "namespace": "release120script"
    },
    "event": {
      "dataset": "generic"
    },
    "_op_type": "create",
    "_index": "logs-generic-release120script",
    "_id": "123",
    "message": "Example Event 3",
    "log": {
      "offset": 0,
      "file": {
        "path": "mbranca-test/2025-04-30"
      }
    },
    "aws": {
      "cloudwatch": {
        "log_group": "mbranca-test",
        "log_stream": "2025-04-30",
        "event_id": "38976358746361840955885811431946476608269635496678719490"
      }
    },
    "cloud": {
      "provider": "aws",
      "region": "eu-west-1",
      "account": {
        "id": "123"
      }
    }
  },
  "event_input_id": "arn:aws:logs:eu-west-1:123:log-group:mbranca-test:*"
}

If we store the example event in compressed form, it would become:

{
  "output_destination": "https://idonotexist.zmoog.dev:443",
  "output_args": {
    "es_datastream_name": "logs-generic-release120script"
  },
  "event_payload": "<base64 encoded event payload>",
  "event_input_id": "arn:aws:logs:eu-west-1:123:log-group:mbranca-test:*"
}

right?

In the spirit of keeping things simple, why not compress the whole message? Wouldn't it be simpler to replace calls to json_parser() with calls to encode_event()?

@Kavindu-Dodan
Contributor Author

@zmoog good points, and thank you for raising them :)

event formats

Yes, what you see is the event format we store, and your comparison is correct; we only compress event_payload.

In the spirit of keeping things simple, why not compress the whole message? Wouldn't it be simpler to replace calls to json_parser() with calls to encode_event()?

The reason is some pre-parsing that happens internally in utils.py [1] to detect the event type. If we compress the full event, this logic fails because it relies on the first record's entries to detect the event type (for SQS replay that's replay-sqs). I wanted to avoid changing this logic, hence the approach you see in the PR :)

[1] https://github.com/elastic/elastic-serverless-forwarder/blob/main/handlers/aws/utils.py#L399-L409

@zmoog
Contributor

zmoog commented May 21, 2025

The reason is some pre-parsing that happens internally in utils.py [1] to detect the event type. If we compress the full event, this logic fails because it relies on the first record's entries to detect the event type (for SQS replay that's replay-sqs). I wanted to avoid changing this logic, hence the approach you see in the PR :)

Uhm, maybe I'm wrong, but I feel we're only patching the CloudWatch path. The integration tests focus on this use case, and I'm not sure we're testing the case where we already have a non-compressed message in the queue.

I'm not 100% sure we are covering all cases and not increasing complexity by handling CloudWatch differently from the other triggers.

I'll take some time aside for a deeper look.

@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch from 018ed57 to 953c0bb Compare May 21, 2025 21:16
@Kavindu-Dodan Kavindu-Dodan requested a review from zmoog May 22, 2025 13:35
Contributor

@zmoog zmoog left a comment

What about using SQS message attributes (e.g., Content-Encoding) to flag compressed content to make the flow cleaner and more explicit?

Here's how it could work:

Benefits:

  1. Explicit detection - No need to guess if the content is compressed
  2. Better error handling - Only decompress when the attribute is present
  3. Backward compatibility - Uncompressed messages work unchanged
  4. Performance - Avoid unnecessary decompression attempts

Implementation approach:

  1. Sending side - Add message attribute Content-Encoding when compressing
  2. Receiving side - Check attribute Content-Encoding before decompressing

This eliminates the try/catch fallback logic and makes the compression handling explicit and reliable. The code becomes more maintainable, and the intent is clearer.
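For example, a minimal sketch of the suggested attribute-based flow, assuming gzip + base64 for the body and the boto3 receive_message response shape (the attribute name Content-Encoding follows the suggestion above; the PR's final implementation may differ):

import base64
import gzip
import json

import boto3

sqs = boto3.client("sqs")


def send_compressed(queue_url: str, event: dict) -> None:
    # Sending side: compress the body and flag it with a message attribute.
    body = base64.b64encode(gzip.compress(json.dumps(event).encode("utf-8"))).decode("utf-8")
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        MessageAttributes={
            "Content-Encoding": {"DataType": "String", "StringValue": "gzip"},
        },
    )


def read_message(message: dict) -> dict:
    # Receiving side: decompress only when the attribute says the body is gzipped;
    # uncompressed legacy messages pass through unchanged.
    attributes = message.get("MessageAttributes", {})
    if attributes.get("Content-Encoding", {}).get("StringValue") == "gzip":
        return json.loads(gzip.decompress(base64.b64decode(message["Body"])))
    return json.loads(message["Body"])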

@zmoog zmoog self-assigned this May 23, 2025
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch 3 times, most recently from 10747c6 to 7d51769 Compare May 27, 2025 17:19
@Kavindu-Dodan Kavindu-Dodan requested a review from zmoog May 27, 2025 18:09
@Kavindu-Dodan
Contributor Author

What about using SQS message attributes (e.g., Content-Encoding) to flag compressed content to make the flow cleaner and more explicit?

@zmoog I have implemented this suggestion with 7d51769

Prior to this change, only replay messages had attributes. With this change, we add the payloadEncoding attribute to the SQS message:

[screenshot: SQS message attributes showing payloadEncoding]

I have validated the implementation with the following workflows:

  • Replay messages without attributes (migrating from the old to the new implementation and replaying SQS messages)
  • Replay messages with attributes
  • A mix of both of the above

I can confirm it works. Can you have another look?
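A minimal sketch of how the receiving side could stay backward compatible with the payloadEncoding attribute mentioned above, assuming the Lambda SQS event record shape (messageAttributes / stringValue) and gzip + base64 applied to event_payload; the attribute value and field handling are illustrative, not the PR's exact code:

import base64
import gzip
import json


def decode_replay_body(record: dict) -> dict:
    # Handle both old (uncompressed) and new (gzipped) replay messages based on
    # the payloadEncoding message attribute on the SQS record.
    attributes = record.get("messageAttributes", {})
    encoding = attributes.get("payloadEncoding", {}).get("stringValue", "")
    body = json.loads(record["body"])
    if encoding == "gzip":
        raw = gzip.decompress(base64.b64decode(body["event_payload"]))
        body["event_payload"] = json.loads(raw)
    return body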

@Kavindu-Dodan Kavindu-Dodan requested a review from zmoog May 28, 2025 16:09
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch 3 times, most recently from 931ef93 to 015cd35 Compare May 30, 2025 14:58
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch from 015cd35 to 9630a12 Compare May 30, 2025 14:58
@Kavindu-Dodan
Contributor Author

@zmoog thank you for all the discussions and reviews on the PR 🙌 Enjoyed working on this.

@Kavindu-Dodan Kavindu-Dodan merged commit f9e95c3 into elastic:main Jun 3, 2025
7 checks passed