
feat: add gzip compression to SQS replay events generated from CloudWatch #887


Merged

Conversation

Kavindu-Dodan
Contributor

@Kavindu-Dodan Kavindu-Dodan commented May 9, 2025

What does this PR do?

AWS recently increased the CloudWatch Logs maximum event size to 1MB [1]. However, SQS message size is still capped at 256KB [2].

This means replay messages generated from the CloudWatch input can exceed the SQS message size limit (see the screenshot below).

[screenshot: SQS error for exceeding the maximum message size]

This PR attempts to avoid this edge case by adding gzip compression to the SQS message's event payload. While this will not completely eliminate the limitation, it allows handling CloudWatch messages up to the 1MB maximum as long as the gzip compression ratio of the message reaches at least 4:1 (1MB : 256KB).
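For illustration, a minimal sketch of the compression described above (the helper names are hypothetical, not the forwarder's actual functions; the approach is gzip + base64 applied to the event payload before it is placed in the SQS message body):

import base64
import gzip
import json


def compress_event_payload(event_payload: dict) -> str:
    # Gzip-compress the JSON payload and base64-encode it so it can be stored
    # as a string field inside the SQS message body.
    raw = json.dumps(event_payload).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("utf-8")


def decompress_event_payload(encoded: str) -> dict:
    # Reverse the encoding above when the message is read back for replay.
    raw = gzip.decompress(base64.b64decode(encoded))
    return json.loads(raw.decode("utf-8"))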

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.md and updated share/version.py, if my change requires a new release.

How to test this PR locally

  • Build the lambda zip (make package)
  • Deploy using the terraform helper
  • Introduce a configuration error (e.g., an invalid API key) so that events get pushed to SQS
  • Retrieve the message and inspect its content to verify gzip compression

I have also validated replaying the same message through the Lambda to verify correct parsing of gzipped payloads.
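A hedged sketch of the verification step (the queue URL is a placeholder, and the event_payload field name follows the example later in this thread; actual field names may differ):

import base64
import gzip
import json

import boto3

# Placeholder: use your deployment's replay queue URL.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/esf-replay-queue"

sqs = boto3.client("sqs")
response = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=1,
    MessageAttributeNames=["All"],
)

for message in response.get("Messages", []):
    body = json.loads(message["Body"])
    payload = body["event_payload"]
    if isinstance(payload, str):
        # Compressed form: base64(gzip(json)) -- decode it to inspect the event.
        payload = json.loads(gzip.decompress(base64.b64decode(payload)))
    print(json.dumps(payload, indent=2))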

Screenshots

Compressed vs. uncompressed message size comparison in the SQS queue:

[screenshot: compressed vs. uncompressed message sizes]

Footnotes

  1. https://aws.amazon.com/about-aws/whats-new/2025/04/amazon-cloudwatch-logs-increases-log-event-size-1-mb

  2. https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html


github-actions bot commented May 9, 2025

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)


github-actions bot commented May 9, 2025

Minimum allowed coverage is 100%

Generated by 🐒 cobertura-action against 9630a12

@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch 6 times, most recently from cb864db to 92eb7a6 Compare May 9, 2025 21:33
@elastic elastic deleted a comment from bturquet May 12, 2025
@elastic elastic deleted a comment from constanca-m May 12, 2025
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch from 92eb7a6 to dcd5637 Compare May 12, 2025 21:09
@Kavindu-Dodan Kavindu-Dodan marked this pull request as ready for review May 12, 2025 21:56
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch from dcd5637 to 2c12cc2 Compare May 12, 2025 22:03
constanca-m
constanca-m previously approved these changes May 12, 2025
Contributor

@constanca-m constanca-m left a comment

Apart from the tiny detail of the replay queue, it's looking good.

Contributor

@zmoog zmoog left a comment

I see we're only compressing messages from the cloudwatch logs trigger. Should we consider compressing messages from all triggers? I'm not 100% sure this is the right call, but I want to evaluate pros/cons with you.

Having all the messages encoded using the same format may be appropriate for consistency.

@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch 3 times, most recently from 900f8c9 to 09b6784 Compare May 15, 2025 19:57
@Kavindu-Dodan
Contributor Author

Kavindu-Dodan commented May 15, 2025

I see we're only compressing messages from the cloudwatch logs trigger. Should we consider compressing messages from all triggers? I'm not 100% sure this is the right call, but I want to evaluate pros/cons with you.

Having all the messages encoded using the same format may be appropriate for consistency.

The change applies to the CloudWatch trigger as well as to shipper failures for parsed events. Shipping failures apply to all inputs and are handled in a common path in utils.py.

Regarding triggers such as sqs, s3 & kinesis: they all use SQS Move, which internally uses a sqs_client.send_message call. But these are spread across dedicated implementations, and I do not want to touch them due to the following concerns:

  • The code is already distributed and not unified
  • Unifying requires a bigger refactoring for each trigger
  • Some logic used in the triggers is convoluted, and a refactoring could introduce unwanted bugs
  • The original concern of message size only applied to the CloudWatch trigger

That said, the pros I see are:

  • Like you said, a unified SQS shipper
  • Reduced complexity of trigger implementations by extracting the SQS shipper logic

@Kavindu-Dodan Kavindu-Dodan requested a review from zmoog May 15, 2025 20:20
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch from 09b6784 to 7f21f96 Compare May 15, 2025 20:32
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch 2 times, most recently from 97c28c4 to 018ed57 Compare May 15, 2025 21:11
@zmoog
Contributor

zmoog commented May 20, 2025

@Kavindu-Dodan, I see you decided to only compress the event_payload, while keeping all other fields intact.

So, given the following non-compressed event:

{
  "output_destination": "https://idonotexist.zmoog.dev:443",
  "output_args": {
    "es_datastream_name": "logs-generic-release120script"
  },
  "event_payload": {
    "@timestamp": "2025-05-20T16:59:59.268898Z",
    "tags": [
      "forwarded",
      "generic"
    ],
    "data_stream": {
      "type": "logs",
      "dataset": "generic",
      "namespace": "release120script"
    },
    "event": {
      "dataset": "generic"
    },
    "_op_type": "create",
    "_index": "logs-generic-release120script",
    "_id": "123",
    "message": "Example Event 3",
    "log": {
      "offset": 0,
      "file": {
        "path": "mbranca-test/2025-04-30"
      }
    },
    "aws": {
      "cloudwatch": {
        "log_group": "mbranca-test",
        "log_stream": "2025-04-30",
        "event_id": "38976358746361840955885811431946476608269635496678719490"
      }
    },
    "cloud": {
      "provider": "aws",
      "region": "eu-west-1",
      "account": {
        "id": "123"
      }
    }
  },
  "event_input_id": "arn:aws:logs:eu-west-1:123:log-group:mbranca-test:*"
}

If we store the example event in compressed form, it would become:

{
  "output_destination": "https://idonotexist.zmoog.dev:443",
  "output_args": {
    "es_datastream_name": "logs-generic-release120script"
  },
  "event_payload": "<base64 encoded event payload>",
  "event_input_id": "arn:aws:logs:eu-west-1:123:log-group:mbranca-test:*"
}

right?

In the spirit of keeping things simple, why not compress the whole message? Wouldn't it be simpler to replace calls to json_parser() with calls to encode_event()?

@Kavindu-Dodan
Contributor Author

@zmoog good points, and thank you for raising them :)

event formats

Yes, what you see is the event format we store, and your comparison is correct; we only compress event_payload.

In the spirit of keeping things simple, why not compress the whole message? Wouldn't it be simpler to replace calls to json_parser() with calls to encode_event()?

The reason is some pre-parsing that happens internally in utils.py [1] to detect the event type. If we compress the full event, this logic fails because it relies on the first record's entries to detect the event type (for SQS replay that's replay-sqs). I wanted to avoid changing this logic, hence the approach you see in the PR :)

[1] https://github.com/elastic/elastic-serverless-forwarder/blob/main/handlers/aws/utils.py#L399-L409

@zmoog
Contributor

zmoog commented May 21, 2025

The reason is some pre-parsing that happens internally in utils.py [1] to detect the event type. If we compress the full event, this logic fails because it relies on the first record's entries to detect the event type (for SQS replay that's replay-sqs). I wanted to avoid changing this logic, hence the approach you see in the PR :)

Uhm, maybe I'm wrong, but I feel we're only patching the CloudWatch path. The integration tests focus on this use case, and I'm not sure we're testing the case where we already have a non-compressed message in the queue.

I'm not 100% sure we are covering all cases and not increasing complexity by handling CloudWatch differently from the other triggers.

I'll take some time aside for a deeper look.

@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch from 018ed57 to 953c0bb Compare May 21, 2025 21:16
@Kavindu-Dodan Kavindu-Dodan requested a review from zmoog May 22, 2025 13:35
Contributor

@zmoog zmoog left a comment

What about using SQS message attributes (e.g., Content-Encoding) to flag compressed content to make the flow cleaner and more explicit?

Here's how it could work:

Benefits:

  1. Explicit detection - No need to guess if the content is compressed
  2. Better error handling - Only decompress when the attribute is present
  3. Backward compatibility - Uncompressed messages work unchanged
  4. Performance - Avoid unnecessary decompression attempts

Implementation approach:

  1. Sending side - Add message attribute Content-Encoding when compressing
  2. Receiving side - Check attribute Content-Encoding before decompressing

This eliminates the try/catch fallback logic and makes the compression handling explicit and reliable. The code becomes more maintainable, and the intent is clearer.
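For example, a minimal sketch of the suggested attribute-based flow, assuming gzip + base64 for the body and the boto3 receive_message response shape (the attribute name Content-Encoding follows the suggestion above; the PR's final implementation may differ):

import base64
import gzip
import json

import boto3

sqs = boto3.client("sqs")


def send_compressed(queue_url: str, event: dict) -> None:
    # Sending side: compress the body and flag it with a message attribute.
    body = base64.b64encode(gzip.compress(json.dumps(event).encode("utf-8"))).decode("utf-8")
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        MessageAttributes={
            "Content-Encoding": {"DataType": "String", "StringValue": "gzip"},
        },
    )


def read_message(message: dict) -> dict:
    # Receiving side: decompress only when the attribute says the body is gzipped;
    # uncompressed legacy messages pass through unchanged.
    attributes = message.get("MessageAttributes", {})
    if attributes.get("Content-Encoding", {}).get("StringValue") == "gzip":
        return json.loads(gzip.decompress(base64.b64decode(message["Body"])))
    return json.loads(message["Body"])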

@zmoog zmoog self-assigned this May 23, 2025
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch 3 times, most recently from 10747c6 to 7d51769 Compare May 27, 2025 17:19
@Kavindu-Dodan Kavindu-Dodan requested a review from zmoog May 27, 2025 18:09
@Kavindu-Dodan
Contributor Author

What about using SQS message attributes (e.g., Content-Encoding) to flag compressed content to make the flow cleaner and more explicit?

@zmoog I have implemented this suggestion with 7d51769

Prior to this change, only replay messages had attributes. With this change, we add the payloadEncoding attribute to the SQS message:

[screenshot: SQS message attributes showing payloadEncoding]

I have validated the implementation with the following workflows:

  • Replay messages without attributes (migrating from the old to the new implementation and replaying SQS messages)
  • Replay messages with attributes
  • A mix of both of the above

I can confirm it works. Can you have another look?
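A minimal sketch of how the receiving side could stay backward compatible with the payloadEncoding attribute mentioned above, assuming the Lambda SQS event record shape (messageAttributes / stringValue) and gzip + base64 applied to event_payload; the attribute value and field handling are illustrative, not the PR's exact code:

import base64
import gzip
import json


def decode_replay_body(record: dict) -> dict:
    # Handle both old (uncompressed) and new (gzipped) replay messages based on
    # the payloadEncoding message attribute on the SQS record.
    attributes = record.get("messageAttributes", {})
    encoding = attributes.get("payloadEncoding", {}).get("stringValue", "")
    body = json.loads(record["body"])
    if encoding == "gzip":
        raw = gzip.decompress(base64.b64decode(body["event_payload"]))
        body["event_payload"] = json.loads(raw)
    return body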

@Kavindu-Dodan Kavindu-Dodan requested a review from zmoog May 28, 2025 16:09
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch 3 times, most recently from 931ef93 to 015cd35 Compare May 30, 2025 14:58
@Kavindu-Dodan Kavindu-Dodan force-pushed the feat/gzip-for-replay-queue branch from 015cd35 to 9630a12 Compare May 30, 2025 14:58
@Kavindu-Dodan
Contributor Author

@zmoog thank you for all the discussions and reviews on the PR 🙌 Enjoyed working on this.

@Kavindu-Dodan Kavindu-Dodan merged commit f9e95c3 into elastic:main Jun 3, 2025
7 checks passed