Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checkpoint_harmony_endpoint: improve handling of 404 and 503 errors #13009

Merged

Conversation

chemamartinez
Copy link
Contributor

Proposed commit message

Changes in the error handling:

- When receiving a 503 Service Unavailable response, the sequence is restarted gratefully, cleaning the task ID and page token and waiting for the next interval.
- When receiving a 404 Not Found, the task ID is requested again for the same timeframe.

Tip

Review commit by commit for a better understanding. The first one just propagates improvements from #12795 and #12158 to the rest of the data streams, while the second contains the new changes.

Note

All data streams have identical cel.yml.hbs files.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

@chemamartinez chemamartinez added enhancement New feature or request Team:Security-Service Integrations Security Service Integrations team [elastic/security-service-integrations] Integration:checkpoint_harmony_endpoint Check Point Harmony Endpoint labels Mar 7, 2025
@chemamartinez chemamartinez self-assigned this Mar 7, 2025
@chemamartinez chemamartinez marked this pull request as ready for review March 7, 2025 11:19
@chemamartinez chemamartinez requested a review from a team as a code owner March 7, 2025 11:19
@elasticmachine
Copy link

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@elastic-vault-github-plugin-prod

🚀 Benchmarks report

To see the full report comment with /test benchmark fullreport

Copy link
Contributor

@chrisberkhout chrisberkhout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One point of clean-up with some options.

// 404 Not Found - Resubmit the task ID query for the same timeframe.
state.with(
{
"events": [{"message": {"event": {"reason": "polling"}}.encode_json()}],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When want_more is true, a dummy event is required to make it immediately re-run. If there are no events it will wait for an interval.

Here, want_more is false, so that dummy event is unnecessary.

I think it should either be an empty array:

Suggested change
"events": [{"message": {"event": {"reason": "polling"}}.encode_json()}],
"events": [],

or a single event with an error message (single events get logged):

Suggested change
"events": [{"message": {"event": {"reason": "polling"}}.encode_json()}],
"events": { "error": {"message": "404: task ID not found" }},

or the same with the polling message, so the event will be logged, but the pipeline will drop it:

Suggested change
"events": [{"message": {"event": {"reason": "polling"}}.encode_json()}],
"events": {"message": {"event": {"reason": "polling"}}.encode_json(), "error": {"message": "404: task ID not found"}},

The last one would be easier if the pipeline expected parsed JSON instead of { "message": "JSON string" }, because the CEL input logging code only logs a single message if it's parsed JSON.

I don't have a strong opinion about which one. The later ones are more messy, but get some useful information in the log/ES. The first one is clean and 404s shouldn't be happening a lot, and if they are they can always be seen with request tracing.

Here's a relevant part of the CEL documentation:

The field should be an array, but in the case of an error condition in the CEL program it is acceptable to return a single object instead of an array; this will will be wrapped as an array for publication and an error will be logged. If the single object contains a key, "error", the error value will be used to update the status of the input to report to Elastic Agent. This can be used to more rapidly respond to API failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are talking about the 503 case instead of the 404 (that's the one that sets want_more to false).

Thanks for the detailed explanation, I would choose to log the error, as receiving a 503 is not very common and so we can discover it more easily. My question is if we are going to log the error, why not use the same format as for the rest of the errors? I have submitted a commit with the changes, would you mind taking a look at it again?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right it was 503. The change looks good. I think that kind of logged error event could be used for all error cases. Doesn't necessarily have to be now.

For these error cases there are two choices:

  • return single error event to be logged?
  • keep that error event in Elasticsearch or drop it from the ingest pipeline?

So for now it's logged and not dropped, which is fine. I think in the future we could have some conventions for those choices.

@elasticmachine
Copy link

💚 Build Succeeded

History

cc @chemamartinez

// 404 Not Found - Resubmit the task ID query for the same timeframe.
state.with(
{
"events": [{"message": {"event": {"reason": "polling"}}.encode_json()}],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right it was 503. The change looks good. I think that kind of logged error event could be used for all error cases. Doesn't necessarily have to be now.

For these error cases there are two choices:

  • return single error event to be logged?
  • keep that error event in Elasticsearch or drop it from the ingest pipeline?

So for now it's logged and not dropped, which is fine. I think in the future we could have some conventions for those choices.

@chemamartinez chemamartinez merged commit 7ba614d into elastic:main Mar 10, 2025
7 checks passed
@chemamartinez chemamartinez deleted the checkpoint_harmony_endpoint-cel_fixes branch March 10, 2025 11:05
@elastic-vault-github-plugin-prod

Package checkpoint_harmony_endpoint - 0.5.0 containing this change is available at https://epr.elastic.co/package/checkpoint_harmony_endpoint/0.5.0/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Integration:checkpoint_harmony_endpoint Check Point Harmony Endpoint Team:Security-Service Integrations Security Service Integrations team [elastic/security-service-integrations]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants