checkpoint_harmony_endpoint: improve handling of 404 and 503 errors #13009

chemamartinez · 2025-03-07T11:16:59Z

Proposed commit message

Changes in the error handling:

- When receiving a 503 Service Unavailable response, the sequence is restarted gracefully, clearing the task ID and page token and waiting for the next interval.
- When receiving a 404 Not Found, the task ID is requested again for the same timeframe.
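
As a rough illustration of the two bullets above, the following Python sketch models the state transitions. It is not the CEL program from cel.yml.hbs: the state keys (`task_id`, `page_token`, `want_more`) and the function name are assumptions chosen for readability, and setting `want_more` to true on a 404 is inferred from the review discussion rather than copied from the code.

```python
from typing import Any, Dict

State = Dict[str, Any]


def next_state(state: State, status_code: int) -> State:
    """Return the state to use after an error response (hypothetical model)."""
    new_state = dict(state)  # work on a copy; the caller's state is untouched
    if status_code == 503:
        # Service Unavailable: drop the task ID and page token and stop,
        # so the whole sequence starts again at the next interval.
        new_state.pop("task_id", None)
        new_state.pop("page_token", None)
        new_state["want_more"] = False
    elif status_code == 404:
        # Not Found: forget only the task ID so it is requested again.
        # The timeframe keys are left as they are (assumed behaviour).
        new_state.pop("task_id", None)
        new_state["want_more"] = True
    return new_state
```

Any other status code leaves the state unchanged in this sketch; the real program handles more cases than these two.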

Tip

Review commit by commit for a better understanding. The first one just propagates improvements from #12795 and #12158 to the rest of the data streams, while the second contains the new changes.

Note

All data streams have identical cel.yml.hbs files.

Checklist

- I have reviewed tips for building integrations and this pull request is aligned with them.
- I have verified that all data streams collect metrics or logs.
- I have added an entry to my package's changelog.yml file.
- I have verified that Kibana version constraints are current according to guidelines.
- I have verified that any added dashboard complies with Kibana's Dashboard good practices.

elasticmachine · 2025-03-07T11:19:11Z

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

elastic-vault-github-plugin-prod · 2025-03-07T11:36:40Z

🚀 Benchmarks report

To see the full report comment with /test benchmark fullreport

chrisberkhout

Looks good. One point of clean-up with some options.

chrisberkhout · 2025-03-07T16:32:14Z

packages/checkpoint_harmony_endpoint/data_stream/antibot/agent/stream/cel.yml.hbs

+  				// 404 Not Found - Resubmit the task ID query for the same timeframe.
+  				state.with(
+  					{
+  						"events": [{"message": {"event": {"reason": "polling"}}.encode_json()}],


When want_more is true, a dummy event is required to make it immediately re-run. If there are no events it will wait for an interval.

Here, want_more is false, so that dummy event is unnecessary.
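
The rule described above can be modelled in a few lines of Python. This is only a sketch of the behaviour as stated in this comment, not the CEL input's actual scheduling code; `reruns_immediately` is a hypothetical name.

```python
from typing import Any, Dict


def reruns_immediately(state: Dict[str, Any]) -> bool:
    """True if the program would be evaluated again without waiting."""
    want_more = bool(state.get("want_more", False))
    has_events = bool(state.get("events"))  # an empty list means no events
    # Both conditions are needed: want_more alone is not enough, which is
    # why a dummy event is emitted when an immediate re-run is wanted.
    return want_more and has_events
```

With `want_more` set to false the result is always false, so a dummy event adds nothing in that branch.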

I think it should either be an empty array:

Suggested change

-  "events": [{"message": {"event": {"reason": "polling"}}.encode_json()}],
+  "events": [],

or a single event with an error message (single events get logged):

Suggested change

-  "events": [{"message": {"event": {"reason": "polling"}}.encode_json()}],
+  "events": { "error": {"message": "404: task ID not found" }},

or the same with the polling message, so the event will be logged, but the pipeline will drop it:

Suggested change

-  "events": [{"message": {"event": {"reason": "polling"}}.encode_json()}],
+  "events": {"message": {"event": {"reason": "polling"}}.encode_json(), "error": {"message": "404: task ID not found"}},

The last one would be easier if the pipeline expected parsed JSON instead of { "message": "JSON string" }, because the CEL input logging code only logs a single message if it's parsed JSON.
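
To make the two shapes concrete, here is a small Python sketch. The `is_polling_marker` check is a hypothetical stand-in for whatever condition the ingest pipeline uses to drop the polling event; only the `{"message": "JSON string"}` shape and the `event.reason` value come from the snippets above.

```python
import json
from typing import Any, Dict

# Shape currently emitted: the payload is a JSON *string* under "message".
string_event = {"message": json.dumps({"event": {"reason": "polling"}})}

# Alternative shape: the payload is already-parsed JSON.
parsed_event = {"event": {"reason": "polling"}}


def is_polling_marker(event: Dict[str, Any]) -> bool:
    """Hypothetical drop condition: decode "message" and check event.reason."""
    message = event.get("message")
    if not isinstance(message, str):
        return False
    try:
        payload = json.loads(message)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    event_field = payload.get("event")
    return isinstance(event_field, dict) and event_field.get("reason") == "polling"
```

A check written this way only recognises the string-encoded shape, so switching to parsed JSON would require a matching pipeline change.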

I don't have a strong opinion about which one. The latter two are messier, but they get some useful information into the log/ES. The first one is clean, and 404s shouldn't happen often; if they do, they can always be seen with request tracing.

Here's a relevant part of the CEL documentation:

The field should be an array, but in the case of an error condition in the CEL program it is acceptable to return a single object instead of an array; this will be wrapped as an array for publication and an error will be logged. If the single object contains a key, "error", the error value will be used to update the status of the input to report to Elastic Agent. This can be used to more rapidly respond to API failures.
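
The documented behaviour can be paraphrased with a short Python model. It mirrors only what the quoted paragraph says; the function name and the returned tuple are assumptions, and the real input is not implemented this way.

```python
from typing import Any, Dict, List, Optional, Tuple, Union

Event = Dict[str, Any]


def normalize_events(
    events: Union[List[Event], Event],
) -> Tuple[List[Event], bool, Optional[Any]]:
    """Return (events_to_publish, error_logged, status_error)."""
    if isinstance(events, list):
        # Normal case: an array is published as-is and nothing is logged.
        return events, False, None
    # Error case: a single object is wrapped as an array and an error is
    # logged. If it carries an "error" key, that value is also used to
    # update the input status reported to Elastic Agent.
    return [events], True, events.get("error")
```

For example, `normalize_events({"error": {"message": "404: task ID not found"}})` yields a one-element list, a logged error, and the error value for the status update.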

chemamartinez · 2025-03-10

I think you are talking about the 503 case instead of the 404 (that's the one that sets want_more to false).

Thanks for the detailed explanation. I would choose to log the error: receiving a 503 is not very common, so logging makes it easier to discover. My question is, if we are going to log the error, why not use the same format as for the rest of the errors? I have submitted a commit with the changes; would you mind taking a look at it again?

elasticmachine · 2025-03-10T10:25:04Z

💚 Build Succeeded

Buildkite Build
Commit: a614735

History

💚 Build #23115 succeeded 6dd0a18

cc @chemamartinez

elastic-sonarqube · 2025-03-10T10:25:08Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

chrisberkhout · 2025-03-10T10:40:37Z

packages/checkpoint_harmony_endpoint/data_stream/antibot/agent/stream/cel.yml.hbs

+  				// 404 Not Found - Resubmit the task ID query for the same timeframe.
+  				state.with(
+  					{
+  						"events": [{"message": {"event": {"reason": "polling"}}.encode_json()}],


Yes, you're right it was 503. The change looks good. I think that kind of logged error event could be used for all error cases. Doesn't necessarily have to be now.

For these error cases there are two choices:

- Return a single error event to be logged?
- Keep that error event in Elasticsearch or drop it from the ingest pipeline?

So for now it's logged and not dropped, which is fine. I think in the future we could have some conventions for those choices.

elastic-vault-github-plugin-prod · 2025-03-10T11:22:49Z

Package checkpoint_harmony_endpoint - 0.5.0 containing this change is available at https://epr.elastic.co/package/checkpoint_harmony_endpoint/0.5.0/

chemamartinez added 3 commits March 7, 2025 10:44

Apply CEL improvements in forensics to all data streams

91aa4bf

Improve handling of 404 and 503 API errors

3583633

Bump version

f35e509

chemamartinez added the enhancement, Team:Security-Service Integrations, and Integration:checkpoint_harmony_endpoint labels Mar 7, 2025

chemamartinez self-assigned this Mar 7, 2025

Update changelog

6dd0a18

chemamartinez marked this pull request as ready for review March 7, 2025 11:19

chemamartinez requested a review from a team as a code owner March 7, 2025 11:19

chemamartinez requested a review from chrisberkhout March 7, 2025 11:19

chrisberkhout approved these changes Mar 7, 2025

Log error when receiving 503 responses

a614735

chemamartinez requested a review from chrisberkhout March 10, 2025 10:15

chrisberkhout approved these changes Mar 10, 2025

chemamartinez merged commit 7ba614d into elastic:main Mar 10, 2025
7 checks passed

chemamartinez deleted the checkpoint_harmony_endpoint-cel_fixes branch March 10, 2025 11:05
