Agent gets unhealthy temporarily for a long time on removing and re-adding Elastic Defend integration. #3507
Comments
@manishgupta-qasource Please review.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Looks like the system integration missed a check-in:
{
"id": "system/metrics-default",
"type": "system/metrics",
"status": "DEGRADED",
"message": "Degraded: pid '3170' missed 1 check-in",
"units": [
{
"id": "system/metrics-default-system/metrics-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb",
"type": "input",
"status": "HEALTHY",
"message": "Healthy"
},
{
"id": "system/metrics-default",
"type": "output",
"status": "HEALTHY",
"message": "Healthy"
}
]
}
while endpoint itself was starting:
{
"id": "endpoint-default",
"type": "endpoint",
"status": "STARTING",
"message": "Starting: endpoint service runtime",
"units": [
{
"id": "endpoint-default-25f03ae6-1d34-4048-ba00-0a8be899ccfd",
"type": "input",
"status": "STARTING",
"message": "Starting: endpoint service runtime"
},
{
"id": "endpoint-default",
"type": "output",
"status": "STARTING",
"message": "Starting: endpoint service runtime"
}
]
}
Looks like this happened a few times to both the system metrics and log inputs:
One thing I find a bit weird is that the missed check-in is always just after a transition from the configuring state, which requires the component to have checked in:
{"log.level":"info","@timestamp":"2023-10-04T07:59:45.087Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb (HEALTHY->CONFIGURING): Configuring","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-04T07:59:45.089Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-04T07:59:45.090Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed log-default (HEALTHY->DEGRADED): Degraded: pid '3162' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
@blakerouse does this seem like a possible race condition in how we detect the degraded state? The agent seems to be reporting inputs as having missed check-ins slightly after logging a unit state transition that would have required a check-in.
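To check how often this pattern shows up in a diagnostics bundle, the following is a minimal sketch, assuming the agent logs are newline-delimited JSON with the `@timestamp`, `component`, and `unit` fields shown in the excerpts above. The function name `find_suspect_degradations` and the 50 ms window are illustrative assumptions, not part of Elastic Agent.

```python
import json
from datetime import datetime, timedelta

# Timestamp layout as seen in the log excerpts, e.g. 2023-10-04T07:59:45.087Z
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def find_suspect_degradations(lines, window_ms=50):
    """Return (component_id, unit_ts, degraded_ts) tuples for components that went
    HEALTHY->DEGRADED within window_ms after one of their units went
    CONFIGURING->HEALTHY. window_ms is an arbitrary, hypothetical threshold."""
    window = timedelta(milliseconds=window_ms)
    last_configured = {}  # component id -> time of latest CONFIGURING->HEALTHY unit change
    suspects = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(event, dict):
            continue
        component = event.get("component") or {}
        comp_id = component.get("id")
        ts_raw = event.get("@timestamp")
        if not comp_id or not ts_raw:
            continue
        ts = datetime.strptime(ts_raw, TS_FORMAT)
        unit = event.get("unit")
        if unit:
            # Unit-level event: remember when a unit finished configuring.
            if unit.get("old_state") == "CONFIGURING" and unit.get("state") == "HEALTHY":
                last_configured[comp_id] = ts
        elif component.get("old_state") == "HEALTHY" and component.get("state") == "DEGRADED":
            # Component-level event: flag it if it closely follows a unit transition.
            prev = last_configured.get(comp_id)
            if prev is not None and timedelta(0) <= ts - prev <= window:
                suspects.append((comp_id, prev, ts))
    return suspects
```

Passing an open log file (or any iterable of lines) to the function returns one tuple per suspicious transition; for the excerpt above, the gap between the two timestamps would be 1 ms.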
@pierrehilbert Yes, this looks like the same issue to me.
Hi Team, We have revalidated this issue on the latest 8.12.0 BC5 Kibana cloud environment and found it fixed now.
Observations:
Build details:
Logs:
Hence we are marking this issue as QA:Validated. Thanks!
Kibana Build details:
Host OS: Ubuntu
Preconditions:
Steps to reproduce:
After some time, the agent returns to Healthy.
Note:
Screen Recording .zip:
Agent policies - Fleet - Elastic - Google Chrome 2023-10-04 14-12-31.zip
Expected Result:
Agent should remain Healthy on removing and re-adding the Elastic Defend integration.
Agent Json:
ip-172-31-57-238-agent-details.zip
Logs:
elastic-agent-diagnostics-2023-10-04T08-47-27Z-00.zip
Logs from another agent:
elastic-agent-diagnostics-2023-10-04T09-19-35Z-00.zip