Agent gets unhealthy temporarily for a long time on removing and re-adding Elastic Defend integration. #3507

amolnater-qasource · 2023-10-04T09:52:00Z

Kibana Build details:

VERSION: 8.11.0 SNAPSHOT
BUILD: 67640
COMMIT: 2174a95807a874b292fb3c821445df577ceac4e4

Host OS: Ubuntu

Preconditions:

8.11.0 SNAPSHOT Kibana cloud environment should be available.
Linux .tar agent should be installed with policy having System and Elastic Defend integrations.

Steps to reproduce:

Navigate to Fleet>Agents tab and observe agent is available with Elastic Defend integration.
Now navigate to the policy and delete Elastic Defend integration.
Wait for 2 minutes and re-add the Elastic Defend integration.
Observe that the agent becomes unhealthy for a long time (sometimes up to 10 minutes).
After some time, the agent returns to Healthy.

Note:

The issue does not depend on the Agent Tamper protection setting (whether enabled or disabled).
The issue is consistently reproducible.

Screen Recording .zip:
Agent policies - Fleet - Elastic - Google Chrome 2023-10-04 14-12-31.zip

Expected Result:
Agent should remain Healthy on removing and re-adding Elastic Defend integration.

Agent Json:
ip-172-31-57-238-agent-details.zip

Logs:
elastic-agent-diagnostics-2023-10-04T08-47-27Z-00.zip

Logs from another agent:
elastic-agent-diagnostics-2023-10-04T09-19-35Z-00.zip


amolnater-qasource · 2023-10-04T09:52:12Z

@manishgupta-qasource Please review.

elasticmachine · 2023-10-04T09:52:25Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

cmacknz · 2023-10-04T19:56:40Z

Looks like the system integration missed a check in:

    {
      "id": "system/metrics-default",
      "type": "system/metrics",
      "status": "DEGRADED",
      "message": "Degraded: pid '3170' missed 1 check-in",
      "units": [
        {
          "id": "system/metrics-default-system/metrics-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "system/metrics-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    }

while endpoint itself was starting:

    {
      "id": "endpoint-default",
      "type": "endpoint",
      "status": "STARTING",
      "message": "Starting: endpoint service runtime",
      "units": [
        {
          "id": "endpoint-default-25f03ae6-1d34-4048-ba00-0a8be899ccfd",
          "type": "input",
          "status": "STARTING",
          "message": "Starting: endpoint service runtime"
        },
        {
          "id": "endpoint-default",
          "type": "output",
          "status": "STARTING",
          "message": "Starting: endpoint service runtime"
        }
      ]
    }
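To list every component that is not healthy from entries like the two above, a minimal Python sketch follows. It assumes the component entries have been gathered into a JSON list with the same `id`, `status`, and `message` fields shown in the excerpts. The function name `unhealthy_components`, the `SAMPLE` data, and the third (healthy) entry are illustrative and are not part of the agent or its diagnostics tooling.

```python
import json


def unhealthy_components(components):
    """Return (id, status, message) for each component whose status is not HEALTHY."""
    return [
        (c["id"], c["status"], c.get("message", ""))
        for c in components
        if c.get("status") != "HEALTHY"
    ]


# Trimmed copies of the two entries quoted above ("units" omitted), plus one
# hypothetical healthy entry added only to show that it is filtered out.
SAMPLE = json.loads("""
[
  {"id": "system/metrics-default", "type": "system/metrics",
   "status": "DEGRADED", "message": "Degraded: pid '3170' missed 1 check-in"},
  {"id": "endpoint-default", "type": "endpoint",
   "status": "STARTING", "message": "Starting: endpoint service runtime"},
  {"id": "log-default", "status": "HEALTHY", "message": "Healthy"}
]
""")

for component_id, status, message in unhealthy_components(SAMPLE):
    print(f"{component_id}: {status} ({message})")
```

The check looks only at the component-level `status`, which matches the first excerpt, where the component is DEGRADED while both of its units still report HEALTHY.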

cmacknz · 2023-10-04T20:02:37Z

Looks like this happened a few times to both the system metrics and log inputs:

~/Downloads/elastic-agent-diagnostics-2023-10-04T08-47-27Z-00
❯ rg missed logs
logs/elastic-agent-35dbbd/elastic-agent-20231004-5.ndjson
3220:{"log.level":"warn","@timestamp":"2023-10-04T08:43:27.461Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed system/metrics-default (HEALTHY->DEGRADED): Degraded: pid '3170' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}

logs/elastic-agent-35dbbd/elastic-agent-20231004-3.ndjson
3053:{"log.level":"warn","@timestamp":"2023-10-04T08:15:18.669Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed http/metrics-monitoring (HEALTHY->DEGRADED): Degraded: pid '3176' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}

logs/elastic-agent-35dbbd/elastic-agent-20231004-2.ndjson
341:{"log.level":"warn","@timestamp":"2023-10-04T07:59:45.090Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed system/metrics-default (HEALTHY->DEGRADED): Degraded: pid '3170' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
342:{"log.level":"warn","@timestamp":"2023-10-04T07:59:45.090Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed log-default (HEALTHY->DEGRADED): Degraded: pid '3162' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
10314:{"log.level":"warn","@timestamp":"2023-10-04T08:07:09.634Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed filestream-monitoring (HEALTHY->DEGRADED): Degraded: pid '3184' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
10315:{"log.level":"warn","@timestamp":"2023-10-04T08:07:09.635Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed log-default (HEALTHY->DEGRADED): Degraded: pid '3162' missed 1 check-in","log":{"source":"elastic-agent"},"
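The same search can be done with structured parsing instead of a plain text match, which makes it easier to count missed check-ins per component. The sketch below assumes each log line is a JSON object with the `@timestamp`, `message`, and `component.id` fields seen in the output above. The names `missed_checkins` and `scan_logs` are hypothetical helpers, and the sample entries are trimmed copies of the lines above.

```python
import json
from collections import Counter
from pathlib import Path


def missed_checkins(lines):
    """Return (timestamp, component id) for each log entry mentioning a missed check-in."""
    events = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict) or "component" not in entry:
            continue
        if "missed" in entry.get("message", ""):
            events.append((entry["@timestamp"], entry["component"]["id"]))
    return events


def scan_logs(log_dir):
    """Apply missed_checkins to every .ndjson file below log_dir."""
    events = []
    for path in sorted(Path(log_dir).rglob("*.ndjson")):
        with path.open(encoding="utf-8") as handle:
            events.extend(missed_checkins(handle))
    return events


def _entry(timestamp, component_id, pid):
    # Build a trimmed log line in the shape shown above.
    return json.dumps({
        "@timestamp": timestamp,
        "message": f"Component state changed {component_id} (HEALTHY->DEGRADED): "
                   f"Degraded: pid '{pid}' missed 1 check-in",
        "component": {"id": component_id, "state": "DEGRADED", "old_state": "HEALTHY"},
    })


SAMPLE_LINES = [
    _entry("2023-10-04T07:59:45.090Z", "system/metrics-default", 3170),
    _entry("2023-10-04T07:59:45.090Z", "log-default", 3162),
    _entry("2023-10-04T08:43:27.461Z", "system/metrics-default", 3170),
    '{"log.level":"warn","@timestamp":"2023-10-04T08:07:09.635Z","',  # truncated, skipped
]

counts = Counter(component for _, component in missed_checkins(SAMPLE_LINES))
print(counts)
```

Calling `scan_logs("logs")` from inside an extracted diagnostics directory would cover the same files as the `rg` command, assuming the logs are stored as `.ndjson` files under `logs/` as in the paths above.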

cmacknz · 2023-10-04T20:04:08Z

One thing I find a bit weird is that the missed check-in always comes just after a transition out of the configuring state, which requires the component to have checked in:

{"log.level":"info","@timestamp":"2023-10-04T07:59:45.087Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb (HEALTHY->CONFIGURING): Configuring","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-04T07:59:45.089Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-04T07:59:45.090Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed log-default (HEALTHY->DEGRADED): Degraded: pid '3162' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
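To check whether this pattern holds across a whole log file, one option is to flag every missed check-in that follows a CONFIGURING->HEALTHY unit transition of the same component within a short window. The sketch below is only a log-analysis aid under stated assumptions: it relies on the `component`, `unit`, `state`, and `old_state` fields shown in the three lines above, the one-second default window is arbitrary, and `degraded_after_configuring` is a hypothetical name. It says nothing about how the agent itself detects the degraded state.

```python
import json
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # timestamp layout used in the log lines above


def degraded_after_configuring(lines, window=timedelta(seconds=1)):
    """Return (component id, delay in ms) for each missed check-in logged within
    `window` after a CONFIGURING->HEALTHY unit transition of the same component."""
    last_configured = {}  # component id -> time of the latest CONFIGURING->HEALTHY
    hits = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict) or "component" not in entry:
            continue
        component = entry["component"]
        timestamp = datetime.strptime(entry["@timestamp"], TS_FORMAT)
        unit = entry.get("unit")
        if unit is not None:
            # Unit-level entry: remember when a unit finished configuring.
            if unit.get("old_state") == "CONFIGURING" and unit.get("state") == "HEALTHY":
                last_configured[component["id"]] = timestamp
            continue
        # Component-level entry: look for a missed check-in.
        if component.get("state") == "DEGRADED" and "missed" in entry.get("message", ""):
            previous = last_configured.get(component["id"])
            if previous is not None and timedelta(0) <= timestamp - previous <= window:
                delay_ms = (timestamp - previous) / timedelta(milliseconds=1)
                hits.append((component["id"], delay_ms))
    return hits


# Trimmed copies of the three log lines quoted above.
SAMPLE_LINES = [
    json.dumps({"@timestamp": "2023-10-04T07:59:45.087Z",
                "component": {"id": "log-default", "state": "HEALTHY"},
                "unit": {"type": "input", "state": "CONFIGURING", "old_state": "HEALTHY"}}),
    json.dumps({"@timestamp": "2023-10-04T07:59:45.089Z",
                "component": {"id": "log-default", "state": "HEALTHY"},
                "unit": {"type": "input", "state": "HEALTHY", "old_state": "CONFIGURING"}}),
    json.dumps({"@timestamp": "2023-10-04T07:59:45.090Z",
                "message": "Component state changed log-default (HEALTHY->DEGRADED): "
                           "Degraded: pid '3162' missed 1 check-in",
                "component": {"id": "log-default", "state": "DEGRADED",
                              "old_state": "HEALTHY"}}),
]

print(degraded_after_configuring(SAMPLE_LINES))
```

On the three sample lines, the DEGRADED entry is logged 1 ms after the unit returns to HEALTHY, so the component is flagged.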

cmacknz · 2023-10-04T20:05:17Z

@blakerouse does this seem like a possible race condition in how we detect the degraded state? The agent seems to be reporting inputs as having missed check-ins slightly after logging a unit state transition that would have required a check-in.

pierrehilbert · 2023-11-10T11:00:22Z

@faec It seems to be related to #3738 too or am I missing something?

faec · 2023-11-10T12:17:37Z

@pierrehilbert Yes, this looks like the same issue to me

amolnater-qasource · 2024-01-11T08:12:24Z

Hi Team,

We have revalidated this issue on the latest 8.12.0 BC5 Kibana cloud environment and found it fixed.

Observations:

Agent remains Healthy on removing and re-adding Elastic Defend integration.
The agent became unhealthy once, but we were not able to reproduce it afterwards.

Build details:
VERSION: 8.12.0 BC5
BUILD: 70053
COMMIT: db9b8921b37139cbb1e11d23f6381f655edeb72b
Artifact Link: https://staging.elastic.co/8.12.0-9f05a310/summary-8.12.0.html

Logs:
elastic-agent-diagnostics-2024-01-11T08-05-06Z-00.zip

Hence we are marking this issue as QA:Validated.

Thanks!

harshitgupta-qasource · 2024-01-25T11:07:12Z

`Bug Conversion`

Test-Case not required as this particular checkpoint is already covered in the following testcase:
- https://elastic.testrail.io/index.php?/cases/view/28390

Thanks!

amolnater-qasource added the bug, Team:Elastic-Agent-Control-Plane, and impact:high labels Oct 4, 2023

amolnater-qasource mentioned this issue Oct 17, 2023

Agent gets unhealthy on assigning from policy with Elastic Defend integration to without Defend integration. #3617

Closed

faec mentioned this issue Nov 15, 2023

Rework runtime manager updates to block the coordinator less #3747

Merged

7 tasks

faec closed this as completed in #3747 Nov 16, 2023

amolnater-qasource added the QA:Ready For Testing label Nov 19, 2023

amolnater-qasource added the QA:Validated label and removed the QA:Ready For Testing label Jan 11, 2024
