Agent gets unhealthy temporarily for a long time on removing and re-adding Elastic Defend integration. #3507

Closed
amolnater-qasource opened this issue Oct 4, 2023 · 10 comments · Fixed by #3747
Labels
bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. QA:Validated Validated by the QA Team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@amolnater-qasource

Kibana Build details:

VERSION: 8.11.0 SNAPSHOT
BUILD: 67640
COMMIT: 2174a95807a874b292fb3c821445df577ceac4e4

Host OS: Ubuntu

Preconditions:

  1. 8.11.0 SNAPSHOT Kibana cloud environment should be available.
  2. Linux .tar agent should be installed with policy having System and Elastic Defend integrations.

Steps to reproduce:

  1. Navigate to Fleet>Agents tab and observe agent is available with Elastic Defend integration.
  2. Now navigate to the policy and delete Elastic Defend integration.
  3. Wait for 2 minutes and re-add Elastic Defend integration.
  4. Observe agent gets unhealthy for a long time [sometimes up to 10 minutes].
    After some time, the agent returns to Healthy.

Note:

  • The issue is not dependent on Agent Tamper protection settings (whether enabled or disabled).
  • The issue is consistently reproducible.

Screen Recording .zip:
Agent policies - Fleet - Elastic - Google Chrome 2023-10-04 14-12-31.zip

Expected Result:
Agent should remain Healthy on removing and re-adding Elastic Defend integration.

Agent Json:
ip-172-31-57-238-agent-details.zip

Logs:
elastic-agent-diagnostics-2023-10-04T08-47-27Z-00.zip

Logs from another agent:
elastic-agent-diagnostics-2023-10-04T09-19-35Z-00.zip

@amolnater-qasource amolnater-qasource added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team impact:high Short-term priority; add to current release, or definitely next. labels Oct 4, 2023
@amolnater-qasource
Author

@manishgupta-qasource Please review.

@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member

cmacknz commented Oct 4, 2023

Looks like the system integration missed a check in:

    {
      "id": "system/metrics-default",
      "type": "system/metrics",
      "status": "DEGRADED",
      "message": "Degraded: pid '3170' missed 1 check-in",
      "units": [
        {
          "id": "system/metrics-default-system/metrics-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb",
          "type": "input",
          "status": "HEALTHY",
          "message": "Healthy"
        },
        {
          "id": "system/metrics-default",
          "type": "output",
          "status": "HEALTHY",
          "message": "Healthy"
        }
      ]
    }

while endpoint itself was starting:

    {
      "id": "endpoint-default",
      "type": "endpoint",
      "status": "STARTING",
      "message": "Starting: endpoint service runtime",
      "units": [
        {
          "id": "endpoint-default-25f03ae6-1d34-4048-ba00-0a8be899ccfd",
          "type": "input",
          "status": "STARTING",
          "message": "Starting: endpoint service runtime"
        },
        {
          "id": "endpoint-default",
          "type": "output",
          "status": "STARTING",
          "message": "Starting: endpoint service runtime"
        }
      ]
    }
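
A minimal sketch, not taken from the diagnostics tooling, of how component state shaped like the two snippets above could be scanned for anything that is not HEALTHY. It assumes the state is available as a list of objects with `id`, `status`, and `units` fields as shown; the function name `unhealthy_components` and the trimmed `sample` data are hypothetical.

```python
def unhealthy_components(components):
    """Return (kind, id, status) for every component or unit that is not HEALTHY."""
    findings = []
    for comp in components:
        if comp.get("status") != "HEALTHY":
            findings.append(("component", comp["id"], comp["status"]))
        # A component can be DEGRADED while all of its units still report HEALTHY.
        for unit in comp.get("units", []):
            if unit.get("status") != "HEALTHY":
                findings.append(("unit", unit["id"], unit["status"]))
    return findings


# Trimmed copies of the two snippets above (one unit each, for brevity).
sample = [
    {
        "id": "system/metrics-default",
        "status": "DEGRADED",
        "message": "Degraded: pid '3170' missed 1 check-in",
        "units": [
            {"id": "system/metrics-default", "type": "output", "status": "HEALTHY"},
        ],
    },
    {
        "id": "endpoint-default",
        "status": "STARTING",
        "message": "Starting: endpoint service runtime",
        "units": [
            {
                "id": "endpoint-default-25f03ae6-1d34-4048-ba00-0a8be899ccfd",
                "type": "input",
                "status": "STARTING",
            },
        ],
    },
]

for kind, ident, status in unhealthy_components(sample):
    print(f"{kind:9} {status:9} {ident}")
```

On the trimmed sample this lists the degraded system metrics component alongside the starting endpoint component and its input unit, which is the combination described in this comment.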

@cmacknz
Member

cmacknz commented Oct 4, 2023

Looks like this happened a few times to both the system metrics and log inputs:

~/Downloads/elastic-agent-diagnostics-2023-10-04T08-47-27Z-00
❯ rg missed logs
logs/elastic-agent-35dbbd/elastic-agent-20231004-5.ndjson
3220:{"log.level":"warn","@timestamp":"2023-10-04T08:43:27.461Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed system/metrics-default (HEALTHY->DEGRADED): Degraded: pid '3170' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}

logs/elastic-agent-35dbbd/elastic-agent-20231004-3.ndjson
3053:{"log.level":"warn","@timestamp":"2023-10-04T08:15:18.669Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed http/metrics-monitoring (HEALTHY->DEGRADED): Degraded: pid '3176' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}

logs/elastic-agent-35dbbd/elastic-agent-20231004-2.ndjson
341:{"log.level":"warn","@timestamp":"2023-10-04T07:59:45.090Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed system/metrics-default (HEALTHY->DEGRADED): Degraded: pid '3170' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"system/metrics-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
342:{"log.level":"warn","@timestamp":"2023-10-04T07:59:45.090Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed log-default (HEALTHY->DEGRADED): Degraded: pid '3162' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
10314:{"log.level":"warn","@timestamp":"2023-10-04T08:07:09.634Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed filestream-monitoring (HEALTHY->DEGRADED): Degraded: pid '3184' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"filestream-monitoring","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
10315:{"log.level":"warn","@timestamp":"2023-10-04T08:07:09.635Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed log-default (HEALTHY->DEGRADED): Degraded: pid '3162' missed 1 check-in","log":{"source":"elastic-agent"},"
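
As a rough sketch of how the search output above could be summarised per component, the following counts missed check-ins from lines in the same shape. It assumes each line is an optional `rg` line-number prefix followed by one JSON log event with `message` and `component.id` fields as seen above; the function name `count_missed_checkins` and the trimmed sample lines are hypothetical. Lines that do not parse (such as a truncated one) are skipped rather than guessed at.

```python
import json
from collections import Counter


def count_missed_checkins(lines):
    """Count 'missed ... check-in' events per component id."""
    counts = Counter()
    for line in lines:
        start = line.find("{")  # skip an optional "NNNN:" prefix from rg
        if start == -1:
            continue
        try:
            event = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # truncated or otherwise malformed line
        message = event.get("message", "")
        if "missed" in message and "check-in" in message:
            counts[event.get("component", {}).get("id", "unknown")] += 1
    return counts


def _line(prefix, component_id, pid):
    """Build a trimmed sample line in the same shape as the output above."""
    event = {
        "log.level": "warn",
        "message": f"Component state changed {component_id} (HEALTHY->DEGRADED): "
                   f"Degraded: pid '{pid}' missed 1 check-in",
        "component": {"id": component_id, "state": "DEGRADED", "old_state": "HEALTHY"},
    }
    return f"{prefix}:{json.dumps(event)}"


sample_lines = [
    _line(341, "system/metrics-default", 3170),
    _line(342, "log-default", 3162),
    _line(10314, "filestream-monitoring", 3184),
    '10315:{"log.level":"warn","message":"Component state changed log-default',  # truncated
]

for component_id, n in sorted(count_missed_checkins(sample_lines).items()):
    print(component_id, n)
```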

@cmacknz
Member

cmacknz commented Oct 4, 2023

One thing I find a bit weird is that the missed check-in always comes just after a transition out of the configuring state, which requires the component to have checked in:

{"log.level":"info","@timestamp":"2023-10-04T07:59:45.087Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb (HEALTHY->CONFIGURING): Configuring","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb","type":"input","state":"CONFIGURING","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-04T07:59:45.089Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":558},"message":"Unit state changed log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb (CONFIGURING->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"HEALTHY"},"unit":{"id":"log-default-logfile-system-7dd0aa20-628a-11ee-a161-d1f999cbf8bb","type":"input","state":"HEALTHY","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-10-04T07:59:45.090Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":540},"message":"Component state changed log-default (HEALTHY->DEGRADED): Degraded: pid '3162' missed 1 check-in","log":{"source":"elastic-agent"},"component":{"id":"log-default","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
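
The pattern in these three lines can be checked mechanically. Below is a minimal sketch, assuming plain ndjson events with the `@timestamp`, `component`, and `unit` fields shown above: it flags any "missed check-in" degradation that follows a CONFIGURING->HEALTHY unit transition on the same component within a short window. The function name, the one-second default window, and the trimmed sample events are assumptions made for illustration, not anything taken from the agent.

```python
import json
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # matches e.g. 2023-10-04T07:59:45.087Z


def checkins_missed_after_configuring(lines, window=timedelta(seconds=1)):
    """Return (component id, gap in ms) for suspicious missed check-ins."""
    last_configured = {}  # component id -> time of last CONFIGURING->HEALTHY
    suspicious = []
    for line in lines:
        event = json.loads(line)
        when = datetime.strptime(event["@timestamp"], TS_FORMAT)
        component = event.get("component", {})
        comp_id = component.get("id")
        unit = event.get("unit")
        if unit is not None:
            # Unit-level event: remember when a unit finished configuring.
            if unit.get("old_state") == "CONFIGURING" and unit.get("state") == "HEALTHY":
                last_configured[comp_id] = when
            continue
        if component.get("state") == "DEGRADED" and "missed" in event.get("message", ""):
            previous = last_configured.get(comp_id)
            if previous is not None and when - previous <= window:
                gap_ms = (when - previous) // timedelta(milliseconds=1)
                suspicious.append((comp_id, gap_ms))
    return suspicious


# Trimmed copies of the three log lines above.
sample_events = [json.dumps(e) for e in [
    {"@timestamp": "2023-10-04T07:59:45.087Z",
     "component": {"id": "log-default", "state": "HEALTHY"},
     "unit": {"state": "CONFIGURING", "old_state": "HEALTHY"}},
    {"@timestamp": "2023-10-04T07:59:45.089Z",
     "component": {"id": "log-default", "state": "HEALTHY"},
     "unit": {"state": "HEALTHY", "old_state": "CONFIGURING"}},
    {"@timestamp": "2023-10-04T07:59:45.090Z",
     "message": "Degraded: pid '3162' missed 1 check-in",
     "component": {"id": "log-default", "state": "DEGRADED", "old_state": "HEALTHY"}},
]]

print(checkins_missed_after_configuring(sample_events))
```

On the sample, the degradation is reported 1 ms after the unit returned to HEALTHY, which is the ordering that looks suspicious here.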

@cmacknz
Member

cmacknz commented Oct 4, 2023

@blakerouse does this seem like a possible race condition in how we detect the degraded state? The agent seems to be reporting inputs as having missed check-ins slightly after logging a unit state transition that would have required a check-in.

@pierrehilbert
Contributor

@faec It seems to be related to #3738 too or am I missing something?

@faec
Contributor

faec commented Nov 10, 2023

@pierrehilbert Yes, this looks like the same issue to me

@amolnater-qasource amolnater-qasource added the QA:Ready For Testing Code is merged and ready for QA to validate label Nov 19, 2023
@amolnater-qasource
Author

amolnater-qasource commented Jan 11, 2024

Hi Team,

We have revalidated this issue on the latest 8.12.0 BC5 Kibana cloud environment and found it fixed.

Observations:

  • Agent remains Healthy on removing and re-adding Elastic Defend integration.
  • The agent became unhealthy once, but we were not able to reproduce it afterwards.

Build details:
VERSION: 8.12.0 BC5
BUILD: 70053
COMMIT: db9b8921b37139cbb1e11d23f6381f655edeb72b
Artifact Link: https://staging.elastic.co/8.12.0-9f05a310/summary-8.12.0.html

Screenshot:
image

Logs:
elastic-agent-diagnostics-2024-01-11T08-05-06Z-00.zip

Hence we are marking this issue as QA:Validated.

Thanks!

@amolnater-qasource amolnater-qasource added QA:Validated Validated by the QA Team and removed QA:Ready For Testing Code is merged and ready for QA to validate labels Jan 11, 2024
@harshitgupta-qasource

Bug Conversion

Thanks!
