
[extension/opampagent] use status subscription for fine granular health reporting #35892

Conversation

@bacherfl (Contributor) opened this pull request:

Description

This PR extends the OpAMP agent extension to use the recently introduced pkg/status package: it subscribes to health updates, converts them, and sends them to the OpAMP server.

Link to tracking issue

Fixes #35856

Testing

Added unit tests

@bacherfl marked this pull request as ready for review October 21, 2024 13:39
@bacherfl requested a review from a team as a code owner October 21, 2024 13:39
@mwear (Member) left a comment:

Thanks for working on this @bacherfl. I have some bigger-picture observations and suggestions before getting into the details of the current implementation.

The status.Aggregator needs to consume status events from the component status reporting system. This is done by implementing the optional componentstatus.Watcher interface. I would expect this to be very similar to the healthcheckv2extension. See:

```go
// ComponentStatusChanged implements the extension.StatusWatcher interface.
func (hc *healthCheckExtension) ComponentStatusChanged(
	source *componentstatus.InstanceID,
	event *componentstatus.Event,
) {
	// There can be late arriving events after shutdown. We need to close
	// the event channel so that this function doesn't block and we release all
	// goroutines, but attempting to write to a closed channel will panic; log
	// and recover.
	defer func() {
		if r := recover(); r != nil {
			hc.telemetry.Logger.Info(
				"discarding event received after shutdown",
				zap.Any("source", source),
				zap.Any("event", event),
			)
		}
	}()
	hc.eventCh <- &eventSourcePair{source: source, event: event}
}
```

and

```go
func (hc *healthCheckExtension) eventLoop(ctx context.Context) {
	// Record events with component.StatusStarting, but queue other events until
	// PipelineWatcher.Ready is called. This prevents aggregate statuses from
	// flapping between StatusStarting and StatusOK as components are started
	// individually by the service.
	var eventQueue []*eventSourcePair
	for loop := true; loop; {
		select {
		case esp, ok := <-hc.eventCh:
			if !ok {
				return
			}
			if esp.event.Status() != componentstatus.StatusStarting {
				eventQueue = append(eventQueue, esp)
				continue
			}
			hc.aggregator.RecordStatus(esp.source, esp.event)
		case <-hc.readyCh:
			for _, esp := range eventQueue {
				hc.aggregator.RecordStatus(esp.source, esp.event)
			}
			eventQueue = nil
			loop = false
		case <-ctx.Done():
			return
		}
	}
	// After PipelineWatcher.Ready, record statuses as they are received.
	for {
		select {
		case esp, ok := <-hc.eventCh:
			if !ok {
				return
			}
			hc.aggregator.RecordStatus(esp.source, esp.event)
		case <-ctx.Done():
			return
		}
	}
}
```

Note the comment below eventLoop for a description as to why the code is structured as it is. Those details may or may not be relevant for the OpAMP extension.

Components can update their status at any time while the collector is running. You'll need to decide if you want to send an updated component health message for each status update, or if you want to periodically retrieve the overall status and send an updated component health message if it's changed. If you want to do the latter, you can use the timestamp on the overall status to see if anything has changed.
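To illustrate the second option, a periodic poll could look roughly like the sketch below. The field and helper names (aggregator, sendComponentHealth, healthLoop) and the exact pkg/status call signatures are assumptions for illustration, not this PR's actual code:

```go
// Hypothetical sketch of the periodic approach; names and exact
// pkg/status signatures are assumptions.
func (o *opampAgent) healthLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var lastReported time.Time
	for {
		select {
		case <-ticker.C:
			st, ok := o.aggregator.AggregateStatus(status.ScopeAll, status.Verbose)
			if !ok {
				continue
			}
			// Use the timestamp on the overall status to detect changes:
			// only send a new ComponentHealth message when it has advanced.
			if st.Timestamp().After(lastReported) {
				o.sendComponentHealth(st)
				lastReported = st.Timestamp()
			}
		case <-ctx.Done():
			return
		}
	}
}
```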

If you haven't seen it, there is a document that describes the status reporting system.

Both status.AggregateStatus and the ComponentHealth message are recursively defined. The existing conversion code assumes they will be two levels deep; it could be future-proofed by accounting for arbitrary levels of nesting, along the lines of the sketch below.
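As a hedged sketch (assuming the opamp-go protobufs.ComponentHealth fields, and a status-to-Healthy mapping that may differ from the final code), a depth-independent conversion could recurse like this:

```go
// Sketch: convert a status.AggregateStatus tree of arbitrary depth into an
// OpAMP ComponentHealth message. The status -> Healthy mapping is an
// illustrative assumption.
func convertToComponentHealth(st *status.AggregateStatus) *protobufs.ComponentHealth {
	ch := &protobufs.ComponentHealth{
		Healthy:            st.Status() == componentstatus.StatusOK,
		Status:             st.Status().String(),
		StatusTimeUnixNano: uint64(st.Timestamp().UnixNano()),
	}
	if err := st.Err(); err != nil {
		ch.LastError = err.Error()
	}
	if len(st.ComponentStatusMap) > 0 {
		ch.ComponentHealthMap = make(map[string]*protobufs.ComponentHealth, len(st.ComponentStatusMap))
	}
	for name, child := range st.ComponentStatusMap {
		// Recurse instead of assuming a fixed two-level hierarchy.
		ch.ComponentHealthMap[name] = convertToComponentHealth(child)
	}
	return ch
}
```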

The last thing I wanted to mention is how the healthcheckv2extension handles the Healthy flag. I'm not sure if any of this applies to the OpAMP extension, but I figured I should mention it while we're on the topic. The relationship between component status and whether or not a component is Healthy is somewhat subjective, so the healthcheckv2extension has some configuration options to control how this works: include_permanent_errors, include_recoverable_errors, and recovery_duration. If include_permanent_errors is set, a permanent error will be considered unhealthy. include_recoverable_errors works similarly, but there is an additional recovery_duration setting: a recoverable error will be considered healthy until the recovery duration has elapsed. I think it's probably safe to ignore this for now, but I wanted to bring it up.
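Purely as an illustration of that behavior (not the healthcheckv2extension's actual code), the decision could be expressed as:

```go
// Illustrative only: how the three options can map a status event to a
// Healthy flag. includePermanent, includeRecoverable, and recoveryDuration
// mirror the configuration options described above.
func isHealthy(ev *componentstatus.Event, includePermanent, includeRecoverable bool, recoveryDuration time.Duration) bool {
	switch ev.Status() {
	case componentstatus.StatusPermanentError:
		// Only considered unhealthy when include_permanent_errors is set.
		return !includePermanent
	case componentstatus.StatusRecoverableError:
		if !includeRecoverable {
			return true
		}
		// Considered healthy until the recovery duration has elapsed.
		return time.Since(ev.Timestamp()) < recoveryDuration
	default:
		return true
	}
}
```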

@bacherfl (Contributor, PR author) replied:

Thanks for the details @mwear! I was assuming that the Aggregator gathers the health information on its own and that the OpAMP agent just needs to subscribe to the updates via the AggregateStatus messages. I will adapt the PR to implement the ComponentStatusChanged method and keep the implementation close to how it's done in the health check extension.

Also, good point about the nested ComponentHealth messages; I'll adapt the conversion to handle arbitrary levels of nesting.

I'll change this PR to a draft for now and keep you updated once it's ready for another review.
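For reference, wiring up such a subscription (which is what the PR title describes) might look roughly like the following; the Subscribe signature and the convertToComponentHealth helper are assumptions, not confirmed API:

```go
// Hedged sketch: forward each aggregated status update to the OpAMP server.
// The Subscribe signature and helper names are assumptions.
ch := agg.Subscribe(status.ScopeAll, status.Verbose)
go func() {
	for st := range ch {
		if err := opampClient.SetHealth(convertToComponentHealth(st)); err != nil {
			logger.Error("failed to report health", zap.Error(err))
		}
	}
}()
```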

@bacherfl marked this pull request as draft October 25, 2024 05:35
@bacherfl (Contributor, PR author) commented:

@mwear I have adapted the implementation now. When you have time, feel free to take a look and see whether this is going in the right direction.

@mwear (Member) left a comment:

LGTM. Thanks @bacherfl!

@evan-bradley (Contributor) left a comment:

Sorry for the delay. This mostly looks good to me, just a few minor suggestions.


On these struct fields in opamp_agent.go:

```go
	statusAggregator     statusAggregator
	statusSubscriptionWg *sync.WaitGroup
	componentHealthWg    *sync.WaitGroup
```
@evan-bradley (Contributor):

From what I can tell, we never call Wait on this, so I think we can remove it.

@bacherfl (Contributor, PR author) replied:

Good catch: it seems the Wait call was missing in the Shutdown method. I've added it now to make sure the component health routine exits properly when we receive the shutdown signal.
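Roughly, the shutdown sequence then becomes something like this sketch; only statusSubscriptionWg and componentHealthWg come from the diff above, and the cancel function and remaining cleanup are hypothetical:

```go
// Hypothetical sketch; only the two WaitGroups are taken from the diff above.
func (o *opampAgent) Shutdown(ctx context.Context) error {
	o.cancelFn() // signal the status goroutines to stop (hypothetical field)
	o.statusSubscriptionWg.Wait()
	o.componentHealthWg.Wait() // the previously missing Wait
	// ... stop the OpAMP client and release remaining resources ...
	return nil
}
```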

@bacherfl (Contributor, PR author):

FYI, I had to increase some of the timeouts in the supervisor's e2e tests, as the agent now needs a bit more time to shut down. I will re-trigger the tests a couple of times to make sure they pass consistently now.

@bacherfl (Contributor, PR author):

Alright, it seems the tests are now passing consistently (the previous run failed due to a temporary issue with the govuln check, which is unrelated to the changes in this PR).

On this hunk in opamp_agent.go:

```go
@@ -198,6 +215,27 @@ func (o *opampAgent) NotReady() error {
	return nil
}

// ComponentStatusChanged implements the componentstatus.Watcher interface.
```
@evan-bradley (Contributor):

Can you add an interface implementation assertion for componentstatus.Watcher above?
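That assertion would follow the same pattern already used for the other interfaces, e.g.:

```go
var _ componentstatus.Watcher = (*opampAgent)(nil)
```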

On this hunk in opamp_agent_test.go:

```go
@@ -35,6 +44,7 @@ func TestNewOpampAgent(t *testing.T) {
	assert.True(t, o.capabilities.ReportsHealth)
	assert.Empty(t, o.effectiveConfig)
	assert.Nil(t, o.agentDescription)
	assert.NoError(t, o.Shutdown(context.TODO()))
```
@evan-bradley (Contributor):

Should we use context.Background here instead, or are we planning to pass an actual context object?
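That is, the suggestion would read:

```go
assert.NoError(t, o.Shutdown(context.Background()))
```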

On these declarations in opamp_agent.go:

```go
)

var _ extensioncapabilities.PipelineWatcher = (*opampAgent)(nil)
var (
```
@evan-bradley (Contributor):

Could you move these to the definitions on line 88?

@evan-bradley (Contributor) left a comment:

Thanks @bacherfl! Also, thank you @mwear for the in-depth review; your approval makes me feel more confident that we're implementing this correctly.

@evan-bradley merged commit d29d065 into open-telemetry:main Dec 6, 2024
158 checks passed
@github-actions bot added this to the next release milestone Dec 6, 2024
ZenoCC-Peng pushed a commit to ZenoCC-Peng/opentelemetry-collector-contrib that referenced this pull request Dec 6, 2024
codeboten pushed a commit that referenced this pull request Dec 11, 2024
The referenced commit (#36754):

Description

The changes introduced in #35892 seem to have introduced some flakiness in the opampsupervisor e2e tests, as the shutdown of the OpAMP agent waits for the component health loop to end. Due to an unclosed channel within the OpAMP agent, however, the agent does not shut down properly, and the supervisor runs into a timeout before ultimately sending a SIGKILL to the agent process. Closing the channel in the Shutdown method of the OpAMP extension fixes that, and the agent shuts down properly upon receiving the SIGINT signal.

Link to tracking issue

Fixes #36764

Testing

This fixes the failing test mentioned in the issue (#36764).
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this pull request Dec 17, 2024
sbylica-splunk pushed a commit to sbylica-splunk/opentelemetry-collector-contrib that referenced this pull request Dec 17, 2024
Successfully merging this pull request may close these issues:

[extension/opamp] Use component status reporting