Prometheus add nodes gauge for SQS mode #1083

Open
wants to merge 2 commits into
base: main
from

Conversation

phuhung273

@phuhung273 phuhung273 commented Oct 29, 2024

Issue #, if available:
Close #785

Description of changes:

  • Count the nodes/instances being tracked. E.g.:
    k get node returns 5
    aws ec2 describe-instances returns 2
    This identifies 3 nodes that are no longer under NTH control
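
The counting example above can be sketched in Go as follows. This is a minimal illustration, not the code from this PR: the function name countTracked, the input slices, and the instance IDs are all hypothetical stand-ins for the real Kubernetes and EC2 API results.

```go
package main

import "fmt"

// countTracked splits Kubernetes nodes into those whose EC2 instance ID
// appears in the set of NTH-tagged instances and those that do not.
// Both inputs are hypothetical stand-ins for the real API results.
func countTracked(nodeInstanceIDs []string, taggedInstanceIDs map[string]bool) (tracked, untracked int) {
	for _, id := range nodeInstanceIDs {
		if taggedInstanceIDs[id] {
			tracked++
		} else {
			untracked++
		}
	}
	return tracked, untracked
}

func main() {
	// 5 nodes in the cluster, 2 instances carrying the NTH tag.
	nodes := []string{"i-01", "i-02", "i-03", "i-04", "i-05"}
	tagged := map[string]bool{"i-01": true, "i-02": true}

	tracked, untracked := countTracked(nodes, tagged)
	fmt.Printf("tracked=%d untracked=%d\n", tracked, untracked)
}
```

With the numbers from the description (5 nodes, 2 tagged instances), the sketch reports 2 tracked and 3 untracked nodes.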

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@phuhung273 phuhung273 requested a review from a team as a code owner October 29, 2024 14:34
@phuhung273 phuhung273 marked this pull request as draft October 29, 2024 14:34
@phuhung273 phuhung273 force-pushed the prometheus-export-node branch from 6e910d0 to 91e34ac Compare October 30, 2024 14:05
@phuhung273 phuhung273 changed the title [WIP] Prometheus export nodes counter [WIP] IMDS mode: prometheus add nodes counter Nov 1, 2024
@phuhung273 phuhung273 changed the title [WIP] IMDS mode: prometheus add nodes counter [WIP] IMDS mode: prometheus add nodes gauge Nov 1, 2024
@phuhung273 phuhung273 changed the title [WIP] IMDS mode: prometheus add nodes gauge [WIP] SQS mode: prometheus add nodes gauge Nov 11, 2024
@phuhung273
Author

@stevehipwell could you please help me with this? I am now stuck only on writing unit tests for opentelemetry.go, since it doesn't have any tests.

Do you think I should refactor this file as part of this PR, or open a separate PR (e.g. make opentelemetry.go testable)?

@stevehipwell
Contributor

@phuhung273 I'm not a maintainer here but I like to see untestable code refactored to be testable, following the boy scout rule. If you do the work in this PR you can always split the refactoring to a separate PR before it's merged.

@LikithaVemulapalli what do you think?

@LikithaVemulapalli
Contributor

Yes, I agree with @stevehipwell here; let's do the refactoring in a separate PR so the changes are clear. @phuhung273, if you want to test your changes for this PR, let me know and I will approve and run the workflow, so that on future commits you can verify whether the existing tests still pass and whether there are any conflicts. Once you change the PR status to ready, I will run the workflow. We appreciate your contribution. Thanks!


github-actions bot commented Dec 4, 2024

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this PR to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Dec 4, 2024
@phuhung273
Author

/remove-lifecycle stale

@github-actions github-actions bot removed the stale Issues / PRs with no activity label Dec 6, 2024

This PR has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this PR to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Dec 22, 2024
@phuhung273 phuhung273 marked this pull request as ready for review December 26, 2024 15:13
@phuhung273 phuhung273 changed the title [WIP] SQS mode: prometheus add nodes gauge Prometheus add nodes gauge for SQS mode Dec 26, 2024
@Lu-David Lu-David removed the stale Issues / PRs with no activity label Dec 27, 2024
@tiationg-kho tiationg-kho self-assigned this Dec 27, 2024
@tiationg-kho
Contributor

Hi @phuhung273, could you resolve the conflicts in pkg/node/node_test.go?

@phuhung273 phuhung273 force-pushed the prometheus-export-node branch from b9e1617 to 95fc5a9 Compare January 1, 2025 02:25
@phuhung273 phuhung273 force-pushed the prometheus-export-node branch from 95fc5a9 to f4ae6de Compare January 1, 2025 02:32
@phuhung273
Author

Thanks and Happy new year @tiationg-kho. Conflicts resolved

Contributor

@tiationg-kho tiationg-kho left a comment

Hi @phuhung273,

Thanks for resolving the conflicts. I have left some comments.
Also, we can run the e2e tests locally (Docker Desktop) to make sure our modifications are valid.

  • Run all e2e test cases: make e2e-test
  • Run certain e2e test case: ./test/k8s-local-cluster-test/run-test -a ./test/e2e/<test-case> -d -b e2e-test

pkg/ec2helper/ec2helper.go (outdated)

for {
result, err := h.ec2ServiceClient.DescribeInstances(&ec2.DescribeInstancesInput{
Filters: []*ec2.Filter{
Contributor

Could we add a filter for instance state here?

Author

Of course we can, but I wonder if we should? I have seen cases where instances enter the Stopped state instead of Terminated. Without filtering, users can discover such cases.

Contributor

Thanks for addressing the comments.

Maybe we could rename nodesGauge and instancesGauge to make these 2 metrics more self-explanatory?

Then we could decide whether we need a filter here or not. We would also know whether we should keep these 2 metrics separate, or filter one based on the other (opentelemetry.go).
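
To make the instance-state trade-off in this thread concrete, here is a small Go sketch that compares an unfiltered count with a count restricted to given states. It is only an illustration under stated assumptions: the instance type and the countInStates helper are hypothetical and are not part of NTH or the AWS SDK.

```go
package main

import "fmt"

// instance is a hypothetical, simplified view of an EC2 instance.
type instance struct {
	ID    string
	State string // e.g. "running", "stopped", "terminated"
}

// countInStates counts instances whose state is one of the given states.
// With no states given it counts every instance (the unfiltered case).
func countInStates(instances []instance, states ...string) int {
	if len(states) == 0 {
		return len(instances)
	}
	allowed := make(map[string]bool, len(states))
	for _, s := range states {
		allowed[s] = true
	}
	n := 0
	for _, inst := range instances {
		if allowed[inst.State] {
			n++
		}
	}
	return n
}

func main() {
	instances := []instance{
		{ID: "i-01", State: "running"},
		{ID: "i-02", State: "running"},
		{ID: "i-03", State: "stopped"},
	}
	fmt.Println("unfiltered:", countInStates(instances))
	fmt.Println("running only:", countInStates(instances, "running"))
}
```

In this sketch the stopped instance shows up only in the unfiltered count, which is the visibility the author wants to preserve; a state filter hides it.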

pkg/ec2helper/ec2helper.go (outdated)
pkg/ec2helper/ec2helper.go (outdated)
pkg/node/node.go (outdated)
pkg/observability/opentelemetry.go (outdated)
}
}

func (m Metrics) serveNodeMetrics() {
Contributor

We could add nil checks for both results (instanceIdsMap, nodeInstanceIds).

We should not use instanceIdsMap in the second block if we got an error from GetInstanceIdsMapByTagKey.

Given that we filter the nodes result based on the instances result, we would never have a chance to record a case where nodes > instances. Do you think this is a potential issue?

Author

Thanks, I have added the 2 nil checks.

About the edge case, I did think of it but could not find any way to compute the 2 metrics separately:

  • For instances, we already have an NTH managed tag, so we can filter based on it
  • For nodes, I cannot see any label/annotation that we can filter on

Therefore, I decided to filter nodes based on the instances result. Do you know of any info we can use to filter nodes independently?
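
The nil checks and the node filtering discussed in this thread could be sketched roughly as below. This is a hedged illustration, not the PR's implementation: nodeMetrics, collect, and the two fetch callbacks are hypothetical names standing in for serveNodeMetrics, GetInstanceIdsMapByTagKey, and the node lookup.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeMetrics holds the two values that would be recorded as gauges.
type nodeMetrics struct {
	Instances int
	Nodes     int
}

// collect fetches the tagged instances first and only then the nodes.
// If the instance lookup fails or returns nil, it stops early so the
// instance map is never used after an error.
func collect(
	fetchInstances func() (map[string]bool, error),
	fetchNodeInstanceIDs func() ([]string, error),
) (nodeMetrics, error) {
	instanceIDs, err := fetchInstances()
	if err != nil {
		return nodeMetrics{}, fmt.Errorf("fetching instances: %w", err)
	}
	if instanceIDs == nil {
		return nodeMetrics{}, errors.New("instance lookup returned nil")
	}

	nodeIDs, err := fetchNodeInstanceIDs()
	if err != nil {
		return nodeMetrics{}, fmt.Errorf("fetching nodes: %w", err)
	}
	if nodeIDs == nil {
		return nodeMetrics{}, errors.New("node lookup returned nil")
	}

	// Nodes are counted only when their instance ID is in the tagged set.
	m := nodeMetrics{Instances: len(instanceIDs)}
	for _, id := range nodeIDs {
		if instanceIDs[id] {
			m.Nodes++
		}
	}
	return m, nil
}

func main() {
	m, err := collect(
		func() (map[string]bool, error) {
			return map[string]bool{"i-01": true, "i-02": true}, nil
		},
		func() ([]string, error) {
			return []string{"i-01", "i-03", "i-04"}, nil
		},
	)
	fmt.Println(m.Instances, m.Nodes, err)
}
```

Because nodes are counted only when they match a tagged instance, the nodes value in this sketch can never exceed the instances value (assuming each node maps to a distinct instance ID), which is the edge case raised in the review.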

cmd/node-termination-handler.go (outdated)
cmd/node-termination-handler.go (outdated)
Development

Successfully merging this pull request may close these issues.

Add metrics to show the number of nodes being tracked
5 participants