Concurrent fetch of azure metricdefinitions and batchApi usage #41790

@MichaelKatsoulis MichaelKatsoulis commented Nov 26, 2024

The changes affect the azure monitor metricset and the metricsets built on it. The affected metricsets are:

  • monitor
  • container_registry
  • container_instance
  • container_service
  • compute_vm
  • compute_vm_scaleset
  • database_account
  • storage_account

A new boolean configuration parameter, enable_batch_api, is introduced.
If set to false (the default), nothing changes in the way metrics are collected for these metricsets.

If set to true:

  • The metric definitions of resources are collected asynchronously and the results are written to a channel.
  • The channel is read until 50 definitions (the batch API limit) have been collected.
  • The collected metric definitions are grouped based on the criteria listed under (1), and the Azure batch API is used to retrieve
    metrics of multiple resources with one API call.
  1. The grouping criteria are:
  • Namespace
  • SubscriptionID
  • Location
  • Names
  • TimeGrain
  • Dimensions
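
The flow described above can be sketched in Go roughly as follows. This is a minimal illustration, not the PR's implementation: the names `groupKey`, `definition`, `collect`, `batch`, and `joinSorted` are hypothetical, the metric-definition fetch is simulated, and the flush callback stands in for the real batch API call. It assumes that a group is flushed as soon as it holds 50 resources, and that leftover groups are flushed when the channel closes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// batchLimit mirrors the 50-resource batch API limit from the description.
const batchLimit = 50

// groupKey is a hypothetical comparable key built from the six grouping criteria.
type groupKey struct {
	Namespace, SubscriptionID, Location string
	Names                               string // sorted, comma-joined metric names
	TimeGrain                           string
	Dimensions                          string // sorted, comma-joined dimensions
}

// definition is a hypothetical, simplified metric definition for one resource.
type definition struct {
	ResourceID string
	Key        groupKey
}

// joinSorted makes a list usable inside a map key regardless of input order.
func joinSorted(values []string) string {
	sorted := append([]string(nil), values...)
	sort.Strings(sorted)
	return strings.Join(sorted, ",")
}

// collect fetches definitions concurrently (one goroutine per resource) and
// writes each result to the returned channel, closing it when all are done.
func collect(ids []string, fetch func(id string) definition) <-chan definition {
	out := make(chan definition)
	var wg sync.WaitGroup
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			out <- fetch(id)
		}(id)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

// batch reads the channel on a single goroutine, groups resource IDs by key,
// and flushes a group when it reaches batchLimit or when the channel closes.
func batch(in <-chan definition, flush func(key groupKey, ids []string)) {
	groups := make(map[groupKey][]string)
	for def := range in {
		groups[def.Key] = append(groups[def.Key], def.ResourceID)
		if len(groups[def.Key]) == batchLimit {
			flush(def.Key, groups[def.Key])
			delete(groups, def.Key)
		}
	}
	for key, ids := range groups {
		flush(key, ids)
	}
}

func main() {
	ids := make([]string, 120)
	for i := range ids {
		ids[i] = fmt.Sprintf("vm-%03d", i)
	}
	key := groupKey{
		Namespace: "Microsoft.Compute/virtualMachines",
		Location:  "westeurope",
		Names:     joinSorted([]string{"Percentage CPU", "Network In"}),
		TimeGrain: "PT5M",
	}
	fetch := func(id string) definition { return definition{ResourceID: id, Key: key} }
	batch(collect(ids, fetch), func(_ groupKey, group []string) {
		// A real flush would issue one batch API call for the whole group.
		fmt.Printf("one call for %d resources\n", len(group))
	})
}
```

With 120 resources that all share one key, the sketch flushes three times (50, 50, and 20 resources). A production version would likely bound the number of concurrent fetches rather than start one goroutine per resource.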

Proposed commit message

  • WHAT: Introduce the enable_batch_api parameter for concurrent fetching of Azure metric definitions and collection of metric values using the batch API
  • WHY: Helps mitigate scalability problems

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.


@MichaelKatsoulis MichaelKatsoulis requested review from a team as code owners November 26, 2024 12:08
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 26, 2024
@MichaelKatsoulis MichaelKatsoulis marked this pull request as draft November 26, 2024 12:08

mergify bot commented Nov 26, 2024

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b concurrent-fetch-of-azure-metricdefinitions upstream/concurrent-fetch-of-azure-metricdefinitions
git merge upstream/main
git push upstream concurrent-fetch-of-azure-metricdefinitions


mergify bot commented Nov 26, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @MichaelKatsoulis? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit


mergify bot commented Nov 26, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 26, 2024

zmoog commented Jan 10, 2025

Microsoft.DocumentDb/databaseAccounts (1 resource)

resource type: Microsoft.DocumentDb/databaseAccounts
resource count: 1 resource
versions tested:

  • 8.17.1 (branch 8.17)
  • 9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions)

Activity:

  • I created one "Azure Cosmos DB for NoSQL", with Provisioned throughput (default settings)
  • I set up the standard Metricbeat database account module
# x-pack/metricbeat/modules.d/azure.yml
- module: azure
  metricsets:
  - database_account
  enabled: true
  period: 300s
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s
  • 8.17.1 and 9.0.0 are creating the same metrics (cardinality and values).

UPDATE: I didn't build the right version, I'm re-testing 9.0.0

8.17.1

[Screenshot: CleanShot 2025-01-10 at 13 16 51@2x]

9.0.0

  • Data collected regularly: yes

Issues

(1) Timegrain for azure.database_account.create_account.count is empty

[Screenshot: CleanShot 2025-01-10 at 15 49 18@2x]

In version 8.17.1, the timegrain for this field is PT5M.

(2) The azure.database_account.service_availability.avg (timegrain PT1H) is missing

Version 9.0.0 always collects 7 documents with PT5M, while version 8.17.1 collects 7 documents with PT5M plus 1 document with PT1H during the first iteration and again every 60 minutes.

Is 9.0.0 missing the PT1H document on the first iteration? Waiting for the next iteration to double-check.

After 75 minutes, there is still no azure.database_account.service_availability.avg field with PT1H.

[Screenshot: CleanShot 2025-01-10 at 16 30 53@2x]

UPDATE: tested by @MichaelKatsoulis

I managed to collect the azure.database_account.service_availability.avg field with PT1H using the PR code. The problem is that the API request asks for values of both the ServiceAvailability and ReplicationLatency metrics with the Average aggregation. When values for both metrics are requested, service_availability.avg is always nil. If we remove ReplicationLatency and request values only for ServiceAvailability, service_availability.avg is returned correctly. I still do not know the reason for that.


zmoog commented Jan 10, 2025

UPDATE: I built the wrong version, I'm re-testing 9.0.0 with Microsoft.DocumentDb/databaseAccounts (1 resource) and I'll update the previous comment.

My apologies for the noise.


zmoog commented Jan 10, 2025

Microsoft.KeyVault/vaults (10 resources)

resource type: Microsoft.KeyVault/vaults
resource count: 10 resources
versions tested:

  • 8.17.1 (branch 8.17)
  • 9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions)

Activity:

  • I set up a custom Metricbeat config using the Azure Monitor metricset to target the key vaults
- module: azure  
  metricsets:  
    - monitor  
  enabled: true  
  period: 60s  
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s  
  resources:  
  - resource_query: "resourceType eq 'Microsoft.KeyVault/vaults'"  
    resource_group:  
    - "mbranca-az-scalability-kv-r10"    
    metrics:  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: StatusCode  
            value: '*'  
          - name: StatusCodeClass  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiLatency  
          - Availability  
          - ServiceApiResult  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - ServiceApiHit  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M  
      - dimensions:  
          - name: ActivityType  
            value: '*'  
          - name: ActivityName  
            value: '*'  
          - name: TransactionType  
            value: '*'  
        ignore_unsupported: true  
        name:  
          - SaturationShoebox  
        namespace: Microsoft.KeyVault/vaults  
        timegrain: PT1M

Notes:

When the key vaults are unused (like in this resource group), they only generate a subset of metrics:

  • Availability
  • API Hits
  • API Results.

8.17.1

In progress.

I can see the three metrics (Availability, API Hits, API Results), grouped in two documents. So 2 documents x 10 resources = 20 documents per iteration:

[Screenshot: CleanShot 2025-01-10 at 16 35 28@2x]

9.0.0

In progress.

First iterations are okay. I get the same number of documents (20) as 8.17.1 and same values.

[Screenshot: CleanShot 2025-01-10 at 16 48 50@2x]

Still checking, but this case looks good.


zmoog commented Jan 10, 2025

@MichaelKatsoulis, I found a couple of issues related to timegrain in the Microsoft.DocumentDb/databaseAccounts (1 resource) test.


zmoog commented Jan 10, 2025

Microsoft.ContainerRegistry/registries (1 resource)

resource type: Microsoft.ContainerRegistry/registries
resource count: 1 resource
versions tested:

  • 8.17.1 (branch 8.17)
  • 9.0.0 (branch MichaelKatsoulis:concurrent-fetch-of-azure-metricdefinitions)

Activity:

  • I set up a Metricbeat config using the container_registry metricset to target the container registry
- module: azure
  metricsets:
  - container_registry
  enabled: true
  period: 300s
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s

Since we had an issue with PT1H metrics, I tried another metricset with this timegrain.

8.17.1

After one iteration, 8.17.1 collected:

  • 1 document with PT5M every 5 minutes
  • 1 document with PT1H every 60 minutes

9.0.0

After one iteration, 9.0.0 collected:

  • 1 document with PT5M every 5 minutes
  • 1 document with PT1H every 60 minutes

Conclusion

✅ With the recent code changes 8.17.1 and 9.0.0 yield the same outcome.

[Screenshot: CleanShot 2025-01-15 at 13 23 47@2x]

Metrics docs

@MichaelKatsoulis MichaelKatsoulis requested a review from zmoog April 14, 2025 07:19
@MichaelKatsoulis MichaelKatsoulis added backport-active-9 Automated backport with mergify to all the active 9.[0-9]+ branches and removed backport-8.x Automated backport to the 8.x branch with mergify labels Apr 14, 2025

@zmoog zmoog left a comment


The performance gains from the new sdk/monitor/query/azmetrics package with batch API are extremely compelling.

I would love to simplify the internal structure, but I am also okay with going with the PR as-is, collecting customer feedback, and switching to the batch API in the next release.

I added a few non-blocking comments for things we may want to address before merging.

_boolean_
Optional, by default is set to False. Set this to True when facing scalability issues. When configured, the azure batch api will be used
to fetch metrics of multiple resources in one api call.
Currently supported metricsets are monitor, container_registry, container_instance, container_service, compute_vm, compute_vm_scaleset, database_account and storage.

Can we also add storage to the list? Or should we remove the list, since the option supports all the metricsets?

Contributor Author

Isn't storage in this list?

@MichaelKatsoulis MichaelKatsoulis merged commit 13f8fde into elastic:main Apr 15, 2025
179 of 182 checks passed
MichaelKatsoulis added a commit that referenced this pull request Apr 22, 2025
…d batchApi usage (#43923)

* Concurrent fetch of azure metricdefinitions and batchApi usage (#41790)

* Use concurrency in metricsdefinition collection

* Change ResourceConfigurations.Metrics to a map

* Use batch API

* New queryResourceClient per location

* Wait for 50 resource ids before fetching the metrics

* Set timegrain if it is equal to ''

* Use batch API as feature

* Use baseclient to tackle code duplication

* Add unit tests for concurrent fetching of metric definitions

* Add batch client unit tests

* Add support of batch API for storage accounts

* Update docs and add unit tests for storage client

* Split metric names by 20

(cherry picked from commit 13f8fde)

# Conflicts:
#	go.mod
#	go.sum

* Resolve conflicts

---------

Co-authored-by: Michalis Katsoulis <michaelkatsoulis88@gmail.com>
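
The two size limits named in the commit list above (50 resource ids per batch, metric names split by 20) both come down to chunking a slice. A minimal sketch of such a helper, assuming Go 1.18+ generics; the name `chunk` is hypothetical and is not taken from the PR's code:

```go
package main

import "fmt"

// chunk splits items into consecutive slices of at most size elements.
// The returned slices share the backing array of items.
func chunk[T any](items []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	var out [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

func main() {
	names := make([]string, 45)
	for i := range names {
		names[i] = fmt.Sprintf("metric-%d", i)
	}
	// 45 metric names split by 20 yield requests of 20, 20, and 5 names.
	for i, part := range chunk(names, 20) {
		fmt.Printf("request %d: %d metric names\n", i+1, len(part))
	}
}
```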
@zmoog zmoog added backport-active-all Automated backport with mergify to all the active branches backport-8.17 Automated backport with mergify backport-8.18 Automated backport to the 8.18 branch backport-8.19 Automated backport to the 8.19 branch labels May 7, 2025
mergify bot pushed a commit that referenced this pull request May 7, 2025
mergify bot pushed a commit that referenced this pull request May 7, 2025
mergify bot pushed a commit that referenced this pull request May 7, 2025
zmoog pushed a commit that referenced this pull request May 9, 2025
zmoog pushed a commit that referenced this pull request May 15, 2025
zmoog pushed a commit that referenced this pull request May 15, 2025
zmoog pushed a commit that referenced this pull request May 19, 2025
MichaelKatsoulis added a commit that referenced this pull request May 19, 2025
…nd batchApi usage (#44243)

zmoog pushed a commit that referenced this pull request May 19, 2025
…nd batchApi usage (#44241)

zmoog pushed a commit that referenced this pull request May 19, 2025
zmoog pushed a commit that referenced this pull request May 19, 2025
…nd batchApi usage (#44242)

return fmt.Errorf("no resources were found based on all the configurations options entered")
}

metricStores := make(map[ResDefGroupingCriteria]*MetricStore)

metricStores is only used on a single goroutine, so no mutex is needed?
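
For comparison, if a store like this were ever shared across goroutines, a mutex-guarded variant could look roughly like the sketch below. The type `safeStore` and its methods are hypothetical stand-ins, not the PR's MetricStore:

```go
package main

import (
	"fmt"
	"sync"
)

// safeStore is a hypothetical stand-in for a metric store shared across goroutines.
type safeStore struct {
	mu      sync.Mutex
	metrics []string
}

// Add appends a metric while holding the lock.
func (s *safeStore) Add(m string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.metrics = append(s.metrics, m)
}

// Size returns the number of stored metrics.
func (s *safeStore) Size() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.metrics)
}

func main() {
	store := &safeStore{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			store.Add(fmt.Sprintf("metric-%d", i))
		}(i)
	}
	wg.Wait()
	fmt.Println(store.Size()) // prints 100
}
```

When only one goroutine reads the channel and touches the map of stores, as the comment assumes, the lock is unnecessary overhead.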

groupedMetrics := map[ResDefGroupingCriteria][]Metric{
	criteria: store.GetMetrics(),
}
metricValues := client.GetMetricsInBatch(groupedMetrics, referenceTime, report)

On another point: if MetricStore needed a mutex, there would be a bug here when another goroutine calls AddMetric on the store.

The bug is that a copied slice does not share later resize actions; see playground.

But for now, MetricStore is only used in a single goroutine, so it works well, right?
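
The slice behaviour this comment refers to can be reproduced with a small sketch. The `store` type below is a hypothetical simplification, not the PR's MetricStore, and `Snapshot` is an illustrative addition: a slice returned by value keeps its own length, so elements appended to the store afterwards are not visible through it.

```go
package main

import "fmt"

// store is a hypothetical, simplified stand-in for the PR's MetricStore.
type store struct {
	metrics []string
}

// AddMetric appends to the store; append may reallocate the backing array.
func (s *store) AddMetric(m string) { s.metrics = append(s.metrics, m) }

// GetMetrics returns the slice header by value: the caller sees the length
// at the time of the call, not elements appended later.
func (s *store) GetMetrics() []string { return s.metrics }

// Snapshot returns an independent copy that later writes cannot affect.
func (s *store) Snapshot() []string {
	return append([]string(nil), s.metrics...)
}

func main() {
	s := &store{}
	s.AddMetric("a")
	view := s.GetMetrics()
	s.AddMetric("b")
	fmt.Println(len(view), len(s.GetMetrics())) // prints 1 2
}
```

With a single goroutine this is harmless, because the metrics are read and sent before anything else is added. With concurrent writers it would need both a lock and a deliberate choice between a shared view and a snapshot.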

Labels
backport-8.17 Automated backport with mergify backport-8.18 Automated backport to the 8.18 branch backport-8.19 Automated backport to the 8.19 branch backport-active-9 Automated backport with mergify to all the active 9.[0-9]+ branches backport-active-all Automated backport with mergify to all the active branches Team:obs-ds-hosted-services Label for the Observability Hosted Services team
6 participants