
chore(datadog_metrics sink): Re-enable v2 endpoint#24842

Draft
vladimir-dd wants to merge 6 commits into master from vladimir-dd/metrics-v2

Conversation

@vladimir-dd vladimir-dd commented Mar 4, 2026

This PR switches the Datadog metrics sink to use the v2 series endpoint by default, caps the batcher to fix a memory regression, and adds a regression test.

Memory: v2's smaller payload limit (5 MiB vs 60 MiB for v1) caused a +57–59% RSS regression at 50–100k events/s. Fixed by capping the batcher to the endpoint's payload limit — memory drops to near-parity with v1. Full benchmark results: #24874.

Correctness: End-to-end validated against the real Datadog API (#24879) — all metric types (counter, gauge, set, distribution, aggregated histogram, aggregated summary) pass for both v1 and v2. 36/36 checks pass. v1 and v2 produce identical aggregated values.

@vladimir-dd vladimir-dd requested a review from a team as a code owner March 4, 2026 12:39
@github-actions github-actions bot added the domain: sinks Anything related to the Vector's sinks label Mar 4, 2026
@vladimir-dd vladimir-dd changed the title Switch to Datadog Metrics v2 endpoint chore(datadog_metrics sink): Re-enable v2 endpoint Mar 4, 2026
@vladimir-dd vladimir-dd marked this pull request as draft March 4, 2026 13:03
@pront (Member) left a comment


Leaving some early feedback, this looks good overall.

It would be interesting to run some stress tests and compare against v1.

There's an existing regression experiment here:

https://github.com/search?q=repo%3Avectordotdev%2Fvector+path%3A%2F%5Eregression%5C%2F%2F+datadog_metrics&type=code


datadog-vectordotdev bot commented Mar 4, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🔗 Commit SHA: 55877a4

@vladimir-dd (Contributor, Author):

> Leaving some early feedback, this looks good overall.
>
> It would be interesting to run some stress tests and compare against v1.
>
> There's an existing regression experiment here:
>
> https://github.com/search?q=repo%3Avectordotdev%2Fvector+path%3A%2F%5Eregression%5C%2F%2F+datadog_metrics&type=code

thanks, I was about to ask for some guidance on performance testing here 🙏

@vladimir-dd vladimir-dd force-pushed the vladimir-dd/metrics-v2 branch 2 times, most recently from 8148cb3 to 9bb9b2b on March 5, 2026 17:26
Add regression test to validate datadog_metrics sink v2 endpoint
performance under realistic high-throughput DogStatsD load.

Test Configuration:
- Load: Default lading dogstatsd settings (realistic ~2KB messages)
- Throughput: 500 Mb/s → ~250k events/sec
- Batch: Default settings (100k max_events, 2s timeout)
- Validates batch splitting when payloads exceed v2 size limits

This test ensures v2 endpoint correctly handles batch splitting
with realistic high-cardinality DogStatsD metrics under load.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
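A rough back-of-the-envelope for the load figures in the commit message above. The per-metric wire size is my assumption (not stated in the commit): a ~2 KB DogStatsD packet carrying several metric lines averaging roughly 250 bytes each makes 500 Mb/s come out to ~250k events/sec.

```python
# 500 Mb/s of DogStatsD traffic expressed as events/sec, assuming an
# average of ~250 bytes per metric line (several lines per ~2 KB packet).
throughput_bytes_per_s = 500e6 / 8        # 500 Mb/s -> 62.5 MB/s
avg_metric_line_bytes = 250               # assumed, not from the commit
events_per_s = throughput_bytes_per_s / avg_metric_line_bytes  # ~250k/s
```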
@vladimir-dd vladimir-dd force-pushed the vladimir-dd/metrics-v2 branch from 9bb9b2b to 0cb05d2 on March 6, 2026 06:40
Different series endpoints have different uncompressed payload limits (v2
is 12x smaller than v1). This ensures each batch fits in a single HTTP
request without splitting, reducing memory overhead.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pront
Copy link
Member

pront commented Mar 6, 2026

Note: I will unsubscribe from this PR until it's marked "ready for review". When it is ready, we will prioritize reviewing it over other PRs.

vladimir-dd and others added 3 commits March 9, 2026 14:46
Adds two scripts for benchmarking the Datadog metrics sink v1 vs v2
endpoint performance:

- scripts/generate_statsd_load.py: StatsD UDP load generator with
  configurable rate, metric count, tag cardinality, and high-cardinality
  tag support.

- scripts/benchmark_dd_metrics_v1_v2.py: Orchestrates back-to-back v1/v2
  runs, collects throughput/CPU/RSS/HTTP-req metrics from Vector's
  Prometheus endpoint, and prints a comparison summary. Supports
  --batch-max-bytes-v1 / --batch-max-bytes-v2 to test with explicit
  per-endpoint byte limits matching the API uncompressed payload limits
  (v1: 60 MiB, v2: 5 MiB).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
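The comparison summary the benchmark script prints reduces to a signed percent delta between the v1 and v2 runs for each collected metric. An illustrative sketch of that calculation (not the actual script's code):

```python
def pct_delta(v1_value: float, v2_value: float) -> float:
    """Signed percent change of the v2 run relative to the v1 run.

    Positive means v2 used more (e.g. RSS regression), negative means less.
    """
    return (v2_value - v1_value) / v1_value * 100.0
```

For example, the pre-fix RSS numbers reported below (3019 MB for v1, 5143 MB for v2) give a delta of about +70%.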
Results from benchmarking the per-endpoint batch size fix against the
pre-fix binary at 50k/s and 200k/s with 10 metrics and ~46 tags per
event (~932 bytes statsd wire size).

Key findings:
- Pre-fix, no limits: v2 uses 70% more RSS than v1 (5143 MB vs 3019 MB)
- Post-fix, no limits: v2 uses 23% more RSS (the fix auto-applies the 5 MiB cap)
- Post-fix, 200k/s: v2 uses 7% less RSS and 20% less CPU than v1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… config

- Remove references to DatadogMetricsCompression and request_compression
  from encoder.rs tests (those symbols don't exist in current codebase;
  they belong to an unmerged compression-options branch)
- Fix batcher_user_max_bytes_is_preserved test to avoid struct update
  syntax with private PhantomData fields in BatchConfig

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

## Setup

- **Branch**: `vladimir-dd/metrics-v2` (includes per-endpoint batch size fix)

Check failure: Code scanning / check-spelling — Unrecognized Spelling Error: `vladimir` is not a recognized word. (unrecognized-spelling)