
Metric refactor - 2x perf and allocation free #1989

Merged

Conversation

cijothomas
Member

@cijothomas cijothomas commented Aug 6, 2024

Part 2 of #1740, addressing items 1, 2, and 3. Also closes #1954 for SUM.

Refactors the SUM aggregation logic to operate entirely on the user-provided slice of KeyValues. In the hot path, we avoid allocating, copying, sorting, and de-duplicating, leading to 2x or more gains (with 4 attributes; more attributes yield much larger gains), as long as users do not keep changing the order of attributes.

This is mostly based on the prototypes done outside this repo in https://github.com/cijothomas/metrics-mini/tree/main/metrics
Once the approach is settled, we can replicate it across all aggregations. (This PR only fixes SUM.)
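The zero-allocation lookup rests on a standard-library property: a `HashMap` keyed by `Vec<T>` can be queried with a borrowed `&[T]`, because `Vec<T>: Borrow<[T]>`. A minimal sketch of just that trick (hypothetical key/value strings, not the PR's actual types):

```rust
use std::collections::HashMap;

// Demonstrates the zero-allocation lookup: a HashMap keyed by Vec<String>
// can be queried with a borrowed &[String] (Vec<T>: Borrow<[T]>), so the
// hot path never clones or re-allocates the caller's attribute slice.
pub fn lookup_without_alloc() -> u64 {
    let mut values: HashMap<Vec<String>, u64> = HashMap::new();
    values.insert(vec!["k1=v1".to_string(), "k2=v2".to_string()], 0);

    let incoming = [String::from("k1=v1"), String::from("k2=v2")];
    // Fast path: look the slice up directly; no Vec is built on a hit.
    if let Some(v) = values.get_mut(&incoming[..]) {
        *v += 5;
    }
    *values.get(&incoming[..]).unwrap()
}

fn main() {
    assert_eq!(lookup_without_alloc(), 5);
}
```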

Benchmarks:

Counter_Add_Sorted      time:   [192.64 ns 193.03 ns 193.48 ns]
                        change: [-66.328% -66.034% -65.769%] (p = 0.00 < 0.05)
                        Performance has improved.
Counter_Add_Unsorted    time:   [209.09 ns 209.41 ns 209.74 ns]
                        change: [-63.198% -62.905% -62.670%] (p = 0.00 < 0.05)
                        Performance has improved.
Counter_Overflow        time:   [895.61 ns 898.58 ns 902.11 ns]
                        change: [+54.000% +58.218% +61.362%] (p = 0.00 < 0.05)
                        Performance has regressed.

The stress test also shows small gains (17M/sec to 20M/sec). #1833 fixed the major contention, so this PR has limited impact on the stress test, but raw speed is significantly improved.

Stress Test with memory stats:

Main:
Throughput: 17,842,800 iterations/sec
Memory usage: 5.06 MB
CPU usage: 98.61875%
Virtual memory usage: 2249.73 MB

PR:
Throughput: 18,869,600 iterations/sec
Memory usage: 4.23 MB
CPU usage: 99.2016%
Virtual memory usage: 2249.55 MB

Memory usage is also visibly reduced, from ~5 MB to ~4.2 MB.

Note: overflow performance regressed, and that is acceptable. The main goal of the overflow feature is to protect the SDK from consuming unlimited memory, so a slower path here is fine. Overflow is a rare case and indicates that the user needs to adjust cardinality.

impl<T: Number<T>> ValueMap<T> {
-    fn measure(&self, measurement: T, attrs: AttributeSet) {
+    fn measure(&self, measurement: T, attrs: &[KeyValue]) {
Member Author


This is the key: we operate on the incoming ref/slice itself without allocating, copying, sorting, or de-duplicating. If the lookup fails, the "slow" path does everything as before. The net effect is that users who use the metrics API normally (i.e., without duplicate keys and without changing the order of attributes) get the highest performance and zero heap allocations.
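A rough illustration of the fast/slow split (hypothetical `Attr` and `Sum` types, not the PR's actual code; the real change also inserts the caller's original ordering so later unsorted calls hit the fast path, which is omitted here for brevity):

```rust
use std::collections::HashMap;

// Hypothetical stand-in for opentelemetry::KeyValue: (key, value).
#[derive(Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct Attr(pub &'static str, pub &'static str);

#[derive(Default)]
pub struct Sum {
    values: HashMap<Vec<Attr>, u64>,
}

impl Sum {
    pub fn measure(&mut self, delta: u64, attrs: &[Attr]) {
        // Fast path: the incoming slice is looked up as-is (Vec<Attr> keys
        // can be queried by &[Attr]); nothing is allocated, copied, sorted,
        // or de-duplicated on a hit.
        if let Some(v) = self.values.get_mut(attrs) {
            *v += delta;
            return;
        }
        // Slow path: normalize (sort by key, drop duplicate keys) so any
        // ordering of the same attributes sums into one entry.
        let mut sorted = attrs.to_vec();
        sorted.sort();
        sorted.dedup_by(|a, b| a.0 == b.0);
        *self.values.entry(sorted).or_insert(0) += delta;
    }

    pub fn value(&self, attrs: &[Attr]) -> u64 {
        let mut sorted = attrs.to_vec();
        sorted.sort();
        sorted.dedup_by(|a, b| a.0 == b.0);
        *self.values.get(&sorted[..]).unwrap_or(&0)
    }
}

fn main() {
    let mut sum = Sum::default();
    sum.measure(3, &[Attr("b", "2"), Attr("a", "1")]); // slow path: new series
    sum.measure(4, &[Attr("a", "1"), Attr("b", "2")]); // fast path: key exists
    assert_eq!(sum.value(&[Attr("b", "2"), Attr("a", "1")]), 7);
}
```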

has_no_value_attribute_value: AtomicBool,
no_attribute_value: T::AtomicTracker,
count: AtomicUsize,
Member Author


count is tracked separately because the number of entries in the HashMap is no longer an accurate series count: we insert both the original and the sorted attribute orderings into the hashmap.
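A minimal sketch of why `map.len()` overcounts here (hypothetical types; the real code uses `T::AtomicTracker` and `RwLock`): one logical series stored under two key orderings leaves the map with two entries, so a separate counter is bumped once per new series.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;

// One logical series, inserted under both the caller's ordering and the
// sorted ordering: map.len() says 2, the series counter says 1.
pub fn series_count() -> (usize, usize) {
    let mut values: HashMap<Vec<&'static str>, Arc<AtomicU64>> = HashMap::new();
    let count = AtomicUsize::new(0);

    let tracker = Arc::new(AtomicU64::new(0));
    values.insert(vec!["b", "a"], Arc::clone(&tracker)); // original order
    values.insert(vec!["a", "b"], tracker); // sorted order, same tracker
    count.fetch_add(1, Ordering::Relaxed); // bumped once, not per key

    (values.len(), count.load(Ordering::Relaxed))
}

fn main() {
    assert_eq!(series_count(), (2, 1));
}
```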


/// The storage for sums.
struct ValueMap<T: Number<T>> {
-    values: RwLock<HashMap<AttributeSet, T::AtomicTracker>>,
+    values: RwLock<HashMap<Vec<KeyValue>, Arc<T::AtomicTracker>>>,
Member Author

@cijothomas cijothomas Aug 6, 2024


The metric value is Arc-shared to ensure that attributes provided in any order contribute to the same timeseries. Arc pointer equality is also leveraged at export time to de-dup entries.
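The sharing scheme can be sketched as follows (hypothetical simplified keys, not the PR's actual types): both orderings of the same attribute set are inserted as keys pointing at one `Arc`'d atomic, so an update through either key lands on the same timeseries.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Both the caller's original ordering and the sorted ordering key the
// SAME Arc'd tracker, so either ordering updates one timeseries.
pub fn shared_tracker_demo() -> u64 {
    let mut values: HashMap<Vec<&'static str>, Arc<AtomicU64>> = HashMap::new();

    let original = vec!["k2=v2", "k1=v1"]; // caller's order
    let mut sorted = original.clone();
    sorted.sort();

    let tracker = Arc::new(AtomicU64::new(0));
    values.insert(original.clone(), Arc::clone(&tracker));
    values.insert(sorted.clone(), tracker);

    // Updates through either key hit the same atomic.
    values[&original].fetch_add(3, Ordering::Relaxed);
    values[&sorted].fetch_add(4, Ordering::Relaxed);
    values[&original].load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(shared_tracker_demo(), 7);
}
```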


codecov bot commented Aug 6, 2024

Codecov Report

Attention: Patch coverage is 95.34884% with 6 lines in your changes missing coverage. Please review.

Project coverage is 75.1%. Comparing base (76b40ba) to head (246ee17).

Files Patch % Lines
opentelemetry-sdk/src/metrics/internal/sum.rs 92.0% 6 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main   #1989   +/-   ##
=====================================
  Coverage   75.0%   75.1%           
=====================================
  Files        122     122           
  Lines      20562   20633   +71     
=====================================
+ Hits       15430   15497   +67     
- Misses      5132    5136    +4     


value: value.get_value(),
exemplars: vec![],
});
if seen.insert(Arc::as_ptr(&value)) {
Member Author

@cijothomas cijothomas Aug 6, 2024


De-dups the sorted+original attribute entries!
@utpilla I leveraged the Arc pointer to de-dup. Storing metric points in a separate array was tried as well, but this approach looks easier to me.
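The export-side de-dup can be sketched like this (hypothetical simplified keys): because both orderings share one `Arc`, a `HashSet` of raw `Arc` pointers is enough to export each tracker exactly once.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Two map entries (original + sorted ordering) share one tracker; at
// export, Arc::as_ptr is identical for clones of the same Arc, so the
// HashSet insert fails for the duplicate and it is skipped.
pub fn export_once() -> Vec<u64> {
    let mut values: HashMap<Vec<&'static str>, Arc<AtomicU64>> = HashMap::new();
    let tracker = Arc::new(AtomicU64::new(42));
    values.insert(vec!["b", "a"], Arc::clone(&tracker)); // original order
    values.insert(vec!["a", "b"], tracker); // sorted order, same tracker

    let mut seen: HashSet<*const AtomicU64> = HashSet::new();
    let mut points = Vec::new();
    for value in values.values() {
        if seen.insert(Arc::as_ptr(value)) {
            points.push(value.load(Ordering::Relaxed));
        }
    }
    points
}

fn main() {
    assert_eq!(export_once(), vec![42]);
}
```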

@cijothomas cijothomas marked this pull request as ready for review August 6, 2024 22:11
@cijothomas cijothomas requested a review from a team August 6, 2024 22:11
@@ -925,6 +925,62 @@ mod tests {
assert_eq!(data_point1.value, 6);
}

#[tokio::test(flavor = "multi_thread", worker_threads = 1)]
async fn counter_aggregation_attribute_ordering() {
Member Author


I added this to investigate an issue that turned out to be a false alarm. Keeping the test for now; it is not that useful as-is, but it can be augmented in subsequent PRs to make sure we cover all scenarios via testing.

-let n = values.len() + 1;
+// Max number of data points need to account for the special casing
+// of the no attribute value + overflow attribute.
+let n = self.value_map.count.load(Ordering::SeqCst) + 2;
Contributor


I'm not sure whether this calculation matters for the correctness of the export. If it does, we should do it under the self.value_map.values.write() lock. Otherwise, we read a value here, and before we reserve capacity for the data points, some update thread(s) could enter the write lock and increment the count further.

Member Author


It does not look important for correctness. Even if it is off by a few, it just results in the vec resizing in the middle of collect.
We can revisit this.

Contributor

@utpilla utpilla left a comment


Great work!


Successfully merging this pull request may close these issues.

OTel Metrics with no heap allocation