
Feature: implement long adder to optimize the counter of new metrics system #889

Closed
Tracked by #922
empiredan opened this issue Jan 20, 2022 · 2 comments
Labels
type/enhancement Indicates new feature requests

Comments


empiredan commented Jan 20, 2022

1. Motivation

In the old metrics system, strictly speaking, a counter is not defined as an independent metric type, though one can be implemented with dsn::perf_counter_number_atomic or dsn::perf_counter_volatile_number_atomic.

Using the thread ID as the hash code, dsn::perf_counter_number_atomic tries to distribute atomic increments evenly over DIVIDE_CONTAINER. Each element of DIVIDE_CONTAINER is an 8-byte std::atomic<int64_t>. The final value of the counter is obtained by summing all the elements of DIVIDE_CONTAINER.

However, threads running on different CPUs are likely to update data within the same cache line (typically 64 bytes on x86). This leads to false sharing, which degrades performance. Compared with a single std::atomic<int64_t> used as the counter, it consumes more memory while sometimes performing even worse.

Therefore, the new metrics system needs new implementations that eliminate false sharing to improve performance while, if possible, consuming less memory.

2. Implementation

2.1 concurrent_long_adder

To eliminate false sharing, we first introduce a new struct that aligns a std::atomic<int64_t> to the cache-line size, then allocate an array of these structs in an aligned region of memory. The size of this array equals the number of CPUs.

Since threads running on different CPUs are likely to be hashed to different cache lines of this array, the probability of false sharing is significantly reduced. The implementation based on this idea is called concurrent_long_adder, sketched below.
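
A minimal sketch of this layout, assuming a 64-byte cache line; the names here are illustrative, not the actual rDSN identifiers:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

constexpr std::size_t CACHE_LINE_SIZE = 64;

// Each cell occupies a whole cache line, so updates hashed to
// different cells can never share a line.
struct alignas(CACHE_LINE_SIZE) cacheline_aligned_cell {
    std::atomic<int64_t> value{0};
};

static_assert(sizeof(cacheline_aligned_cell) == CACHE_LINE_SIZE,
              "each cell must occupy exactly one cache line");

class concurrent_long_adder_sketch {
public:
    // One cell per CPU, allocated up front.
    concurrent_long_adder_sketch()
        : cells_(std::max(1u, std::thread::hardware_concurrency())) {}

    void increment(int64_t delta) {
        // Hash the thread ID to pick a cell; threads on different CPUs
        // usually land on different cache lines.
        const std::size_t idx =
            std::hash<std::thread::id>()(std::this_thread::get_id()) %
            cells_.size();
        cells_[idx].value.fetch_add(delta, std::memory_order_relaxed);
    }

    int64_t value() const {
        // The counter's value is the sum over all cells.
        int64_t sum = 0;
        for (const auto &c : cells_) {
            sum += c.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::vector<cacheline_aligned_cell> cells_;
};
```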

concurrent_long_adder achieves high performance at the cost of more memory: each CPU core contributes one 64-byte cache line, so memory usage grows linearly with the number of cores. The relationship between CPU cores and memory usage is shown below:

| Number of CPUs | Memory usage |
|----------------|--------------|
| 2              | 128 bytes    |
| 4              | 256 bytes    |
| 8              | 512 bytes    |
| 16             | 1024 bytes   |
| 32             | 2048 bytes   |

Since dsn::perf_counter_number_atomic consumes 856 bytes, concurrent_long_adder consumes less memory than dsn::perf_counter_number_atomic on machines with fewer than 16 cores.

Nevertheless, due to its memory consumption, concurrent_long_adder is recommended only for counters that are updated very frequently.

2.2 striped_long_adder

Inspired by apache/kudu@0123322, striped_long_adder is implemented as a trade-off between performance and memory usage.

The idea behind this long adder is that it starts with a single base 8-byte std::atomic<int64_t>. If the long adder is not updated frequently, it consumes no extra memory; otherwise, once a collision is detected, it allocates the same amount of memory as concurrent_long_adder so that performance does not degrade.

Since more instructions are needed to detect collisions, striped_long_adder is slower than concurrent_long_adder. However, it balances performance and memory usage; therefore, striped_long_adder is used as the default long adder and is appropriate for counters that are not updated frequently. A sketch of the idea follows.
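
A minimal, hypothetical sketch of the striped64-style idea (the real implementation is in include/dsn/utility/long_adder.h; all names here are illustrative). Uncontended updates CAS into a single 8-byte base; the first failed CAS is treated as a collision and triggers a one-time allocation of cache-line-aligned per-CPU cells:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct alignas(64) striped_cell {
    std::atomic<int64_t> value{0};
};

class striped_long_adder_sketch {
public:
    void increment(int64_t delta) {
        if (!inflated_.load(std::memory_order_acquire)) {
            // Fast path: a single atomic, no extra memory.
            int64_t base = base_.load(std::memory_order_relaxed);
            if (base_.compare_exchange_strong(base, base + delta,
                                              std::memory_order_relaxed)) {
                return;
            }
            // The CAS failed, so another thread raced with us: inflate.
            inflate();
        }
        // Slow path: spread updates over cache-line-aligned cells.
        const std::size_t idx =
            std::hash<std::thread::id>()(std::this_thread::get_id()) %
            cells_.size();
        cells_[idx].value.fetch_add(delta, std::memory_order_relaxed);
    }

    int64_t value() const {
        int64_t sum = base_.load(std::memory_order_relaxed);
        if (inflated_.load(std::memory_order_acquire)) {
            for (const auto &c : cells_) {
                sum += c.value.load(std::memory_order_relaxed);
            }
        }
        return sum;
    }

private:
    void inflate() {
        // Allocate the cell array exactly once, then publish it.
        std::call_once(inflate_once_, [this] {
            cells_ = std::vector<striped_cell>(
                std::max(1u, std::thread::hardware_concurrency()));
            inflated_.store(true, std::memory_order_release);
        });
    }

    std::atomic<int64_t> base_{0};
    std::atomic<bool> inflated_{false};
    std::once_flag inflate_once_;
    std::vector<striped_cell> cells_;
};
```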

3. Usage

Because virtual functions consume extra memory and slow down execution, a template is used to wrap the long adders and provide a unified API. For details, see long_adder_wrapper in include/dsn/utility/long_adder.h; a sketch of the idea appears below.
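
A hedged sketch of the template-based wrapper (the actual interface is long_adder_wrapper in include/dsn/utility/long_adder.h; the method names below are illustrative). The concrete adder is a template parameter, so calls are resolved at compile time with no vtable and no virtual-call overhead:

```cpp
#include <cstdint>

template <typename Adder>
class long_adder_wrapper_sketch {
public:
    void increment(int64_t delta = 1) { adder_.increment(delta); }
    int64_t value() const { return adder_.value(); }

private:
    // e.g. striped_long_adder (the default) or concurrent_long_adder.
    Adder adder_;
};

// Usage: instantiate the wrapper with the desired adder type, e.g.
//   long_adder_wrapper_sketch<striped_long_adder_sketch> counter;
//   counter.increment();
```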

4. Performance

The performance test is quite simple: a number of threads are started, and each thread executes a number of atomic increments. Besides striped_long_adder and concurrent_long_adder, two other kinds of long adders are introduced for comparison.

simple_long_adder is just a wrapper around a single std::atomic<int64_t>. divided_long_adder is nearly the same as dsn::perf_counter_number_atomic except that the virtual functions are removed. The benchmark harness is sketched below.
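
A minimal sketch of the benchmark described above (the harness and names are hypothetical): start N threads, let each perform M relaxed increments, and time the whole run. simple_long_adder here is the single-atomic baseline from the tables below:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// The baseline: one shared atomic, heavily contended under many threads.
struct simple_long_adder {
    void increment(int64_t delta) {
        value_.fetch_add(delta, std::memory_order_relaxed);
    }
    std::atomic<int64_t> value_{0};
};

template <typename Adder>
double run_benchmark(int num_threads, int64_t ops_per_thread) {
    Adder adder;
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&adder, ops_per_thread] {
            for (int64_t j = 0; j < ops_per_thread; ++j) {
                adder.increment(1);
            }
        });
    }
    for (auto &t : threads) {
        t.join();
    }
    return std::chrono::duration<double>(std::chrono::steady_clock::now() -
                                         start).count();
}

int main() {
    // e.g. 4 threads x 1,000,000,000 increments each, as in section 4.2.2.
    std::cout << run_benchmark<simple_long_adder>(4, 1000000000LL) << " s\n";
    return 0;
}
```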

4.1 Hardware configurations

- CPU: Intel Xeon Processor (Cascadelake)
- Cores: 4
- Memory: 32 GB

4.2 Test results

4.2.1 2 threads with each 1,000,000,000 operations

| Long adder type       | Duration (seconds) |
|-----------------------|--------------------|
| simple_long_adder     | 41.8089            |
| divided_long_adder    | 45.3746            |
| striped_long_adder    | 34.4589            |
| concurrent_long_adder | 11.9191            |

4.2.2 4 threads with each 1,000,000,000 operations

| Long adder type       | Duration (seconds) |
|-----------------------|--------------------|
| simple_long_adder     | 94.9675            |
| divided_long_adder    | 101.167            |
| striped_long_adder    | 36.1651            |
| concurrent_long_adder | 12.3893            |

4.2.3 8 threads with each 1,000,000,000 operations

| Long adder type       | Duration (seconds) |
|-----------------------|--------------------|
| simple_long_adder     | 183.897            |
| divided_long_adder    | 185.945            |
| striped_long_adder    | 70.5208            |
| concurrent_long_adder | 47.1702            |

4.2.4 16 threads with each 1,000,000,000 operations

| Long adder type       | Duration (seconds) |
|-----------------------|--------------------|
| simple_long_adder     | 334.987            |
| divided_long_adder    | 326.552            |
| striped_long_adder    | 141.043            |
| concurrent_long_adder | 92.6725            |

Smityz commented Jan 24, 2022

Though perf_counter may not cost much time in our flame graph (captured in a read scenario),
[flame graph screenshot]

I think it's still worth optimizing it.

empiredan (Contributor, Author) replied:

> Though perf_counter may not cost much time in our flame graph (captured in a read scenario), [flame graph screenshot]
>
> I think it's still worth optimizing it.

Thanks for providing such a cool flame graph!

Yeah, I think most of our tasks that are sampled by the counter currently take several milliseconds or more to execute: they may consume lots of memory and involve network transmission or disk operations. If the L1 cache is not large enough, the counter has likely been evicted from the L1 cache and must be loaded from memory again. This will not lead to false sharing.

On the other hand, since we will adopt striped_long_adder (based on striped64) as the default implementation for the counter, it will reduce memory usage: a long task duration greatly reduces the possibility of collision, so a single 8-byte std::atomic<int64_t> is usually enough; by contrast, dsn::perf_counter_number_atomic consumes 856 bytes.
