
Feature: implement long adder to optimize the counter of new metrics system #889

Closed
Tracked by #922
empiredan opened this issue Jan 20, 2022 · 2 comments
Labels
type/enhancement Indicates new feature requests

Comments


empiredan commented Jan 20, 2022

1. Motivation

In the old metrics system, strictly speaking, a counter is not defined as an independent metric type, though one can be implemented with dsn::perf_counter_number_atomic or dsn::perf_counter_volatile_number_atomic.

Using the thread ID as the hash code, dsn::perf_counter_number_atomic tries to distribute atomic increments evenly over DIVIDE_CONTAINER. Each element of DIVIDE_CONTAINER is an 8-byte std::atomic<int64_t>. The final value of the counter is obtained by summing all the elements of DIVIDE_CONTAINER.

However, threads running on different CPUs are likely to update data within the same cache line (typically 64 bytes on x86). This leads to false sharing, which degrades performance. Compared with a single std::atomic<int64_t> used as the counter, it consumes more memory while sometimes performing even worse.

Therefore, the new metrics system needs new implementations that eliminate false sharing to improve performance while, if possible, consuming less memory.

2. Implementation

2.1 concurrent_long_adder

To eliminate false sharing, we first introduce a new struct that aligns a std::atomic<int64_t> to the cache-line size, then allocate an array of these structs in an aligned region of memory. The size of this array equals the number of CPUs.

Since threads running on different CPUs are likely to be hashed to different cache lines of this array, the probability of false sharing is significantly reduced. The implementation based on this idea is called concurrent_long_adder, sketched below.
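
A minimal sketch of this layout, assuming a 64-byte cache line; the names here are illustrative, not the actual rDSN identifiers:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

constexpr std::size_t CACHE_LINE_SIZE = 64;

// Each cell occupies a whole cache line, so updates hashed to
// different cells can never share a line.
struct alignas(CACHE_LINE_SIZE) cacheline_aligned_cell {
    std::atomic<int64_t> value{0};
};

static_assert(sizeof(cacheline_aligned_cell) == CACHE_LINE_SIZE,
              "each cell must occupy exactly one cache line");

class concurrent_long_adder_sketch {
public:
    // One cell per CPU, allocated up front.
    concurrent_long_adder_sketch()
        : cells_(std::max(1u, std::thread::hardware_concurrency())) {}

    void increment(int64_t delta) {
        // Hash the thread ID to pick a cell; threads on different CPUs
        // usually land on different cache lines.
        const std::size_t idx =
            std::hash<std::thread::id>()(std::this_thread::get_id()) %
            cells_.size();
        cells_[idx].value.fetch_add(delta, std::memory_order_relaxed);
    }

    int64_t value() const {
        // The counter's value is the sum over all cells.
        int64_t sum = 0;
        for (const auto &c : cells_) {
            sum += c.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::vector<cacheline_aligned_cell> cells_;
};
```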

concurrent_long_adder achieves high performance at the cost of more memory: each CPU core contributes one 64-byte cache line, so memory usage grows linearly with the number of cores. The relationship between CPU cores and memory usage is shown below:

| Number of CPUs | Memory usage |
|----------------|--------------|
| 2              | 128 bytes    |
| 4              | 256 bytes    |
| 8              | 512 bytes    |
| 16             | 1024 bytes   |
| 32             | 2048 bytes   |

Since dsn::perf_counter_number_atomic consumes 856 bytes, concurrent_long_adder consumes less memory than dsn::perf_counter_number_atomic on machines with fewer than 16 cores.

Nevertheless, due to its memory consumption, concurrent_long_adder is recommended only for counters that are updated very frequently.

2.2 striped_long_adder

Inspired by apache/kudu@0123322, striped_long_adder is implemented as a trade-off between performance and memory usage.

The idea behind this long adder is that it starts with a single base 8-byte std::atomic<int64_t>. If the long adder is not updated frequently, it consumes no extra memory; otherwise, once a collision is detected, it allocates the same amount of memory as concurrent_long_adder so that performance does not degrade.

Since more instructions are needed to detect collisions, striped_long_adder is slower than concurrent_long_adder. However, it balances performance and memory usage; therefore, striped_long_adder is used as the default long adder and is appropriate for counters that are not updated frequently. A sketch of the idea follows.
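
A minimal, hypothetical sketch of the striped64-style idea (the real implementation is in include/dsn/utility/long_adder.h; all names here are illustrative). Uncontended updates CAS into a single 8-byte base; the first failed CAS is treated as a collision and triggers a one-time allocation of cache-line-aligned per-CPU cells:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct alignas(64) striped_cell {
    std::atomic<int64_t> value{0};
};

class striped_long_adder_sketch {
public:
    void increment(int64_t delta) {
        if (!inflated_.load(std::memory_order_acquire)) {
            // Fast path: a single atomic, no extra memory.
            int64_t base = base_.load(std::memory_order_relaxed);
            if (base_.compare_exchange_strong(base, base + delta,
                                              std::memory_order_relaxed)) {
                return;
            }
            // The CAS failed, so another thread raced with us: inflate.
            inflate();
        }
        // Slow path: spread updates over cache-line-aligned cells.
        const std::size_t idx =
            std::hash<std::thread::id>()(std::this_thread::get_id()) %
            cells_.size();
        cells_[idx].value.fetch_add(delta, std::memory_order_relaxed);
    }

    int64_t value() const {
        int64_t sum = base_.load(std::memory_order_relaxed);
        if (inflated_.load(std::memory_order_acquire)) {
            for (const auto &c : cells_) {
                sum += c.value.load(std::memory_order_relaxed);
            }
        }
        return sum;
    }

private:
    void inflate() {
        // Allocate the cell array exactly once, then publish it.
        std::call_once(inflate_once_, [this] {
            cells_ = std::vector<striped_cell>(
                std::max(1u, std::thread::hardware_concurrency()));
            inflated_.store(true, std::memory_order_release);
        });
    }

    std::atomic<int64_t> base_{0};
    std::atomic<bool> inflated_{false};
    std::once_flag inflate_once_;
    std::vector<striped_cell> cells_;
};
```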

3. Usage

Because virtual functions consume extra memory and slow down execution, a template is used to wrap the long adders and provide a unified API. For details, see long_adder_wrapper in include/dsn/utility/long_adder.h; a sketch of the idea appears below.
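
A hedged sketch of the template-based wrapper (the actual interface is long_adder_wrapper in include/dsn/utility/long_adder.h; the method names below are illustrative). The concrete adder is a template parameter, so calls are resolved at compile time with no vtable and no virtual-call overhead:

```cpp
#include <cstdint>

template <typename Adder>
class long_adder_wrapper_sketch {
public:
    void increment(int64_t delta = 1) { adder_.increment(delta); }
    int64_t value() const { return adder_.value(); }

private:
    // e.g. striped_long_adder (the default) or concurrent_long_adder.
    Adder adder_;
};

// Usage: instantiate the wrapper with the desired adder type, e.g.
//   long_adder_wrapper_sketch<striped_long_adder_sketch> counter;
//   counter.increment();
```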

4. Performance

The performance test is quite simple: a number of threads are started, and each thread executes a number of atomic increments. Besides striped_long_adder and concurrent_long_adder, two other kinds of long adders are introduced for comparison.

simple_long_adder is just a wrapper around a single std::atomic<int64_t>. divided_long_adder is nearly the same as dsn::perf_counter_number_atomic except that the virtual functions are removed. The benchmark harness is sketched below.
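
A minimal sketch of the benchmark described above (the harness and names are hypothetical): start N threads, let each perform M relaxed increments, and time the whole run. simple_long_adder here is the single-atomic baseline from the tables below:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// The baseline: one shared atomic, heavily contended under many threads.
struct simple_long_adder {
    void increment(int64_t delta) {
        value_.fetch_add(delta, std::memory_order_relaxed);
    }
    std::atomic<int64_t> value_{0};
};

template <typename Adder>
double run_benchmark(int num_threads, int64_t ops_per_thread) {
    Adder adder;
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&adder, ops_per_thread] {
            for (int64_t j = 0; j < ops_per_thread; ++j) {
                adder.increment(1);
            }
        });
    }
    for (auto &t : threads) {
        t.join();
    }
    return std::chrono::duration<double>(std::chrono::steady_clock::now() -
                                         start).count();
}

int main() {
    // e.g. 4 threads x 1,000,000,000 increments each, as in section 4.2.2.
    std::cout << run_benchmark<simple_long_adder>(4, 1000000000LL) << " s\n";
    return 0;
}
```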

4.1 Hardware configurations

- CPU: Intel Xeon Processor (Cascadelake)
- Cores: 4
- Memory: 32 GB

4.2 Test results

4.2.1 2 threads with each 1,000,000,000 operations

| Long adder type       | Duration (seconds) |
|-----------------------|--------------------|
| simple_long_adder     | 41.8089            |
| divided_long_adder    | 45.3746            |
| striped_long_adder    | 34.4589            |
| concurrent_long_adder | 11.9191            |

4.2.2 4 threads with each 1,000,000,000 operations

| Long adder type       | Duration (seconds) |
|-----------------------|--------------------|
| simple_long_adder     | 94.9675            |
| divided_long_adder    | 101.167            |
| striped_long_adder    | 36.1651            |
| concurrent_long_adder | 12.3893            |

4.2.3 8 threads with each 1,000,000,000 operations

| Long adder type       | Duration (seconds) |
|-----------------------|--------------------|
| simple_long_adder     | 183.897            |
| divided_long_adder    | 185.945            |
| striped_long_adder    | 70.5208            |
| concurrent_long_adder | 47.1702            |

4.2.4 16 threads with each 1,000,000,000 operations

| Long adder type       | Duration (seconds) |
|-----------------------|--------------------|
| simple_long_adder     | 334.987            |
| divided_long_adder    | 326.552            |
| striped_long_adder    | 141.043            |
| concurrent_long_adder | 92.6725            |

Smityz commented Jan 24, 2022

Though perf_counter may not cost much time in our flame graph (captured in a read scenario),
[flame graph screenshot]

I think it's still worth optimizing it.

empiredan (Contributor, Author) replied:

> Though perf_counter may not cost much time in our flame graph (captured in a read scenario), [flame graph screenshot]
>
> I think it's still worth optimizing it.

Thanks for providing such a cool flame graph!

Yeah, I think most of our tasks that are sampled by the counter currently take several milliseconds or more to execute: they may consume lots of memory and involve network transmission or disk operations. If the L1 cache is not large enough, the counter has likely been evicted from the L1 cache and must be loaded from memory again. This will not lead to false sharing.

On the other hand, since we will adopt striped_long_adder (based on striped64) as the default implementation for the counter, it will reduce memory usage: a long task duration greatly reduces the possibility of collision, so a single 8-byte std::atomic<int64_t> is usually enough; by contrast, dsn::perf_counter_number_atomic consumes 856 bytes.
