1. Motivation
In the old metrics system, strictly speaking, the counter is not defined as an independent metric type, though it can be implemented by `dsn::perf_counter_number_atomic` or `dsn::perf_counter_volatile_number_atomic`.
Using the thread ID as the hash code, `dsn::perf_counter_number_atomic` tries to distribute atomic increments evenly over `DIVIDE_CONTAINER`. Each element in `DIVIDE_CONTAINER` is an 8-byte `std::atomic<int64_t>`. The final value of the counter is obtained by summing all the elements of `DIVIDE_CONTAINER`.
However, threads running on different CPUs are likely to update data within the same cache line (typically 64 bytes on x86). This leads to false sharing, which degrades performance, as illustrated below. Compared with a single `std::atomic<int64_t>` as the counter, it consumes more memory while sometimes performing even worse.
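A minimal sketch of the old layout (the names and the slot count are illustrative, not the real `dsn::perf_counter_number_atomic` code): the atomics are packed back to back, so eight 8-byte atomics share one 64-byte cache line, and increments from different threads keep invalidating each other's lines.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>

struct packed_counter
{
    static constexpr int kSlots = 107;    // illustrative slot count
    std::atomic<int64_t> slots[kSlots]{}; // 8 bytes each, no padding between them

    void increment()
    {
        // Hash by thread ID; neighboring slots still share cache lines,
        // so two threads hashed to adjacent slots falsely share a line.
        size_t idx = std::hash<std::thread::id>()(std::this_thread::get_id()) % kSlots;
        slots[idx].fetch_add(1, std::memory_order_relaxed);
    }

    int64_t value() const
    {
        int64_t sum = 0;
        for (int i = 0; i < kSlots; ++i) {
            sum += slots[i].load(std::memory_order_relaxed);
        }
        return sum;
    }
};
```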
Therefore, in the new metrics system, we need new implementations that eliminate false sharing, improving performance and, if possible, consuming less memory.
2. Implementation
2.1 concurrent_long_adder
To eliminate false sharing, we can first introduce a new struct that aligns a `std::atomic<int64_t>` to a cache line, then allocate an array of these structs in an aligned region of memory. The size of this array is equal to the number of CPUs.
Since threads running on different CPUs are likely to be hashed to different cache lines of this array, the probability of false sharing is significantly reduced. The implementation based on this idea is called `concurrent_long_adder`; a minimal sketch follows.
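This is a simplified sketch of the idea, not the actual rDSN implementation (the class and member names here are invented):

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Each cell occupies a full 64-byte cache line, so two cells can never
// share a line and false sharing between cells is impossible.
struct alignas(64) cacheline_aligned_cell
{
    std::atomic<int64_t> value{0};
};

class concurrent_adder_sketch
{
public:
    // One cell per CPU. Note: std::vector of an over-aligned type needs
    // C++17 for the allocator to honor alignas(64).
    concurrent_adder_sketch()
        : _cells(std::max(1u, std::thread::hardware_concurrency()))
    {
    }

    void increment()
    {
        // Hash the thread ID to pick a cell; threads on different CPUs are
        // likely to land on different cells, i.e. different cache lines.
        size_t idx =
            std::hash<std::thread::id>()(std::this_thread::get_id()) % _cells.size();
        _cells[idx].value.fetch_add(1, std::memory_order_relaxed);
    }

    int64_t value() const
    {
        int64_t sum = 0;
        for (const auto &cell : _cells) {
            sum += cell.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::vector<cacheline_aligned_cell> _cells;
};
```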
`concurrent_long_adder` achieves high performance at the cost of more memory: one 64-byte cache line per CPU, so the more CPU cores the machine has, the more memory it consumes. The relationship between CPU cores and memory usage is shown below:
| Number of CPUs | Memory usage |
| -------------- | ------------ |
| 2              | 128 bytes    |
| 4              | 256 bytes    |
| 8              | 512 bytes    |
| 16             | 1024 bytes   |
| 32             | 2048 bytes   |
Since `dsn::perf_counter_number_atomic` consumes 856 bytes, `concurrent_long_adder` consumes less memory than `dsn::perf_counter_number_atomic` on machines with fewer than 16 cores.
Nevertheless, due to its memory consumption, `concurrent_long_adder` is recommended only for counters that are updated very frequently.
2.2 striped_long_adder
Inspired by apache/kudu@0123322, `striped_long_adder` is implemented as a trade-off between performance and memory usage.
The idea behind this long adder is that it starts with a base 8-byte `std::atomic<int64_t>`. If the long adder is not updated frequently, it consumes no extra memory; otherwise, once a collision is detected, it allocates the same amount of memory as `concurrent_long_adder` so that performance does not degrade.
Since extra instructions are needed to detect collisions, `striped_long_adder` is slower than `concurrent_long_adder`. However, it balances performance against memory usage. Therefore, `striped_long_adder` is used as the default long adder and is appropriate for counters that are not updated frequently. The sketch below shows the core idea.
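A minimal sketch of the striped idea, reusing `concurrent_adder_sketch` from above (the real `striped_long_adder` follows Kudu's striped64 port and handles retries, rehashing, and cell padding more carefully):

```cpp
#include <atomic>
#include <cstdint>

class striped_adder_sketch
{
public:
    ~striped_adder_sketch() { delete _cells.load(); }

    void increment()
    {
        concurrent_adder_sketch *cells = _cells.load(std::memory_order_acquire);
        if (cells == nullptr) {
            // Uncontended fast path: a single CAS on the base value.
            int64_t base = _base.load(std::memory_order_relaxed);
            if (_base.compare_exchange_strong(base, base + 1, std::memory_order_relaxed)) {
                return;
            }
            // The CAS lost: another thread updated concurrently, so inflate
            // to the per-CPU cell array used by concurrent_long_adder.
            cells = inflate();
        }
        cells->increment();
    }

    int64_t value() const
    {
        // The total is the base plus the cells (if they were ever allocated).
        int64_t sum = _base.load(std::memory_order_relaxed);
        const concurrent_adder_sketch *cells = _cells.load(std::memory_order_acquire);
        if (cells != nullptr) {
            sum += cells->value();
        }
        return sum;
    }

private:
    concurrent_adder_sketch *inflate()
    {
        concurrent_adder_sketch *fresh = new concurrent_adder_sketch();
        concurrent_adder_sketch *expected = nullptr;
        if (!_cells.compare_exchange_strong(expected, fresh, std::memory_order_acq_rel)) {
            delete fresh; // another thread inflated first; use its cells
            return expected;
        }
        return fresh;
    }

    std::atomic<int64_t> _base{0};
    std::atomic<concurrent_adder_sketch *> _cells{nullptr};
};
```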
3. Usage
Because virtual functions consume extra memory and slow down execution, a template is used to wrap the long adders and provide a unified API. For details, see `long_adder_wrapper` in `include/dsn/utility/long_adder.h`.
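The sketch below shows the idea only; see `include/dsn/utility/long_adder.h` for the actual interface of `long_adder_wrapper`. Because the concrete adder is a template parameter, calls are resolved at compile time, so the object carries no vtable pointer and no virtual dispatch happens on the hot path:

```cpp
#include <cstdint>

template <typename Adder>
class long_adder_wrapper_sketch
{
public:
    void increment() { _adder.increment(); }
    int64_t value() const { return _adder.value(); }

private:
    Adder _adder; // the concrete adder, chosen at compile time
};

// Usage: pick the adder type when declaring the counter.
// long_adder_wrapper_sketch<striped_adder_sketch> counter;
// counter.increment();
```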
4. Performance
The performance test is quite simple: a number of threads are started, and each thread executes a number of atomic increments. Besides `striped_long_adder` and `concurrent_long_adder`, another two kinds of long adders are introduced to compare performance.
`simple_long_adder` is just a wrapper around a single `std::atomic<int64_t>`. `divided_long_adder` is nearly the same as `dsn::perf_counter_number_atomic`, except that the virtual functions are removed. A sketch of the test loop follows.
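A sketch of the benchmark driver described above, written against the illustrative adder classes from the earlier sketches (thread and operation counts are parameters):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

template <typename Adder>
void run_benchmark(int num_threads, int64_t num_ops)
{
    Adder adder;
    const auto start = std::chrono::steady_clock::now();

    // Start num_threads threads, each performing num_ops atomic increments
    // on the shared adder.
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&adder, num_ops] {
            for (int64_t j = 0; j < num_ops; ++j) {
                adder.increment();
            }
        });
    }
    for (auto &t : threads) {
        t.join();
    }

    const auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                                std::chrono::steady_clock::now() - start)
                                .count();
    std::printf("threads=%d value=%lld time=%lldms\n",
                num_threads,
                static_cast<long long>(adder.value()),
                static_cast<long long>(elapsed_ms));
}
```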
4.1 Hardware configurations
4.2 Test results
4.2.1 2 threads with 1,000,000,000 operations each
4.2.2 4 threads with 1,000,000,000 operations each
4.2.3 8 threads with 1,000,000,000 operations each
4.2.4 16 threads with 1,000,000,000 operations each

Though `perf_counter` may not cost much time in our flame graph (captured in read scenarios), I think it is still worth optimizing.
Thanks for providing such a cool flame graph!
Yeah, I think most of the tasks currently sampled by the counter run for several milliseconds or more: they may consume lots of memory and involve network transmission or disk operations. If the L1 cache is not large enough, the counter has likely been evicted from the L1 cache and must be loaded from memory again; this does not lead to false sharing, though.
On the other hand, since we will adopt `striped_long_adder`, based on striped64, as the default implementation for the counter, it will reduce memory usage: a long task duration greatly reduces the possibility of collision, so a single 8-byte `std::atomic<int64_t>` is usually enough; by contrast, `dsn::perf_counter_number_atomic` consumes 856 bytes.