This document contains some guidelines for developing tatami libraries. I made this up as I went along, so not all libraries are fully compliant, though I try my best to update things to meet these (shifting) standards.
Types should be named with PascalCase. Variables and functions should be named with snake_case.
Template parameters should have a trailing underscore to distinguish them from non-template parameters. This applies regardless of whether the template parameter is a type.
Classes should not have any public members. Access to members should be granted via getters or setters. All members should be prefixed by my_ to avoid confusion with arguments in the constructor and initialization list.
Structs should have public members only, with no private members or methods. These are treated as passive data carriers with no additional functionality. A non-default constructor is allowed, though.
(Some of these ideas were taken from Google's C++ style guide.)
Template parameters should only contain types or flags that influence the type; see the logic described below.
Class templates are more limited than their function counterparts in terms of argument deduction and the positioning of defaults (at least as of C++17). To avoid an overly frustrating user experience, we allow default arguments for template parameters that could have been deduced from a constructor. This is a rather reluctant compromise, as the presence of defaults increases the risk of strange type mismatch errors or implicit conversions. However, it is more palatable when the default is the base class of all possible input types, where mismatches and conversions are no longer a concern. This pattern is occasionally used to offer users a choice between run-time and compile-time polymorphism.
template<typename Thing_ = Base>
class Foo {
public:
    Foo(Thing_ x) {}
};

Foo<> x(Base{});
Foo<> y(Derived1{});
Foo<> z(Derived2{});
Foo<Derived2> z2(Derived2{}); // potential devirtualization.
Functions should not have default arguments. Instead, functions should accept an Options struct that contains all arguments with their default values. This avoids confusion when a caller needs to specify a subset of those arguments, or if we need to add more options later. Users willing to accept all default arguments can simply call the function with {}, which is convenient enough that no overload is needed.
If an argument is optional but has no sensible default, it should be wrapped in a std::optional. This is more explicit than using a sentinel value (e.g., -1 for unsigned integers) and more concise than passing in a separate bool for this purpose.
Default values for integer arguments should consider using sanisizer::cap() to ensure that the value is below the maximum guaranteed by the standard for that integer type. For example, a std::size_t is only guaranteed to store up to 65535, so setting a greater default value could result in compiler warnings on some unusual implementations. This scenario is not uncommon when setting buffer sizes, where a larger default value is preferred for performance on typical systems but lies beyond the standard-guaranteed upper limit.
Template parameters should only contain types or flags that influence the type, e.g., sparse_, oracle_. It is tempting to move other parameters into the template arguments to convert run-time checks into compile-time checks for efficiency. However, this is not ergonomic if those parameters are only known at run time, as callers must manually write if statements to switch between instantiations of the template function. I also suspect that this will bloat the binaries, especially if many combinations of arguments must be instantiated at compile time. Thus, non-type-related template parameters should be used sparingly - usually only in performance-critical loops.
Template functions should refrain from adding defaults for their template parameters. Most template arguments can be deduced at the call site, so the corresponding defaults would not be used anyway. If deduction fails, the presence of defaults can result in confusing errors when the compiler attempts to match the supplied type to the default (or worse, silently performs a conversion). The obvious exception is a template parameter that defines a non-deducible output type, in which case a "sensible" default may be convenient and self-documenting.
In the template parameter list, non-deducible template parameters without defaults should always appear first, followed by the non-deducible parameters with defaults, and finally all deducible parameters, in the order in which they appear in the types of the function arguments. This approach ensures that the non-deducible parameters can be easily specified by the user.
Lambdas should be written in their full form, as shown below:
[...] (...) -> return_type {
    /* body */
}
For lambdas that have no return value, a void return should be explicitly specified. Similarly, lambdas that accept no arguments should still be written with ().
The goal is to more clearly distinguish lambda definitions from IIFEs.
Speaking of which, IIFEs should be written in their "canonical" form:
[...]{
    /* body */
}()
Developers should keep in mind that IIFEs use auto type deduction rules for the return type. This means that they will never return a reference type, which can be unexpected when using an IIFE to replace a ternary expression, e.g., for multi-line expressions or compile-time switches. In these (relatively rare) cases, we should instead write out a full lambda and invoke it:
const auto& x = [...]() -> const auto& {
    /* body */
}();
Two-dimensional accesses typically involve the following pattern, e.g., for row-major matrices:
ptr[r * nc + c]; // gets matrix entry (r, c)
If the integer type (typically promoted to int) is not large enough to hold the product, our code will be incorrect due to the overflow. Developers should coerce all operands to the array's size type before any arithmetic to ensure that no overflow occurs. This is most easily done with sanisizer::nd_offset():
// Equivalent to ptr[c + nc * r] after casting all values to std::size_t.
ptr[sanisizer::nd_offset<std::size_t>(c, nc, r)];
If we are allocating contiguous memory to hold a high-dimensional array, we should use sanisizer::product() to check that the product does not overflow the allocation's size type. Subsequent accesses into this allocation (e.g., via sanisizer::nd_offset() or friends) can omit this check as long as they reference valid positions.
typedef std::vector<double> Buffer;
Buffer buffer(sanisizer::product<typename Buffer::size_type>(nr, nc));
Merely constructing (not dereferencing!) a pointer beyond one-past-the-end of an array is undefined behavior. Such a pointer can be inadvertently constructed when performing strided access in the middle of the matrix. For example, given a row-major matrix:
int col = 5;
auto adj_ptr = ptr + col;
for (int r = 0; r < nr; ++r) {
    row_contents[r] = *adj_ptr;
    adj_ptr += nc;
}
After the last iteration, adj_ptr will be equal to ptr + nr * nc + col, which is not valid.
A slightly better approach is to use offsets in the loop body to ensure that an invalid pointer is never constructed.
int col = 5;
std::size_t offset = col; // make sure it's std::size_t to avoid overflow.
for (int r = 0; r < nr; ++r) {
    row_contents[r] = ptr[offset];
    offset += nc;
}
Unfortunately, the addition in the last iteration could overflow the offset type. This is technically fine for std::size_t, where overflow wraps around, but could be undefined behavior for other (signed) types, even though the offset is not used after the loop. So, the safest approach is to just compute the 2-dimensional index directly in the loop body. This ensures that only valid offsets are ever computed, assuming that the loop stops at the end of the matrix.
int col = 5;
for (int r = 0; r < nr; ++r) {
    // compute the offset as 'col + nc * r' inside the loop.
    row_contents[r] = ptr[sanisizer::nd_offset<std::size_t>(col, nc, r)];
}
Modern compilers (well, Clang and GCC, at least) can optimize out the multiplication so there is no performance penalty compared to incrementing an offset. This approach is easier to reason about and is more amenable to vectorization as there are no dependencies in the loop body.
Some functions store their output in a user-supplied pointer to a pre-allocated array. Such functions should not assume that the array has been zeroed. Incorrectly making this assumption can lead to subtle bugs when the functions are called with non-zeroed buffers.
To test that this assumption is not present in a function X(), my test suites fill the output buffer with a non-zero initial value from the initial_value() function below. After calling X() with output.data(), we can compare output against the expected output to verify that the function ignores the non-zero initial value. If the expected output is not readily available, we could instead initialize a separate vector with another call to initial_value() and call X() on it. Any inconsistency between the two results would indicate that the initial value is affecting the results.
// Taken from https://github.com/libscran/scran_tests:
inline int initial_value() {
    static int counter = 0;
    if (counter == 255) { // fits in all numeric types.
        counter = 1;
    } else {
        ++counter;
    }
    return counter;
}
// Initializing the output buffer.
std::vector<int> output(n, initial_value());
The real utility of initial_value() lies in its ability to be inserted into function overloads that return std::vector objects. This is convenient, as simply calling the vector-returning overload is sufficient to test all aspects of the base function, including independence from the initial contents of the supplied buffer. (Otherwise, the vector-returning overload would always zero-initialize its vector and mask potential dependencies on the buffer's initial contents in the base function.)
// Defined for test binaries only.
#define TEST_INIT_VALUE initial_value()

// "Base" function that accepts a pointer.
void X(int n, double* output);

// Overload that returns a std::vector.
std::vector<double> X(int n) {
    std::vector<double> output(n
#ifdef TEST_INIT_VALUE
        , TEST_INIT_VALUE
#endif
    );
    X(n, output.data());
    return output;
}
Historically, floating-point calculations were parallelized in a manner that ensured that the number of workers did not affect the result. This required some care in designing algorithms so that exactly the same result was obtained regardless of how the work was partitioned. The aim was to allow users to freely alter the number of threads without compromising reproducibility. This is most relevant when dealing with downstream algorithms based on ranks, where slight floating-point differences can lead to discrete changes in rank.
With the benefit of hindsight, this policy is probably unnecessary.
- Most users will not change the number of threads after their analysis, so they won't even know that there is a difference due to the number of threads.
- The main reason to change the number of threads is to account for differences in hardware, which usually means a different CPU is involved; exact floating-point reproducibility is not guaranteed across CPUs anyway.
- This policy is inconsistent with the acceptance of different results between matrix representations due to sparsity or row/column preferences. The obvious example is the per-row/column variance calculations, and no one has complained so far.
All future algorithmic choices in tatami can disregard this policy. This allows us to use more efficient methods for partitioning work between cores. Note that the differences in results due to parallelization should be limited to floating-point error, i.e., the result should be the same under exact math.
False sharing can be mostly avoided by writing code that uses automatic variables allocated on each thread's stack. This is good practice regardless of whether we're in a multi-threaded context. If threads need to write to the heap, false sharing can be mitigated by creating separate heap allocations within each thread, e.g., by constructing a thread-specific std::vector. This gives malloc a chance to use different memory arenas (depending on the implementation) that separate each thread's allocations.
We could guarantee protection against false sharing by aligning all heap allocations to hardware_destructive_interference_size, but then we would need to override std::allocator in all STL containers, which seems too intrusive for general use. Obviously, writing to contiguous memory in the heap across multiple threads should be done sparingly - typically only to the output buffer once all calculations are complete.
An interesting sidenote is that each thread will typically create its own tatami::Extractor instance on the heap. Modifications to the data members of each tatami::Extractor instance during fetch() calls could be another source of false sharing. This might be avoidable by adding alignas(hardware_destructive_interference_size) to the class definition, but this is only supported by very recent compilers and I'm not even sure if it's legal when the object's natural alignment is stricter; so, we'll just stick to our ostrich strategy of hoping that malloc takes care of it.
Ah, vectorization: the policy with the most flip-flopping in this guide.
Most of the loops in tatami are trivially auto-vectorizable, so no further work is necessary when compiling with the relevant flags, e.g., -O3, -march=. There might be a slight inefficiency due to the need to check against aliasing pointers, but this is fine; the check is performed outside of the loop and should not have much impact on performance. On the rare occasion that the check runs frequently relative to the loop body, branch prediction should eliminate most of its cost in the expected case of non-aliasing pointers. Auto-vectorization also tends to use unaligned SIMD loads and stores, but it's hard to imagine doing otherwise in library code that might take pointers from anywhere (e.g., R). Fortunately, it seems like unaligned instructions are pretty performant on modern CPUs; see the commentary here and here.
That said, there are two common scenarios where auto-vectorization is not possible.
- Accessing a sparse vector by index. If the compiler cannot prove that the indices are unique, different loop iterations are not guaranteed to be independent. Here, I went through many possibilities to inform the compiler that vectorization can be performed.
  - Using OpenMP SIMD to force vectorization. While this works, it is often subject to a more heavy-handed interpretation by compilers. Upon seeing #pragma omp simd, GCC will forcibly vectorize the loop, even if doing so would decrease performance according to its cost model. MSVC goes further and enables fast floating-point math inside the loop, which is not generally desirable.
  - Encouraging vectorization with compiler-specific pragmas that indicate there are no dependencies between loop iterations, e.g., #pragma GCC ivdep or #pragma clang loop vectorize(assume_safety). This also works fairly well, but it is hard to keep track of exactly how each compiler interprets these pragmas. Are all dependencies ignored, or just the non-proven ones? Needless to say, these pragmas are non-portable and require careful testing on each platform.

  In the end, I suspect that it is not worth the effort to optimize this access pattern. There are plenty of excuses: the lack of efficient gather/scatter instructions on many CPUs, likely memory bottlenecks in sparse data, etc.
- Calling <cmath> functions inside the loop, primarily for the delayed math operations. Vectorization is precluded by the errno side-effect and, for some functions like std::log, the need for -ffast-math to use vectorized replacements. While compiling with -fno-math-errno seems fine, using fast math in general is obviously a less palatable option, and I can't cajole GCC into using the (unexported) vectorized log function without it. So, the solution is to just write a variant of the delayed operation helper that uses a vectorized log implementation from an external library like Eigen. Then, users can choose between the standard helper or the vectorized variant that requires an extra dependency.
The standard only provides weak guarantees on the size of each integer type, e.g., an int is only guaranteed to be at least 16 bits. If a larger minimum size is needed, consider using long or long long. If a precise integer size is needed, consider using some of the types from <cstdint>. (A precise type is more predictable across platforms but may not always be available.)
If any of the fixed-size integers or size_t are used, they should be namespaced with std::. The <cstddef> header should also be included when using std::size_t, so as to avoid relying on a compiler-dependent alias.
std::size_t is the most appropriate type for indexing arrays via pointers, as it can represent the size of the largest possible array. That said, we should try to avoid spamming std::size_t everywhere, as it is not guaranteed to be able to represent other integer types. For example, an implementation could define a 16-bit std::size_t and a 32-bit int. In general, the most appropriate integer type should be chosen to avoid casts and the associated risks of overflow/wrap-around.
In tatami, the Index_ template parameter is used to represent the row and column indices. We expect that the user's choice of Index_ is large enough to store the extents of both dimensions, i.e., the number of rows/columns. This also means that Index_ is often a good choice for internal calculations that involve a single dimension, as the results of such calculations are usually non-negative and less than the extent. It can also be safely used to represent 1-based indices, as all indices will still be no greater than the dimension extent after incrementing.
Additionally, we expect that the dimension extents will fit into a std::size_t. This is because tatami::fetch() accepts a pointer to an array in which the matrix values might be stored, and addressing the elements of an array requires a std::size_t. Under this assumption, indices and extents that are typically stored as Index_ can be safely converted to std::size_t without wrap-around, even if Index_ is a larger type.
Indexed iteration over an arbitrary container x should use decltype(x.size()). This ensures that the index type is large enough to span the size of the container. Of course, this is only relevant if the size of the container is not subject to other constraints - for example, if we know that the container has length equal to the number of rows, and the number of rows fits into an int, it is obviously safe to use an int to iterate over the container.
auto n = x.size();
for (decltype(n) i = 0; i < n; ++i) {
    // do something here.
}
When allocating or resizing a container, we must consider the possibility that the new length is greater than the container's size type. If the implicit cast wraps around, we would silently create a vector that is smaller than intended. This can be prevented by using sanisizer functions to check for wrap-around:
x.resize(sanisizer::cast<decltype(x.size())>(new_length));
// For new objects:
auto x2 = sanisizer::create<std::vector<double> >(new_length);
In the specific case of resizing to a dimension extent in tatami, we could instead use some specialized wrappers around the sanisizer functions. These implement some optimizations by exploiting the assumption that extents can be safely cast to std::size_t.
tatami::resize_container_to_Index_size(x, new_length);
auto x2 = tatami::create_container_of_Index_size(x, new_length);
This protection may be omitted for calls to a container's reserve() method, if one exists. A smaller-than-expected reservation from wrap-around is mostly harmless, as the container will simply be expanded upon insertion. (Nonetheless, performance degradation can be avoided by using sanisizer::cap() to reserve up to the maximum value of the size type.) Insertions beyond the container's size type will eventually result in a std::bad_alloc.
For STL vectors with the default allocator, we can be fairly confident that size_type is no greater than std::size_t in any hypothetical implementation. The requirements of std::vector::data() demand that the vector holds its data in an underlying array, which must have a maximum size representable by std::size_t. Moreover, a call to the default allocator must yield an array whose length fits inside std::size_t, and it would be rather unusual - perhaps impossible? - for a std::vector implementation to perform multiple allocations to form a region of contiguous storage greater than std::size_t. In addition, it is most pragmatic to assume that std::size_t is sufficient to represent the length of STL vectors, as this simplifies the overloads. Many of my functions accept an arbitrary pointer to input/output data so that they can be used with pointers from new or arrays from other languages' FFIs. If we assume that std::size_t is enough, the same function can be used with std::vector::data() and the std::vector overload can be a simple wrapper.
An annoying quirk of C++ is that distances between pointers into the same array (and possibly iterator differences in general) are not guaranteed to fit into the difference type. For example, std::ptrdiff_t is typically the signed counterpart of std::size_t, and an array size that fits in the latter may not fit in the former. This annoyance is compounded by the fact that many STL algorithms return iterators, e.g., std::max_element, std::lower_bound. If we need a positional index, we need to convert the returned iterator back to an index via subtraction. We can do so safely after checking with the sanisizer::can_ptrdiff() function:
std::vector<double> x(vec_length);
sanisizer::can_ptrdiff<decltype(x.begin())>(vec_length);
auto max_it = std::max_element(x.begin(), x.end());
auto max_idx = max_it - x.begin(); // known to be safe after can_ptrdiff().
In practice, this may not be necessary, as glibc 2.30 prohibits allocations beyond PTRDIFF_MAX. (Presumably the same restriction applies to std::vector.) However, this is an implementation choice, and the standard explicitly mentions that the pointer difference may not fit in std::ptrdiff_t in a conforming implementation! So, better to be safe than sorry.