- Abstract
- Design goals
- Features
- Integration
- Usage
- Build system
- Current state
- License
- Execute unit tests
ZACC is a human-readable and extensible computation abstraction layer. Using ZACC and the ZACC build system, you can write and compile code once and execute it on target machines, unleashing their potential.
It is still under development, in step with the development of cacophony.
Feel free to report issues and bugs to the issue tracker on GitHub.
There are a few SIMD libraries available, like Eigen or Agner Fog's vector class library, each pursuing the same goal: accelerate your algorithms by using SIMD instructions.
The ZACC implementation has these goals:
- Coding as if you would write vanilla C++. `std::cout << (zint(32) % 16) << std::endl;` prints `[0, 0, 0, 0]` if SSE extensions are used.
- DRY. Write once, run faster everywhere.
- Runtime feature selection. The dispatcher checks the system features and selects the best fitting implementation.
- Easy integration. ZACC offers cmake scripts to build your project.
- Portability. ZACC accelerated projects should be able to run on any OS and any processor.
- Speed. Although ZACC may not be the most highly optimized library in the world, speed combined with great usability is a high priority.
- Linear algebra support
- Arithmetic operations
- Conditional operations
- Rounding operations
- Standard functions like abs, min, max, etc...
- Trigonometric functions (sin, cos, tan)
- Platform detection
- Runtime dispatching
- Kernel infrastructure
- Extended algorithms (STL-compatible)
- Uses vanilla C++14
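To make the "coding as if you would write vanilla C++" goal more concrete, here is a minimal sketch of how operator overloading lets a multi-lane value be used like a scalar. `lanes4` and `render` are hypothetical names invented for this illustration; this is not ZACC's `zint`, and it uses a plain array instead of SIMD registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a 4-lane integer vector (NOT ZACC's zint).
struct lanes4
{
    std::array<std::int32_t, 4> lanes;

    // Broadcast a scalar to every lane, similar in spirit to zint(32).
    lanes4(std::int32_t value) : lanes{{value, value, value, value}} {}
};

// Lane-wise remainder; the scalar divisor applies to all lanes.
inline lanes4 operator%(lanes4 lhs, std::int32_t rhs)
{
    assert(rhs != 0 && "division by zero");
    for (auto &lane : lhs.lanes)
        lane %= rhs;
    return lhs;
}

// Stream the lanes as "[a, b, c, d]".
inline std::ostream &operator<<(std::ostream &os, const lanes4 &v)
{
    os << '[';
    for (std::size_t i = 0; i < v.lanes.size(); ++i)
        os << (i ? ", " : "") << v.lanes[i];
    return os << ']';
}

// Convenience helper: render a vector to a string.
inline std::string render(const lanes4 &v)
{
    std::ostringstream os;
    os << v;
    return os.str();
}
```

With these definitions, `std::cout << (lanes4(32) % 16)` reads like scalar code while operating on four lanes at once. A real SIMD library would back the type with vector registers rather than a `std::array`.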
The project is available either as a git submodule or as a release.
If you choose the submodule route, add it via `git submodule add https://github.com/zz-systems/zacc.git`.
Your project must use CMake in order to use ZACC and the ZACC build system.
To execute an accelerated algorithm, you need a kernel interface, a kernel implementation and an entrypoint.
The kernel interface is the connection between the vectorized code in satellite assemblies and the main application. The separation is necessary, because the kernel implementation uses vector types, which must not appear in the main application and are hidden in satellite assemblies.
The vital function mapping for the dispatcher is provided by `system::kernel_interface<_KernelInterface>` (the dispatcher relies on `operator()(...)` overloads).
Three methods are already mapped; you have to declare them in the interface and implement them in the kernel:
- `run(output_container &output)`
- `run(const input_container &input, output_container &output)`
- `configure(any argument...)`
You can extend or change the mappings with your custom implementation. Also, you need to specify the input and output container types and provide a name for the kernel.
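The idea of mapping `operator()` overloads onto named methods can be sketched roughly as follows. This is only an illustration under assumptions: `call_mapped`, `scale_interface`, `scale_kernel` and `scaled_indices` are hypothetical names, and the real `system::kernel_interface` may work differently.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical kernel interface with the two mapped methods.
struct scale_interface
{
    using output_container = std::vector<int>;

    virtual ~scale_interface() = default;
    virtual void configure(int factor) = 0;
    virtual void run(output_container &output) = 0;
};

// Hypothetical wrapper (not ZACC's kernel_interface): forwards
// operator() overloads to the named methods of the interface.
template <typename Interface>
struct call_mapped : Interface
{
    using output_container = typename Interface::output_container;

    // operator()(output) -> run(output)
    void operator()(output_container &output) { this->run(output); }

    // any other argument list -> configure(args...)
    template <typename... Args>
    void operator()(Args &&... args)
    {
        this->configure(std::forward<Args>(args)...);
    }
};

// A toy kernel: writes index * factor into the output.
struct scale_kernel : call_mapped<scale_interface>
{
    int factor = 1;

    void configure(int new_factor) override { factor = new_factor; }

    void run(output_container &output) override
    {
        for (std::size_t i = 0; i < output.size(); ++i)
            output[i] = static_cast<int>(i) * factor;
    }
};

// Drive the kernel purely through operator(), as a dispatcher might.
inline std::vector<int> scaled_indices(std::size_t count, int factor)
{
    scale_kernel kernel;
    call_mapped<scale_interface> &proxy = kernel;

    proxy(factor);                  // resolves to configure(factor)
    std::vector<int> output(count);
    proxy(output);                  // resolves to run(output)
    return output;
}
```

The caller only needs the call operator, so a dispatcher written against it does not have to know the method names. In this sketch the non-template overload wins only for a non-const lvalue container; anything else is routed to `configure`.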
Below is an example Mandelbrot kernel interface, available in the examples.
#include <vector>
#include "zacc.hpp"
#include "math/matrix.hpp"
#include "util/algorithm.hpp"
#include "system/entrypoint.hpp"
#include "system/kernel_interface.hpp"
using namespace zacc;
using namespace math;
struct __mandelbrot
{
    using output_container = std::vector<int>;
    using input_container = std::vector<int>;
    static constexpr auto kernel_name() { return "mandelbrot"; }
    virtual void configure(vec2<int> dim, vec2<float> cmin, vec2<float> cmax, size_t max_iterations) = 0;
    virtual void run(output_container &output) = 0;
};
using mandelbrot = system::kernel_interface<__mandelbrot>;
Now that you have specified the kernel interface, you can write the implementation. Keep in mind that C++'s built-in if/else does not work with vector types, so you need to rethink your control flow and use branchless arithmetic. Nonetheless, the implementation does not differ much from the canonical Mandelbrot implementation and is able to use the SSE2, SSE3, SSE4, FMA, AVX and AVX2 features of the host processor, all without having to touch intrinsics.
Write once, run faster everywhere :)
#include "zacc.hpp"
#include "math/complex.hpp"
#include "math/matrix.hpp"
#include "util/algorithm.hpp"
#include "system/kernel.hpp"
#include "../interfaces/mandelbrot.hpp"
using namespace zacc;
using namespace math;
DISPATCHED struct mandelbrot_kernel : system::kernel<mandelbrot>,
                                      allocatable<mandelbrot_kernel, arch>
{
    vec2<zint> _dim;
    vec2<zfloat> _cmin;
    vec2<zfloat> _cmax;
    size_t _max_iterations;
    virtual void configure(vec2<int> dim, vec2<float> cmin, vec2<float> cmax, size_t max_iterations) override
    {
        _dim = dim;
        _cmax = cmax;
        _cmin = cmin;
        _max_iterations = max_iterations;
    }
    virtual void run(mandelbrot::output_container &output) override
    {
        // populate output container
        zacc::generate<zint>(std::begin(output), std::end(output), [this](auto i)
        {
            // compute 2D-position from 1D-index
            auto pos = reshape<vec2<zfloat>>(make_index<zint>(zint(i)), _dim);
            // map the position into the complex plane [cmin, cmax]
            zcomplex<zfloat> c(_cmin.x + pos.x / zfloat(_dim.x - 1) * (_cmax.x - _cmin.x),
                               _cmin.y + pos.y / zfloat(_dim.y - 1) * (_cmax.y - _cmin.y));
            zcomplex<zfloat> z = 0;
            bfloat done = false;
            zint iterations = 0;
            for (size_t j = 0; j < _max_iterations; j++)
            {
                // done when magnitude is >= 2 (or square magnitude is >= 4)
                done = done || z.sqr_magnitude() >= 4.0;
                // compute next complex if not done
                z = z
                    .when(done)
                    .otherwise(z * z + c);
                // increment if not done
                iterations = iterations
                    .when(done)
                    .otherwise(iterations + 1);
                // break if all elements are done
                if (is_set(done))
                    break;
            }
            return iterations;
        });
    }
};
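The branchless pattern used in the kernel above (`value.when(done).otherwise(next)`) can be illustrated without ZACC. The sketch below emulates four lanes with plain arrays; `lane_count`, `blend` and `mandelbrot_iterations` are hypothetical names chosen for this example, and the per-lane loops stand in for what a SIMD type would do in a single instruction.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t lane_count = 4;   // assumed vector width

using floats = std::array<float, lane_count>;
using ints   = std::array<int, lane_count>;
using masks  = std::array<bool, lane_count>;

// Lane-wise select: keep `kept` where the mask is set, else take `updated`.
// Every lane runs the same code; only the chosen value differs.
template <typename T>
std::array<T, lane_count> blend(const masks &mask,
                                const std::array<T, lane_count> &kept,
                                const std::array<T, lane_count> &updated)
{
    std::array<T, lane_count> result{};
    for (std::size_t i = 0; i < lane_count; ++i)
        result[i] = mask[i] ? kept[i] : updated[i];
    return result;
}

// Iteration counts for four points c = cre + i * cim, computed in lock-step.
inline ints mandelbrot_iterations(const floats &cre, const floats &cim,
                                  int max_iterations)
{
    floats zre{}, zim{};
    ints iterations{};
    masks done{};

    for (int j = 0; j < max_iterations; ++j)
    {
        floats next_re{}, next_im{};
        ints next_iterations{};
        bool all_done = true;

        for (std::size_t i = 0; i < lane_count; ++i)
        {
            // a lane is done once its squared magnitude reaches 4
            done[i] = done[i] || (zre[i] * zre[i] + zim[i] * zim[i] >= 4.0f);
            // candidate values, computed for every lane regardless of `done`
            next_re[i] = zre[i] * zre[i] - zim[i] * zim[i] + cre[i];
            next_im[i] = 2.0f * zre[i] * zim[i] + cim[i];
            next_iterations[i] = iterations[i] + 1;
            all_done = all_done && done[i];
        }

        // finished lanes keep their old state, the others advance
        zre = blend(done, zre, next_re);
        zim = blend(done, zim, next_im);
        iterations = blend(done, iterations, next_iterations);

        if (all_done)
            break;
    }
    return iterations;
}
```

The early exit is taken only when all lanes are done, which mirrors the `is_set(done)` check in the kernel. Lanes that finish early are simply masked out rather than skipped.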
The so-called entrypoint is the low-level interface between the main application and the vectorized implementations. Kernels are created and destroyed through this interface.
Here you declare your available kernel 'constructors' and 'destructors'.
The convention is `{kernel_name}_create_instance()` and `{kernel_name}_delete_instance(entrypoint *)`.
#include "{your_application_name}_arch_export.hpp"
#include "system/entrypoint.hpp"
extern "C"
{
    {your_application_name}_ARCH_EXPORT zacc::system::entrypoint *mandelbrot_create_instance();
    {your_application_name}_ARCH_EXPORT void mandelbrot_delete_instance(zacc::system::entrypoint *instance);
}
Here you implement your available kernel 'constructors' and 'destructors'. Usually, simply instantiating and deleting a kernel is sufficient, but more complex logic can be introduced.
#include "entrypoint.hpp"
#include "system/arch.hpp"
#include "kernels/mandelbrot.hpp"
// create mandelbrot kernel instance
zacc::system::entrypoint *mandelbrot_create_instance()
{
    return new zacc::examples::mandelbrot_kernel<zacc::arch::types>();
}
// destroy mandelbrot kernel instance
void mandelbrot_delete_instance(zacc::system::entrypoint* instance)
{
    if(instance != nullptr)
        delete instance;
}
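The create/delete pairing shown above can be tried out in isolation. The following sketch assumes a hypothetical base type `entrypoint_base` in place of `zacc::system::entrypoint` and a hypothetical `demo_kernel`; it also shows one possible way to wrap the pair in a `std::unique_ptr`, which is not something the source prescribes.

```cpp
#include <memory>
#include <string>

// Hypothetical base type standing in for zacc::system::entrypoint.
struct entrypoint_base
{
    // virtual destructor so deletion through the base pointer is safe
    virtual ~entrypoint_base() = default;
    virtual std::string name() const = 0;
};

// Hypothetical kernel hidden behind the base type.
struct demo_kernel : entrypoint_base
{
    std::string name() const override { return "demo"; }
};

// C linkage keeps the symbol names unmangled, following the
// {kernel_name}_create_instance / {kernel_name}_delete_instance convention.
extern "C"
{
    entrypoint_base *demo_create_instance() { return new demo_kernel(); }

    void demo_delete_instance(entrypoint_base *instance) { delete instance; }
}

// Owning handle that always releases the kernel via its matching deleter.
using kernel_handle = std::unique_ptr<entrypoint_base, void (*)(entrypoint_base *)>;

inline kernel_handle make_demo_kernel()
{
    return kernel_handle(demo_create_instance(), &demo_delete_instance);
}
```

Routing destruction through the matching `_delete_instance` function means the object is freed by the same module that allocated it, which matters when kernels live in separate shared libraries.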
Here you need to create a dispatcher for your kernel and configure / invoke the kernel. The kernel invocation happens inside the dispatcher, which acts as a proxy. The dispatcher offers the following methods:
- `dispatch_some(...)` - dispatch on all available architectures (e.g. kernel configuration)
- `dispatch_one(...)` - dispatch on the best available architecture (e.g. kernel execution)
#include <vector>
#include "../interfaces/mandelbrot.hpp"
#include "system/kernel_dispatcher.hpp"
#include "math/matrix.hpp"
using namespace zacc;
using namespace math;
// mandelbrot config:
vec2<int> dimensions = {2048, 2048};
vec2<float> cmin = {-2, -2};
vec2<float> cmax = { 2, 2 };
size_t max_iterations = 2048;
// get kernel dispatcher
auto dispatcher = system::make_dispatcher<mandelbrot>();
// configure kernel
dispatcher.dispatch_some(dimensions, cmin, cmax, max_iterations);
// prepare output
std::vector<int> result(dimensions.x * dimensions.y);
// run
dispatcher.dispatch_one(result);
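The difference between dispatching on all available architectures and on the best one can be sketched with a toy dispatcher. Everything here is an assumption made for illustration: `variant`, `toy_dispatcher` and `run_on_best` are hypothetical, the `supported` flag is hard-coded instead of coming from CPU detection, and the real dispatcher forwards kernel arguments rather than taking callables.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of one compiled variant of a kernel.
struct variant
{
    std::string name;   // e.g. "scalar", "sse2", "avx2"
    int rank;           // higher means more capable
    bool supported;     // would normally come from CPU feature detection
    int config;         // last configuration value received
};

// Toy dispatcher (not ZACC's): holds all variants and picks among them.
class toy_dispatcher
{
public:
    explicit toy_dispatcher(std::vector<variant> variants)
        : _variants(std::move(variants)) {}

    // Apply the action to every supported variant (e.g. configuration).
    template <typename Action>
    void dispatch_some(Action action)
    {
        for (auto &v : _variants)
            if (v.supported)
                action(v);
    }

    // Apply the action to the highest-ranked supported variant only
    // (e.g. execution). Returns false if nothing is supported.
    template <typename Action>
    bool dispatch_one(Action action)
    {
        variant *best = nullptr;
        for (auto &v : _variants)
            if (v.supported && (best == nullptr || v.rank > best->rank))
                best = &v;
        if (best == nullptr)
            return false;
        action(*best);
        return true;
    }

private:
    std::vector<variant> _variants;
};

// Configure all variants, then run on the best one and report which ran.
inline std::string run_on_best(int max_iterations)
{
    std::vector<variant> variants = {
        {"scalar", 0, true, 0},
        {"sse2", 1, true, 0},
        {"avx2", 2, false, 0},   // pretend this CPU lacks AVX2
    };
    toy_dispatcher dispatcher(variants);

    dispatcher.dispatch_some([=](variant &v) { v.config = max_iterations; });

    std::string used;
    dispatcher.dispatch_one([&](variant &v) {
        used = v.name + ":" + std::to_string(v.config);
    });
    return used;
}
```

Configuring every supported variant up front keeps them interchangeable, so the choice of the best one can be deferred until execution.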
...
# add zacc targets
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/dependencies/zacc)
# use zacc build system
include(${CMAKE_CURRENT_SOURCE_DIR}/dependencies/zacc/cmake/zacc.shared.cmake)
# add include lookup directories
include_directories(
    ${CMAKE_CURRENT_SOURCE_DIR}/dependencies/zacc/include
)
Defines a shared/dynamic library with dispatcher and kernel implementations in additional libraries.
# your shared library which aggregates the branches
zacc_add_dispatched_library(_your_library_
    # your library entrypoint
    ENTRYPOINT ${CMAKE_SOURCE_DIR}/_your_library_entrypoint.cpp
    # additional includes
    INCLUDES ${CMAKE_SOURCE_DIR}/include ${CMAKE_SOURCE_DIR}/dependencies/zacc/include
    # branches to build for
    BRANCHES "${branches}"
    # your main library source
    SOURCES
        ${CMAKE_SOURCE_DIR}/_your_library_.cpp
)
Defines a main application with dispatcher and kernel implementations in additional libraries.
zacc_add_dispatched_executable(_your_application_
    # branches to build for
    BRANCHES "${branches}"
    # additional includes
    INCLUDES
        ${PROJECT_SOURCE_DIR}/include
    # your kernel entrypoint
    ENTRYPOINT
        ${PROJECT_SOURCE_DIR}/_your_application_entrypoint.cpp
    # your main application sources
    SOURCES
        ${PROJECT_SOURCE_DIR}/_your_application_.cpp
)
Defines unit test targets using GoogleTest.
# unit testing your implementation on all branches
# find the test main (you may provide your own implementation)
file(GLOB ZACC_TEST_MAIN "${PROJECT_SOURCE_DIR}/*/zacc/*/test_main.cpp")
# find the test entry point (you may provide your own implementation)
file(GLOB ZACC_TEST_ENTRYPOINT "${PROJECT_SOURCE_DIR}/*/zacc/*/test_entry_point.cpp")
zacc_add_dispatched_tests(_your_tests_
    # test main. used to skip the tests if the processing unit is not
    # capable of running a particular featureset
    TEST_MAIN ${ZACC_TEST_MAIN}
    # gtest main
    TEST_ENTRYPOINT ${ZACC_TEST_ENTRYPOINT}
    # branches to build for
    BRANCHES "${branches}"
    # additional include directories
    INCLUDES ${CMAKE_SOURCE_DIR}/include
    # your test sources
    SOURCES
        ${_your_test_files_here}
)
- In development!
- Used in cacophony - a coherent noise library
Processor | Highest featureset |
---|---|
AMD FX-8350 | AVX1 |
Intel Core i7 6500U | AVX2 + FMA |
Intel Core i7 7700K | AVX2 + FMA |
Intel Xeon E5-2697 v3 | AVX2 + FMA |
Intel Xeon E5-2680 v3 | AVX2 + FMA |
Intel Xeon E5-2680 v2 | AVX1 |
Intel Xeon X5570 | SSE4.1 |
- Mac OS X Sierra / High Sierra
- Linux
- Windows 10
Featureset | State | Notes |
---|---|---|
x87 FPU | ✅ | scalar |
SSE2 | ✅ | |
SSE3 | ✅ | |
SSE3 + SSSE3 | ✅ | |
SSE4.1 | ✅ | |
SSE4.1 + FMA3 | ✅ | |
SSE4.1 + FMA4 | ✅ | |
AVX1 | ⛔ | Integer vector emulation faulty. |
AVX1 + FMA3 | ⛔ | Integer vector emulation faulty. |
AVX2 | ✅ | |
AVX512 | ⛔ | In development, can't be tested yet* |
ARM NEON | ⛔ | Not implemented yet |
GPGPU | ⛔ | Not implemented yet** |
FPGA | ⛔ | Not implemented yet*** |
*For AVX512, access to a Xeon Phi accelerator or a modern Xeon CPU is necessary.
**Some work has already been done on the OpenCL implementation. Some macros or C++ code postprocessing may be introduced.
***Same initial issues as for the GPGPU feature; code generation is a separate topic.
Compiler | State | Notes |
---|---|---|
GCC 5 | ✅ | |
GCC 6 | ✅ | |
GCC 7 | ✅ | |
Clang 3.9 | ⛔ | Not compilable |
Clang 4.0 | ✅ | |
LLVM version 8.1.0 | ⛔ | Not compilable |
LLVM version 9.0.0 | ✅ | |
Clang-cl | ✅ | |
MSVC | ⛔ | Not supported* |
*MSVC is not supported because ZACC requires fine-grained compile options and MSVC's C++ implementation is non-conforming. Clang-cl, which is binary compatible with MSVC, is used instead (work in progress).
C++ scalar type | ZACC vector type | State | Notes |
---|---|---|---|
signed int8 | zint8, zbyte | ✅ | Partially emulated. |
signed int16 | zint16, zshort | ✅ | |
signed int32 | zint32, zint | ✅ | |
signed int64 | zint64, zlong | ⛔ | Not implemented yet |
float16 | zfloat16 | ⛔ | Not implemented yet |
float32 | zfloat, zfloat32 | ✅ | |
float64 | zdouble, zfloat64 | ✅ | |
The library is licensed under the MIT License:
Copyright © 2015-2018 Sergej Zuyev
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
To compile and run the tests, execute:
$ make zacc.tests.all
$ ctest
--------------------------------------------------------------------
Start 1: ci.zacc.tests.scalar
1/8 Test #1: ci.zacc.tests.scalar ............. Passed 0.01 sec
Start 2: ci.zacc.tests.sse.sse2
2/8 Test #2: ci.zacc.tests.sse.sse2 ........... Passed 0.01 sec
Start 3: ci.zacc.tests.sse.sse3
3/8 Test #3: ci.zacc.tests.sse.sse3 ........... Passed 0.01 sec
Start 4: ci.zacc.tests.sse.sse41
4/8 Test #4: ci.zacc.tests.sse.sse41 .......... Passed 0.01 sec
Start 5: ci.zacc.tests.sse.sse41.fma3
5/8 Test #5: ci.zacc.tests.sse.sse41.fma3 ..... Passed 0.01 sec
Start 6: ci.zacc.tests.sse.sse41.fma4
6/8 Test #6: ci.zacc.tests.sse.sse41.fma4 ..... Passed 0.00 sec
Start 7: ci.zacc.tests.avx
7/8 Test #7: ci.zacc.tests.avx ................ Passed 0.01 sec
Start 8: ci.zacc.tests.avx2
8/8 Test #8: ci.zacc.tests.avx2 ............... Passed 0.01 sec
100% tests passed, 0 tests failed out of 8
Total Test time (real) = 0.11 sec