BLIS is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries. The framework was designed to isolate essential kernels of computation that, when optimized, immediately enable optimized implementations of most of its commonly used and computationally intensive operations. BLIS is written in ISO C99 and available under a new/modified/3-clause BSD license. While BLIS exports a new BLAS-like API, it also includes a BLAS compatibility layer which gives application developers access to BLIS implementations via traditional BLAS routine calls.
For a thorough presentation of our framework, please read our recently accepted journal article, "BLIS: A Framework for Rapidly Instantiating BLAS Functionality". For those who just want an executive summary, please see the next section.
In a follow-up article, "The BLIS Framework: Experiments in Portability", we investigate using BLIS to instantiate level-3 BLAS implementations on a variety of general-purpose, low-power, and multicore architectures.
An IPDPS'14 conference paper titled "Anatomy of High-Performance Many-Threaded Matrix Multiplication" systematically explores the opportunities for parallelism within the five loops that BLIS exposes in its matrix multiplication algorithm.
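For orientation, the five loops mentioned above can be sketched in plain C. This is a minimal illustration, not BLIS code: the names `gemm_five_loops` and `micro_kernel`, the blocksize constants `NC`, `KC`, `MC`, `NR`, `MR` and their values are assumptions chosen for readability. Column-major storage is assumed, and the packing of A and B into contiguous buffers is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative blocksizes (assumed values, not tuned for any architecture). */
enum { NC = 64, KC = 32, MC = 16, NR = 4, MR = 4 };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Stand-in for an optimized micro-kernel: C(mr x nr) += A(mr x kc) * B(kc x nr).
   All matrices are column-major with leading dimensions lda, ldb, ldc. */
static void micro_kernel(size_t mr, size_t nr, size_t kc,
                         const double *a, size_t lda,
                         const double *b, size_t ldb,
                         double *c, size_t ldc)
{
    for (size_t p = 0; p < kc; ++p)
        for (size_t j = 0; j < nr; ++j)
            for (size_t i = 0; i < mr; ++i)
                c[i + j * ldc] += a[i + p * lda] * b[p + j * ldb];
}

/* C := C + A * B, where A is m x k, B is k x n, and C is m x n. */
void gemm_five_loops(size_t m, size_t n, size_t k,
                     const double *a, size_t lda,
                     const double *b, size_t ldb,
                     double *c, size_t ldc)
{
    assert(a && b && c);
    for (size_t jc = 0; jc < n; jc += NC) {                  /* 5th loop: n in steps of NC  */
        size_t nc = min_sz(NC, n - jc);
        for (size_t pc = 0; pc < k; pc += KC) {              /* 4th loop: k in steps of KC  */
            size_t kc = min_sz(KC, k - pc);
            for (size_t ic = 0; ic < m; ic += MC) {          /* 3rd loop: m in steps of MC  */
                size_t mc = min_sz(MC, m - ic);
                for (size_t jr = 0; jr < nc; jr += NR) {     /* 2nd loop: nc in steps of NR */
                    size_t nr = min_sz(NR, nc - jr);
                    for (size_t ir = 0; ir < mc; ir += MR) { /* 1st loop: mc in steps of MR */
                        size_t mr = min_sz(MR, mc - ir);
                        micro_kernel(mr, nr, kc,
                                     a + (ic + ir) + pc * lda, lda,
                                     b + pc + (jc + jr) * ldb, ldb,
                                     c + (ic + ir) + (jc + jr) * ldc, ldc);
                    }
                }
            }
        }
    }
}
```

In this sketch every loop except the one over the k dimension iterates over independent regions of C. The loop over k accumulates into the same block of C, so parallelizing it would require extra synchronization or a reduction.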
We believe that BLIS offers substantial benefits in productivity when compared to conventional approaches to developing BLAS libraries, as well as a much-needed refinement of the BLAS interface, and thus constitutes a major advance in dense linear algebra computation. While BLIS remains a work in progress, we are excited to continue its development and further cultivate its use within the community.
BLIS offers several advantages over traditional BLAS libraries:
- Portability that doesn't impede high performance. Portability was a top priority of ours when creating BLIS. With zero additional effort on the part of the developer, BLIS is configurable as a fully-functional reference implementation. But more importantly, the framework identifies and isolates a key set of computational kernels which, when optimized, immediately and automatically optimize performance across virtually all level-2 and level-3 BLIS operations. In this way, the framework acts as a productivity multiplier. And since the optimized (non-portable) code is compartmentalized within these few kernels, instantiating a high-performance BLIS library on a new architecture is a relatively straightforward endeavor.
- Generalized matrix storage. The BLIS framework exports interfaces that allow one to specify both the row stride and column stride of a matrix. This allows one to compute with matrices stored in column-major order, row-major order, or by general stride. (This latter storage format is important for those seeking to implement tensor contractions on multidimensional arrays.) Furthermore, since BLIS tracks stride information for each matrix, operands of different storage formats can be used within the same operation invocation. By contrast, BLAS requires column-major storage. And while the CBLAS interface supports row-major storage, it does not allow mixing storage formats.
- Full support for the complex domain. BLIS operations are developed and expressed in their most general form, which is typically in the complex domain. These formulations then simplify elegantly down to the real domain, with conjugations becoming no-ops. Unlike the BLAS, all input operands in BLIS that allow transposition and conjugate-transposition also support conjugation (without transposition), which obviates the need for thread-unsafe workarounds. Also, where applicable, both complex symmetric and complex Hermitian forms are supported. (BLAS omits some complex symmetric operations, such as symv, syr, and syr2.)
- Advanced multithreading support. BLIS allows multiple levels of symmetric multithreading for nearly all level-3 operations. (Currently, users may choose to obtain parallelism via either OpenMP or POSIX threads.) This means that matrices may be partitioned in multiple dimensions simultaneously to attain scalable, high-performance parallelism on multicore and many-core architectures. The key to this innovation is a thread-specific control tree infrastructure which encodes information about the logical thread topology and allows threads to query and communicate data amongst one another. BLIS also employs so-called "quadratic partitioning" when computing dimension sub-ranges for each thread, so that arbitrary diagonal offsets of structured matrices with unreferenced regions are taken into account to achieve proper load balance.
- Ease of use. The BLIS framework, and the library of routines it generates, are easy to use for end users, experts, and vendors alike. An optional BLAS compatibility layer provides application developers with backwards compatibility to existing BLAS-dependent codes. Or, one may adjust or write their application to take advantage of new BLIS functionality (such as generalized storage formats or additional complex operations) by calling BLIS directly. BLIS's interfaces will feel familiar to many veterans of BLAS since BLIS exports APIs with BLAS-like calling sequences. And experts will find BLIS's internal object-based APIs a delight to use when customizing or writing their own BLIS operations. (Objects are relatively lightweight structs and passed by address, which helps tame function calling overhead.)
- Multilayered API and exposed kernels. The BLIS framework exposes its implementations in various layers, allowing expert developers to access exactly the functionality desired. This layered interface includes that of the lowest-level kernels, for those who wish to bypass the bulk of the framework. Optimizations can occur at various levels, in part thanks to exposed packing and unpacking facilities, which by default are highly parameterized and flexible.
- Functionality that grows with the community's needs. As its name suggests, the BLIS framework is not a single library or static API, but rather a nearly-complete template for instantiating high-performance BLAS-like libraries. Furthermore, the framework is extensible, allowing developers to leverage existing components to support new operations as they are identified. If such operations require new kernels for optimal efficiency, the framework and its APIs will be adjusted and extended accordingly.
- Code re-use. Auto-generation approaches to achieving the aforementioned goals tend to quickly lead to code bloat due to the multiple dimensions of variation supported: operation (e.g., gemm, herk, trmm); parameter case (e.g., side, [conjugate-]transposition, upper/lower storage, unit/non-unit diagonal); datatype (i.e., single-/double-precision real/complex); matrix storage (i.e., row-major, column-major, generalized); and algorithm (i.e., partitioning path and kernel shape). These "brute force" approaches often consider and optimize each operation or case combination in isolation, which is less than ideal when the goal is to provide entire libraries. BLIS was designed to be a complete framework for implementing basic linear algebra operations, but supporting this vast amount of functionality in a manageable way required a holistic design that employed careful abstractions, layering, and recycling of generic (highly parameterized) codes, subject to the constraint that high performance remain attainable.
- A foundation for mixed domain and/or mixed precision operations. BLIS was designed with the hope of one day allowing computation on real and complex operands within the same operation. Similarly, we wanted to allow mixing operands' floating-point precisions, or both domain and precision. Unfortunately, this feature results in a significant amount of additional code, mostly in level-2 and lower operations; thus, it is disabled by default. However, mixing domains in level-3 operations is possible, in theory, with almost no additional effort on the part of the library developer, and such operations would remain capable of high performance. (Please note that this functionality is still highly experimental and should be thought of as a feature that will be more thoroughly implemented at some future date.)
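To make the "generalized matrix storage" item above concrete, here is a minimal sketch of how a row stride and a column stride together describe column-major, row-major, and general-stride layouts. It also shows operands of different formats mixed in one operation. The type `matview_t` and the functions `view_col_major`, `view_row_major`, `mat_at`, and `ref_gemm_strided` are hypothetical names invented for this illustration. They are not part of the BLIS API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical matrix view (not a BLIS type): element (i, j) is stored at
   buf[i * rs + j * cs]. */
typedef struct {
    double *buf;
    size_t  m, n;   /* number of rows and columns */
    size_t  rs, cs; /* row stride and column stride */
} matview_t;

/* Column-major storage: unit row stride, column stride equal to m. */
matview_t view_col_major(double *buf, size_t m, size_t n)
{
    matview_t v = { buf, m, n, 1, m };
    return v;
}

/* Row-major storage: row stride equal to n, unit column stride. */
matview_t view_row_major(double *buf, size_t m, size_t n)
{
    matview_t v = { buf, m, n, n, 1 };
    return v;
}

/* Address of element (i, j) under whatever strides the view carries. */
double *mat_at(matview_t v, size_t i, size_t j)
{
    assert(i < v.m && j < v.n);
    return &v.buf[i * v.rs + j * v.cs];
}

/* Reference C := C + A * B. Each operand may use a different storage format,
   because all element accesses go through the strides. */
void ref_gemm_strided(matview_t a, matview_t b, matview_t c)
{
    assert(a.m == c.m && b.n == c.n && a.n == b.m);
    for (size_t i = 0; i < c.m; ++i) {
        for (size_t j = 0; j < c.n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < a.n; ++p)
                sum += *mat_at(a, i, p) * *mat_at(b, p, j);
            *mat_at(c, i, j) += sum;
        }
    }
}
```

A general-stride view is obtained by filling in `rs` and `cs` directly. For example, `rs = 2, cs = 4` addresses every other element of a buffer as a 2 x 2 matrix.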
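The "full support for the complex domain" item above mentions conjugation without transposition. The following sketch shows the idea on a vector update, y := y + alpha * conj?(x), where conjugation is requested through a parameter so the input is never modified. It is a sketch only: `zcplx_t`, `conj_flag_t`, `zaxpy_conj`, and `daxpy_conj` are hypothetical names, not BLIS interfaces. The reading that a workaround would conjugate a shared input in place, which is unsafe when several threads read it, is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical double-precision complex type and conjugation flag. */
typedef struct { double real, imag; } zcplx_t;
typedef enum { NO_CONJUGATE, CONJUGATE } conj_flag_t;

/* Returns x or its conjugate; the caller's data is left untouched. */
static zcplx_t apply_conj(conj_flag_t flag, zcplx_t x)
{
    if (flag == CONJUGATE)
        x.imag = -x.imag;
    return x;
}

/* Complex case: y := y + alpha * conjx(x). */
void zaxpy_conj(conj_flag_t conjx, size_t n, zcplx_t alpha,
                const zcplx_t *x, size_t incx, zcplx_t *y, size_t incy)
{
    assert(n == 0 || (x && y));
    for (size_t i = 0; i < n; ++i) {
        zcplx_t xi = apply_conj(conjx, x[i * incx]);
        y[i * incy].real += alpha.real * xi.real - alpha.imag * xi.imag;
        y[i * incy].imag += alpha.real * xi.imag + alpha.imag * xi.real;
    }
}

/* Real case: the same formulation, where the conjugation flag is a no-op. */
void daxpy_conj(conj_flag_t conjx, size_t n, double alpha,
                const double *x, size_t incx, double *y, size_t incy)
{
    (void)conjx; /* conjugating a real number changes nothing */
    assert(n == 0 || (x && y));
    for (size_t i = 0; i < n; ++i)
        y[i * incy] += alpha * x[i * incx];
}
```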
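The load-balancing goal behind the "quadratic partitioning" mentioned in the multithreading item can be illustrated with a small sketch. It assumes an n x n lower-triangular matrix whose column j holds n - j stored elements. Columns are divided among threads so that each thread receives roughly the same number of stored elements, not the same number of columns. The function names are hypothetical. Boundaries are found here by accumulating column sizes instead of solving the quadratic in closed form, which may differ from how BLIS computes them.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest column count c such that columns [0, c) of an n x n
   lower-triangular matrix hold at least `target` stored elements. */
static size_t area_boundary(size_t n, size_t target)
{
    size_t c = 0, area = 0;
    while (c < n && area < target) {
        area += n - c; /* column c stores n - c elements */
        ++c;
    }
    return c;
}

/* Computes the half-open column range [*start, *end) assigned to thread t
   out of nt threads, balancing stored elements across threads. */
void tri_col_range(size_t n, size_t nt, size_t t, size_t *start, size_t *end)
{
    assert(nt > 0 && t < nt && start && end);
    size_t total = n * (n + 1) / 2; /* stored elements in the triangle */
    *start = area_boundary(n, total * t / nt);
    *end   = area_boundary(n, total * (t + 1) / nt);
}
```

For n = 8 and two threads, this sketch assigns columns [0, 3) (21 elements) to the first thread and [3, 8) (15 elements) to the second. An even split by column count would give 26 and 10 elements.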
If you just want to browse a quick-reference guide on user-level BLIS interfaces, please read the BLIS API quick reference. There you will find a brief description of each operation as well as some more general information needed when developing an application with BLIS.
Have a quick question? You may find the answer in our list of frequently asked questions.
Does BLIS contain kernels optimized for your favorite architecture? Please see our Hardware Support wiki for a full list of optimized kernels.
We also provide wikis on the following topics, which will likely be of interest to many users and developers:
- Build system. This wiki provides step-by-step instructions for building a BLIS library. (Reminder: While BLIS supports configure-time hardware detection for certain architectures, you may need to manually specify a configuration to use.)
- Configuration. This wiki describes how to create a BLIS "configuration", which captures all of the details necessary to build BLIS for a specific hardware architecture. Configurations specify things like blocksizes, kernel names, and various optional configuration settings.
- Kernels. This wiki describes each of the BLIS kernel operations in detail and should provide developers with most of the information needed to get started with writing and optimizing their own kernels.
- Test suite. This wiki contains detailed instructions on running the BLIS test suite, located in the top-level directory testsuite.
You can keep in touch with developers and other users of the project by joining one of the following mailing lists:
- blis-discuss: Please join and post to this mailing list if you have general questions or feedback regarding BLIS. Application developers (end users) should probably post here, unless they have bug reports, in which case they should post to blis-devel.
- blis-devel: Please join and post to this mailing list if you are a BLIS developer (i.e., you are trying to use BLIS to create libraries, you want to write kernels for the framework, or you are trying to modify or extend the framework itself). Also, if you would like to submit a bug report, or discuss a possible bug, please use this list. Note: Most of the interesting discussions happen here; don't be afraid to join!
For those of you looking for the appropriate article to cite regarding BLIS, we recommend citing our first ACM TOMS journal paper:
@article{BLIS1,
author = {Field G. {V}an~{Z}ee and Robert A. {v}an~{d}e~{G}eijn},
title = {{BLIS}: A Framework for Rapidly Instantiating {BLAS} Functionality},
journal = {ACM Transactions on Mathematical Software},
volume = {41},
number = {3},
pages = {14:1--14:33},
year = {2015},
issue_date = {June 2015},
url = {http://doi.acm.org/10.1145/2764454},
}
You may also cite the second ACM TOMS journal paper:
@article{BLIS2,
author = {Field G. {V}an~{Z}ee and Tyler Smith and Francisco D. Igual and
Mikhail Smelyanskiy and Xianyi Zhang and Michael Kistler and Vernon Austel and
John Gunnels and Tze Meng Low and Bryan Marker and Lee Killough and
Robert A. {v}an~{d}e~{G}eijn},
title = {The {BLIS} Framework: Experiments in Portability},
journal = {ACM Transactions on Mathematical Software},
year = 2015,
note = {Accepted},
}
We also have a third paper, submitted to IPDPS 2014, on achieving multithreaded parallelism in BLIS:
@inproceedings{BLIS3,
author = {Tyler M. Smith and Robert A. {v}an~{d}e~{G}eijn and Mikhail Smelyanskiy and
Jeff R. Hammond and Field G. {V}an~{Z}ee},
title = {Anatomy of High-Performance Many-Threaded Matrix Multiplication},
booktitle = {28th IEEE International Parallel \& Distributed Processing Symposium
(IPDPS 2014)},
year = 2014,
}
A fourth paper, submitted to ACM TOMS, also exists, which proposes an analytical model for determining blocksize parameters in BLIS:
@article{BLIS4,
author = {Tze Meng Low and Francisco D. Igual and Tyler M. Smith and Enrique S. Quintana-Ort\'{\i}},
title = {Analytical Models for the {BLIS} Framework},
journal = {ACM Transactions on Mathematical Software},
year = 2015,
note = {Pending},
}
This project and its associated research were partially sponsored by grants from Microsoft, Intel, Texas Instruments, and AMD, as well as grants from the National Science Foundation (Awards CCF-0917167, ACI-1148125/1340293, and CCF-1320112).
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).