v3.0.2012
PaRSEC 20.12 (December 2020)
 - PaRSEC API 3.0
 - PaRSEC now requires CMake 3.16.
 - New configure system to ease the installation of PaRSEC. See
   INSTALL for details. This system automates installation on most DOE
   leadership systems.
 - Split DPLASMA and PaRSEC into separate repositories. PaRSEC moves from
   cmake-2.0 to cmake-3.12 conventions, using targets. Targets are exported
   for third-party integration.
 - Add visualization tools to extract user-defined properties from the
   application (see: PR 229 visualization-tools).
 - Automate the expression of the host-to-device and device-to-host data
   transfers required to satisfy dependencies (and anti-dependencies). PaRSEC
   tracks multiple versions of the same data as data copies with a coherency
   algorithm that initiates data transfers as needed. The heuristic for the
   eviction policy on out-of-memory events on the GPU has been optimized to
   allow for efficient operation on problems larger than GPU memory.
 - Add support for MPI out-of-order matching capabilities; add the capability
   for compute threads to send direct control messages signaling task completion
   to remote nodes (without delegating to the communication thread).
 - Remove the EAGER communication mode from the runtime. It had a rare but
   hard-to-fix bug that could cause deadlocks, and the performance benefit
   was small.
 - Add a Map operator on the block-cyclic matrix data collection that
   performs in-place data transformation on the collection with a
   user-provided operator.
 - Add support in the runtime for user-defined properties evaluated at
   runtime and easy to export through a shared memory region (see: PR
   229 visualization-tools).
 - Add a PAPI-SDE interface to the parsec library, to expose internal
   counters via the PAPI Software Defined Events (SDE) interface. A reader
   sketch follows this list.
 - Add backend support for OTF2 in the profiling mechanism. OTF2 is
   used automatically if an OTF2 installation is found.
 - Add an MCA parameter to control the number of ejected blocks from GPU
   memory (device_cuda_max_number_of_ejected_data). Add an MCA parameter
   to control whether or not the GPU engine will take some time to sort
   the first N tasks of the pending queue (device_cuda_sort_pending_list).
   An example of setting MCA parameters follows this list.
 - Reshape the users' view of PaRSEC: they only need to include a single
   header (parsec.h) for most usages, and link with a single library
   (-lparsec). A minimal example follows this list.
 - Update the PaRSEC DSL handling of initial tasks. We now rely on two
   pieces of information: the number of DSL tasks, and the number of
   tasks imposed by the system (all types of data transfer).
 - Add a purely local scheduler (ll) that uses a single LIFO per thread.
   Each schedule operation does 1 atomic operation (push into the local
   queue); each select operation does up to t atomic operations (pop from
   the local queue, then try every other thread's queue until all have been
   found empty).
 - Add a --ignore-properties=... option to parsec_ptgpp.
 - Change the API of hash tables: allow keys of arbitrary size. The API
   specifies how to build a key from a task, how to hash a key into
   1 <= N <= 64 bits, and how to compare two keys (plus a printing
   function for debugging). A sketch of the key functions follows this list.
 - Change the behavior of DEBUG_HISTORY: log all information inside a
   per-thread buffer of fixed size (MCA parameter), do not allocate
   memory during logging, and use timestamps to re-order the output
   when the user calls dump().
 - Update the DTD interface (new flag to send a pointer as a parameter,
   simpler unpacking of parameters, etc.).
 - DTD provides an MCA parameter (dtd_debug_verbose) to print information
   about the traversal of the DAG in a separate output stream from the
   default one.
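
The following is a minimal sketch of reading one of the new PAPI-SDE counters
from an application, using only standard PAPI calls. The event name
"sde:::PARSEC::SCHEDULED_TASKS" is a placeholder for illustration; the
counters actually exported by a given build can be listed with the
papi_native_avail utility.

    /* Sketch: read a PaRSEC software-defined event through standard PAPI.
     * The event name is a placeholder; use papi_native_avail to list the
     * SDE counters actually exported by your build. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        int evset = PAPI_NULL;
        long long value;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI initialization failed\n");
            return EXIT_FAILURE;
        }
        if (PAPI_create_eventset(&evset) != PAPI_OK)
            return EXIT_FAILURE;
        /* hypothetical event name, for illustration only */
        if (PAPI_add_named_event(evset, "sde:::PARSEC::SCHEDULED_TASKS") != PAPI_OK) {
            fprintf(stderr, "SDE counter not available in this build\n");
            return EXIT_FAILURE;
        }
        PAPI_start(evset);
        /* ... run some PaRSEC work here ... */
        PAPI_stop(evset, &value);
        printf("counter value: %lld\n", value);
        return EXIT_SUCCESS;
    }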
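
The sketch below shows one way the new CUDA device MCA parameters could be
set before initializing PaRSEC. The PARSEC_MCA_<name> environment-variable
convention and the values used here are assumptions for illustration only.

    /* Sketch: set the new CUDA device MCA parameters before parsec_init().
     * The PARSEC_MCA_<name> environment-variable convention and the values
     * used here are assumptions for illustration. */
    #include <stdlib.h>
    #include <parsec.h>

    int main(int argc, char *argv[])
    {
        setenv("PARSEC_MCA_device_cuda_max_number_of_ejected_data", "4", 1);
        setenv("PARSEC_MCA_device_cuda_sort_pending_list", "1", 1);

        parsec_context_t *parsec = parsec_init(-1, &argc, &argv);  /* -1: use all cores */
        /* ... add and run taskpools ... */
        parsec_fini(&parsec);
        return 0;
    }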
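
As an illustration of the single-header usage mentioned above, here is a
minimal sketch of a program that only includes parsec.h and links against
-lparsec (include and library paths adjusted to the installation prefix).

    /* Sketch: the runtime is reachable through a single header and a
     * single library.  Build with something like:
     *     cc -I<prefix>/include hello_parsec.c -L<prefix>/lib -lparsec */
    #include <stdio.h>
    #include <parsec.h>

    int main(int argc, char *argv[])
    {
        parsec_context_t *parsec = parsec_init(-1, &argc, &argv);
        if (NULL == parsec) {
            fprintf(stderr, "parsec_init failed\n");
            return 1;
        }
        /* taskpools (PTG, DTD, ...) would be added and executed here */
        parsec_fini(&parsec);
        return 0;
    }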
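
The sketch below illustrates the three key operations the reworked hash-table
API expects from the user: comparing two keys, hashing a key into
1 <= N <= 64 bits, and printing a key for debugging. The type and function
names are illustrative only and are not the PaRSEC declarations; see
parsec/class/parsec_hash_table.h for the actual interface.

    /* Illustrative only: these names are NOT the PaRSEC declarations; they
     * mirror the three operations the new hash-table API asks the user to
     * provide (compare, hash into N bits, print). */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <inttypes.h>

    typedef struct { uint32_t m, n; } tile_key_t;  /* an arbitrary, fixed-size key */

    /* Compare two keys for equality. */
    static int tile_key_equal(const void *a, const void *b, void *user_data)
    {
        (void)user_data;
        return 0 == memcmp(a, b, sizeof(tile_key_t));
    }

    /* Hash a key into nb_bits bits, 1 <= nb_bits <= 64. */
    static uint64_t tile_key_hash(const void *k, int nb_bits, void *user_data)
    {
        const tile_key_t *key = (const tile_key_t*)k;
        uint64_t h = (((uint64_t)key->m << 32) ^ key->n) * UINT64_C(0x9E3779B97F4A7C15);
        (void)user_data;
        return (nb_bits >= 64) ? h : (h >> (64 - nb_bits));
    }

    /* Render a key into a caller-provided buffer, for debug output. */
    static char *tile_key_print(char *buf, size_t len, const void *k, void *user_data)
    {
        const tile_key_t *key = (const tile_key_t*)k;
        (void)user_data;
        snprintf(buf, len, "tile(%" PRIu32 ",%" PRIu32 ")", key->m, key->n);
        return buf;
    }

    int main(void)
    {
        char buf[32];
        tile_key_t a = { 3, 7 }, b = { 3, 7 };
        printf("%s hashes to %" PRIu64 " (16 bits); equal: %d\n",
               tile_key_print(buf, sizeof(buf), &a, NULL),
               tile_key_hash(&a, 16, NULL),
               tile_key_equal(&a, &b, NULL));
        return 0;
    }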