v3.0

@adammoody adammoody released this 16 Feb 19:26
  • Added Python bindings for the SCR library
    • Supports Python 2 and 3
    • Implemented in the scr.py module (import scr)
    • Uses the C Foreign Function Interface (CFFI) to wrap calls to libscr
    • To use the Python bindings, first install SCR, then follow the steps in python/README.md
  • Improved support for large datasets and shared access to files. Applications can now configure SCR to bypass the cache and access datasets on the global file system:
    • For datasets that are too large to fit in cache or for systems that have no cache available, SCR can use the global file system. This improves portability so that applications can use SCR on any cluster.
    • Since bypass mode is more general, it is enabled by default. To use the cache, disable bypass mode by setting SCR_CACHE_BYPASS=0.
    • For applications that write shared files, SCR can use bypass mode during calls to the SCR Checkpoint/Output API.
    • For applications that write datasets as a file-per-process but require shared access to files during restart, one can write to cache but set SCR_GLOBAL_RESTART=1. This rebuilds and flushes cached datasets during SCR_Init. It also enables bypass mode for restart so that the application can read its dataset from the global file system using the SCR Restart API.
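The cache-plus-global-restart configuration described above might look like the following in a job script. This is a minimal sketch that assumes the parameters are supplied as environment variables; the launch line is hypothetical and appears only as a comment.

```shell
# Write checkpoints to node-local cache instead of bypassing it.
# Bypass mode is on by default in v3.0, so it must be turned off explicitly.
export SCR_CACHE_BYPASS=0

# Rebuild and flush cached datasets during SCR_Init, then read them back
# from the global file system through the SCR Restart API.
export SCR_GLOBAL_RESTART=1

# Hypothetical launch line; replace with the site's launcher and application.
# srun -n 64 ./my_app
```

Leaving both variables unset keeps the default behavior, in which datasets are written directly to the global file system.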
  • Applications can now instruct SCR to load a specific checkpoint by naming it in the SCR_CURRENT parameter before calling SCR_Init.
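As an illustration of selecting a checkpoint, the sketch below assumes SCR_CURRENT is passed as an environment variable in the job script. The checkpoint name and the launch line are hypothetical.

```shell
# Hypothetical checkpoint name; use a name that SCR already knows about.
export SCR_CURRENT=ckpt.10

# The variable must be set before the application calls SCR_Init.
# Hypothetical launch line:
# srun -n 64 ./my_app
```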
  • Restart loop:
    • SCR now supports a loop around SCR_Have_restart, SCR_Start_restart, and SCR_Complete_restart. If an application detects a problem during its restart, it can pass valid=0 to SCR_Complete_restart. SCR will then load the next most recent checkpoint, which the application can query with another call to SCR_Have_restart. This process can be continued until either a checkpoint is read successfully or all checkpoints have been exhausted.
  • SCR_Need_checkpoint now returns false unless at least one of SCR_CHECKPOINT_INTERVAL, SCR_CHECKPOINT_SECONDS, or SCR_CHECKPOINT_OVERHEAD is set
  • Restored watchdog support on SLURM systems
  • New build options:
    • Added support for static-only builds with -DBUILD_SHARED_LIBS=OFF
    • Added CMake options to disable portions of the build including -DENABLE_EXAMPLES=[ON/OFF] and -DENABLE_TESTS=[ON/OFF]
    • Added support to specify the number of trailing underscores for Fortran bindings with -DENABLE_FORTRAN_TRAILING_UNDERSCORES=[AUTO/ON/OFF]
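The build options above can be combined in one configure step. The sketch below assembles a static-only configuration without examples or tests and prints the resulting command rather than running it. The variable name CMAKE_ARGS is only illustrative.

```shell
# Static-only build without examples or tests, using the options listed above.
CMAKE_ARGS="-DBUILD_SHARED_LIBS=OFF -DENABLE_EXAMPLES=OFF -DENABLE_TESTS=OFF"

# Print the configure command for review; drop the 'echo' to run it
# from a build directory one level below the SCR source tree.
echo cmake $CMAKE_ARGS ..
```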
  • New API calls:
    • SCR_Config(const char* config) sets and queries SCR configuration parameters before SCR_Init(), and queries parameters after SCR_Init().
    • SCR_Configf(const char* config, ...) is a version of SCR_Config that supports printf-style formatting.
    • SCR_Current(const char* name) lets an application that reads its checkpoint without using the SCR Restart API tell SCR which checkpoint it loaded, so that SCR can still track the proper ordering of checkpoints.
    • SCR_Delete(const char* name) asks SCR to delete a dataset.
    • SCR_Drop(const char* name) asks SCR to drop a dataset from the index without deleting the underlying data files.
  • Improved flush methods
    • Added IBM BB API (https://github.com/IBM/CAST), e.g., SCR_FLUSH_TYPE=BBAPI
    • Added pthreads, e.g., SCR_FLUSH_TYPE=PTHREAD
    • Added support for multiple outstanding asynchronous flushes
    • Initial support for scr_poststage of BBAPI transfers after completion of allocation (beta)
  • New redundancy scheme:
    • Reed-Solomon encoding (SCR_COPY_TYPE=RS) allows a configurable number of failures per group, from 1 to N-1 where N is the set size. Use SCR_SET_SIZE to specify the group size and SCR_SET_FAILURES to specify the number of failures per group.
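A Reed-Solomon configuration might be set up as follows. This is a sketch that assumes the parameters are given as environment variables, and the group size and failure count are hypothetical values chosen only to satisfy the 1 to N-1 constraint stated above.

```shell
# Hypothetical values: groups of 8 processes, each tolerating 2 failures.
export SCR_COPY_TYPE=RS
export SCR_SET_SIZE=8
export SCR_SET_FAILURES=2

# Sanity check of the stated constraint: 1 <= failures <= set size - 1.
if [ "$SCR_SET_FAILURES" -ge 1 ] && [ "$SCR_SET_FAILURES" -lt "$SCR_SET_SIZE" ]; then
  echo "RS settings are within range"
else
  echo "SCR_SET_FAILURES must be between 1 and SCR_SET_SIZE-1" >&2
fi
```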
  • SCR configuration parameters now support interpolation of environment variables in configuration files, e.g.,
    >>: cat .scrconf
    SCR_CACHE_BASE=$BBPATH
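When such a file is generated from a shell script, the variable reference has to reach the file unexpanded so that SCR can interpolate it later. One way to do that is a here-document with a quoted delimiter. The sketch below writes to a scratch directory, and the name workdir is only illustrative.

```shell
# Work in a scratch directory so the sketch does not overwrite a real .scrconf.
workdir=$(mktemp -d)

# The quoted 'EOF' delimiter stops the shell from expanding $BBPATH,
# so the file keeps the literal reference for SCR to interpolate.
cat > "$workdir/.scrconf" <<'EOF'
SCR_CACHE_BASE=$BBPATH
EOF

cat "$workdir/.scrconf"
```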
    
  • Default path for SCR system configuration file moved from /etc/scr.conf to <install>/etc/scr.conf
  • SCR now preserves file metadata including atime, mtime, uid, gid, and mode bits
  • New logging options:
    • text file - written to the SCR prefix directory (SCR_LOG_TXT_ENABLE=1)
    • syslog - one can configure the syslog prefix, facility, and level to be used (SCR_LOG_SYSLOG_ENABLE=1)
  • Applications can now configure SCR to maintain a sliding window of checkpoints on the parallel file system with the SCR_PREFIX_SIZE parameter. After flushing a new checkpoint, SCR deletes older checkpoints that fall outside the window.
  • Default cache and control directories have been moved from /tmp to /dev/shm on Linux systems
  • Aids for application developers integrating the SCR API:
    • A new SCR_CACHE_PURGE parameter configures SCR to delete datasets from cache in new runs
    • A new SCR_PREFIX_PURGE parameter similarly deletes datasets from the prefix directory in new runs
    • Added internal checks to warn developers about incorrect API usage
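During development, the two purge parameters can be enabled together so that each test run starts clean. The sketch below assumes they are set as environment variables and that a value of 1 enables them, following the convention of the other on/off parameters in these notes.

```shell
# Start each development run from a clean slate (assumed: 1 means enabled).
export SCR_CACHE_PURGE=1    # delete datasets left in cache by earlier runs
export SCR_PREFIX_PURGE=1   # delete datasets left in the prefix directory
```

Because SCR_PREFIX_PURGE removes datasets from the prefix directory, it is best kept out of production job scripts.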
  • Refactored code base to use ECP-VeloC components https://github.com/ecp-veloc/
    • Improves code modularity and reuse
    • Improved testing
    • New release tarball packages source for SCR and many of its components to simplify direct builds, e.g.,
      wget https://github.com/LLNL/scr/releases/download/v3.0/scr-v3.0.tgz
      tar -xzf scr-v3.0.tgz
      cd scr-v3.0
      mkdir build
      cd build
      cmake -DCMAKE_INSTALL_PREFIX=../install -DSCR_RESOURCE_MANAGER=SLURM ../
      make -j install