
Releases: LLNL/scr

v3.1.0

13 Aug 19:09
c96e29a

Full Changelog: v3.0...v3.1.0

v3.0.1

14 Jul 18:01

This release provides a few performance improvements over v3.0:

  • raises default SCR_MPI_BUF_SIZE from 128KiB to 1MiB
  • raises default SCR_FILE_BUF_SIZE from 1MiB to 32MiB
  • adds new SCR_FLUSH_ASYNC_USLEEP to configure sleep time while waiting on async flush to complete, and lowers default sleep time from 10 seconds to 1000 microseconds
  • updates to ER-v0.3.0
    • increases file buffer size from 1MiB to 32MiB
  • updates to AXL-v0.7.0
    • increases file buffer size from 1MiB to 32MiB
    • disables writing to an _AXL temporary file to avoid slow rename step on some file systems
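
The effect of the shorter SCR_FLUSH_ASYNC_USLEEP default can be illustrated with a small model. This is only a sketch: it assumes the waiter sleeps for the configured interval, checks whether the flush has finished, and repeats. The function name and the 2.5 second flush duration are hypothetical and are not taken from SCR.

```python
def detection_time_us(flush_us: int, sleep_us: int) -> int:
    """Return the time at which a sleep-then-check loop first sees the flush as done.

    Assumption: the waiter sleeps for sleep_us, checks for completion, and
    repeats. All times are in microseconds.
    """
    if flush_us < 0 or sleep_us <= 0:
        raise ValueError("flush_us must be >= 0 and sleep_us must be > 0")
    # Ceiling division gives the number of sleeps needed; at least one sleep occurs.
    polls = max(1, -(-flush_us // sleep_us))
    return polls * sleep_us


flush_us = 2_500_000  # hypothetical flush that takes 2.5 seconds

old_default = detection_time_us(flush_us, 10_000_000)  # 10 second sleep
new_default = detection_time_us(flush_us, 1_000)       # 1000 microsecond sleep

# Time spent waiting after the flush has already completed.
print(old_default - flush_us, new_default - flush_us)
```

Under this model, a long sleep interval can leave the application idle well after the flush has finished, while a short interval keeps that idle time below one interval.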

v3.0

16 Feb 19:26
  • Added Python bindings for the SCR library
    • Supports Python 2 and 3
    • Implemented in the scr.py module (import scr)
    • Uses the C Foreign Function Interface (CFFI) to wrap calls to libscr
    • To use the Python bindings, first install SCR, then follow the steps in python/README.md
  • Improved support for large datasets and shared access to files. Applications can now configure SCR to bypass the cache and access datasets on the global file system:
    • For datasets that are too large to fit in cache or for systems that have no cache available, SCR can use the global file system. This improves portability so that applications can use SCR on any cluster.
    • Since bypass mode is more general, it is enabled by default. To use cache, one must disable bypass mode by setting SCR_CACHE_BYPASS=0.
    • For applications that write shared files, SCR can use bypass mode during the SCR Checkpoint/Output API.
    • For applications that write datasets as a file-per-process but require shared access to files during restart, one can write to cache but set SCR_GLOBAL_RESTART=1. This rebuilds and flushes cached datasets during SCR_Init. It also enables bypass mode for restart so that the application can read its dataset from the global file system using the SCR Restart API.
  • Applications can now instruct SCR to load a specific checkpoint by naming it in the SCR_CURRENT parameter before calling SCR_Init.
  • Restart loop:
    • SCR now supports a loop around SCR_Have_restart, SCR_Start_restart, and SCR_Complete_restart. If an application detects a problem during its restart, it can pass valid=0 to SCR_Complete_restart. SCR will then load the next most recent checkpoint, which the application can query with another call to SCR_Have_restart. This process can be continued until either a checkpoint is read successfully or all checkpoints have been exhausted.
  • SCR_Need_checkpoint now returns false unless one has set one of SCR_CHECKPOINT_INTERVAL/SECONDS/OVERHEAD
  • Restored watchdog support on SLURM systems
  • New build options:
    • Added support for static-only builds with -DBUILD_SHARED_LIBS=OFF
    • Added CMake options to disable portions of the build including -DENABLE_EXAMPLES=[ON/OFF] and -DENABLE_TESTS=[ON/OFF]
    • Added support to specify the number of trailing underscores for Fortran bindings with -DENABLE_FORTRAN_TRAILING_UNDERSCORES=[AUTO/ON/OFF]
  • New API calls:
    • SCR_Config(const char* config) to set and query SCR configuration parameters before SCR_Init(), and query parameters after SCR_Init().
    • SCR_Configf(const char* config, ...), a version of SCR_Config that supports printf-style formatting.
    • SCR_Current(const char* name) enables an application that reads its checkpoint without using the SCR Restart API to inform SCR about which checkpoint it loaded so that SCR can still track the proper ordering of checkpoints
    • SCR_Delete(const char* name) to ask SCR to delete a dataset
    • SCR_Drop(const char* name) to ask SCR to drop a dataset from the index without deleting the underlying data files
  • Improved flush methods
    • Added IBM BB API (https://github.com/IBM/CAST), e.g., SCR_FLUSH_TYPE=BBAPI
    • Added pthreads, e.g., SCR_FLUSH_TYPE=PTHREAD
    • Added support for multiple outstanding asynchronous flushes
    • Initial support for scr_poststage of BBAPI transfers after completion of allocation (beta)
  • New redundancy scheme:
    • Reed-Solomon encoding (SCR_COPY_TYPE=RS) allows a configurable number of failures per group, from 1 to N-1 where N is the set size. Use SCR_SET_SIZE to specify the group size and SCR_SET_FAILURES to specify the number of failures per group.
  • SCR configuration parameters now support interpolation of environment variables in configuration files, e.g.,
    >>: cat .scrconf
    SCR_CACHE_BASE=$BBPATH
    
  • Default path for SCR system configuration file moved from /etc/scr.conf to <install>/etc/scr.conf
  • SCR now preserves file metadata including atime, mtime, uid, gid, and mode bits
  • New logging options:
    • text file - written to the SCR prefix directory (SCR_LOG_TXT_ENABLE=1)
    • syslog - one can configure the syslog prefix, facility, and level to be used (SCR_LOG_SYSLOG_ENABLE=1)
  • Apps can now configure SCR to maintain a sliding window of checkpoints on the parallel file system with an SCR_PREFIX_SIZE parameter. After flushing a new checkpoint, SCR will delete older checkpoints
  • Default cache and control directories have been moved from /tmp to /dev/shm on Linux systems
  • Assists for application developers when integrating the SCR API
    • A new SCR_CACHE_PURGE parameter configures SCR to delete datasets from cache in new runs
    • A new SCR_PREFIX_PURGE parameter similarly deletes datasets from the prefix directory in new runs
    • Added internal checks to warn developers about incorrect API usage
  • Refactored code base to use ECP-VeloC components https://github.com/ecp-veloc/
    • Improves code modularity and reuse
    • Improved testing
    • New release tarball packages source for SCR and many of its components to simplify direct builds, e.g.,
      wget https://github.com/LLNL/scr/releases/download/v3.0/scr-v3.0.tgz
      tar -xzf scr-v3.0.tgz
      cd scr-v3.0
      mkdir build
      cd build
      cmake -DCMAKE_INSTALL_PREFIX=../install -DSCR_RESOURCE_MANAGER=SLURM ../
      make -j install
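
The restart loop described in these notes can be sketched as follows. This is a minimal model, not the real scr.py bindings: MockSCR, its method names, and the checkpoint names are hypothetical stand-ins that only mimic the documented behavior of SCR_Have_restart, SCR_Start_restart, and SCR_Complete_restart.

```python
class MockSCR:
    """Hypothetical stand-in for SCR; it is NOT the real scr.py module."""

    def __init__(self, checkpoints):
        # Checkpoint names ordered from most recent to oldest.
        self._remaining = list(checkpoints)

    def have_restart(self):
        # Mimics SCR_Have_restart: report whether a checkpoint is loaded, and its name.
        if self._remaining:
            return True, self._remaining[0]
        return False, None

    def start_restart(self):
        # Mimics SCR_Start_restart: begin reading the currently loaded checkpoint.
        return self._remaining[0]

    def complete_restart(self, valid):
        # Mimics SCR_Complete_restart: with valid=0 the current checkpoint is
        # abandoned and the next most recent one becomes current.
        if not valid:
            self._remaining.pop(0)


def restart(scr, try_read):
    """Loop until a checkpoint is read successfully or none remain."""
    while True:
        have, name = scr.have_restart()
        if not have:
            return None  # all checkpoints exhausted
        scr.start_restart()
        ok = try_read(name)  # application-specific read; returns True or False
        scr.complete_restart(valid=ok)
        if ok:
            return name


# Hypothetical run in which the newest checkpoint turns out to be unreadable.
scr = MockSCR(["ckpt.30", "ckpt.20", "ckpt.10"])
loaded = restart(scr, lambda name: name != "ckpt.30")
print(loaded)
```

In this run the application rejects ckpt.30, so the loop falls back to ckpt.20 and stops there.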
      

v3.0rc2

13 Oct 19:54

This is the second release candidate for v3.0. This adds the following new features and bug fixes on top of items listed in v3.0rc1.

New Features:

  • Added support for multiple outstanding asynchronous flushes
  • Added support to set SCR_PREFIX through SCR_Config
  • Enable queries with SCR_Config after SCR_Init has been called
  • Changed SCR_Config behavior so that SCR assumes default values for all parameters on each run, rather than reading the app.conf file to use values set by SCR_Config in a previous run
  • SCR_Need_checkpoint now returns false unless one has set one of SCR_CHECKPOINT_INTERVAL/SECONDS/OVERHEAD
  • Added support to specify the number of trailing underscores for Fortran bindings with -DENABLE_FORTRAN_TRAILING_UNDERSCORES=[AUTO/ON/OFF]
  • Restored watchdog support on SLURM systems
  • Initial support for scr_poststage of BBAPI transfers after completion of allocation (beta)
  • Added support for static-only builds with -DBUILD_SHARED_LIBS=OFF
  • Added CMake options to disable portions of the build including -DENABLE_EXAMPLES=[ON/OFF] and -DENABLE_TESTS=[ON/OFF]
  • Release tarball scr-top has been refactored to merge SCR and its immediate dependencies into a single library (libscr) for a faster build and a simplified link step

Bug fixes:

  • Auto define store descriptors for default cache and control directories
  • Use proper cache directories during scavenge when control directory and cache directory are different
  • Update SCR_FLUSH_ASYNC_TYPE=PTHREAD to allow asynchronous flush
  • Enable use of = characters in SCR_Config values
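
Allowing = characters in a value comes down to splitting an entry at the first = only. The sketch below illustrates that idea. It is not SCR's actual parser, it handles only a single NAME=value pair, and the function name and example path are hypothetical.

```python
def split_config_entry(entry: str):
    """Split 'NAME=value' at the first '=' so the value may itself contain '='."""
    name, sep, value = entry.partition("=")
    if not sep or not name.strip():
        raise ValueError(f"expected NAME=value, got: {entry!r}")
    return name.strip(), value.strip()


# Hypothetical entry whose value contains an '=' character.
print(split_config_entry("SCR_PREFIX=/p/lustre/run=3"))
```

Splitting on every = would instead break such a value into several pieces.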

v3.0rc1

16 Apr 20:08

This is the first release candidate for v3.0.

  • Improved support for large datasets and shared access to files. Applications can now configure SCR to bypass the cache and access datasets on the global file system:
    • Since bypass mode is more general, it is enabled by default. To use cache, one must disable bypass mode (SCR_CACHE_BYPASS=0).
    • For datasets that are too large to fit in cache or for systems that have no cache available, SCR can use the global file system. This improves portability so that applications can use SCR on any cluster.
    • For applications that write shared files, SCR can use bypass mode during the SCR Checkpoint/Output API.
    • For applications that write datasets as a file-per-process but require shared access to files during restart, one can write to cache but enable SCR_GLOBAL_RESTART. This rebuilds and flushes cached datasets during SCR_Init. It also enables bypass mode for restart, so an application can read its dataset from the global file system using the SCR Restart API.
  • Applications can now instruct SCR to load a specific checkpoint by naming it in the SCR_CURRENT parameter before calling SCR_Init.
  • Restart loop:
    • SCR now supports a loop around SCR_Have_restart, SCR_Start_restart, and SCR_Complete_restart. If an application detects a problem during its restart, it can pass valid=0 to SCR_Complete_restart. SCR will then load the next most recent checkpoint, which the application can query with another call to SCR_Have_restart.
  • New API calls:
    • SCR_Config(const char* config) to set and query SCR configuration parameters before SCR_Init()
    • SCR_Current(const char* name) enables an application that reads its checkpoint without using the SCR Restart API to inform SCR about which checkpoint it loaded so that SCR can still track the proper ordering of checkpoints
    • SCR_Delete(const char* name) to ask SCR to delete a dataset
    • SCR_Drop(const char* name) to ask SCR to drop a dataset from the index without deleting the underlying data files
  • New flush methods
  • New redundancy scheme:
    • Reed-Solomon encoding (SCR_COPY_TYPE=RS) allows a configurable number of failures per group, from 1 to N-1 where N is the set size. Use SCR_SET_SIZE to specify the group size and SCR_SET_FAILURES to specify the number of failures per group.
  • SCR configuration parameters now support interpolation of environment variables in configuration files, e.g.,
    >>: cat .scrconf
    SCR_CACHE_BASE=$BBPATH
    
  • SCR now preserves file metadata including atime, mtime, uid, gid, and mode bits
  • New logging options:
    • text file - written to the SCR prefix directory (SCR_LOG_TXT_ENABLE=1)
    • syslog - one can configure the syslog prefix, facility, and level to be used (SCR_LOG_SYSLOG_ENABLE=1)
  • Apps can now configure SCR to maintain a sliding window of checkpoints on the parallel file system with an SCR_PREFIX_SIZE parameter. After flushing a new checkpoint, SCR will delete older checkpoints
  • Default cache and control directories have been moved from /tmp to /dev/shm on Linux systems
  • Assists for application developers when integrating the SCR API
    • A new SCR_CACHE_PURGE parameter configures SCR to delete datasets from cache in new runs
    • A new SCR_PREFIX_PURGE parameter similarly deletes datasets from the prefix directory in new runs
    • Added internal checks to warn developers about incorrect API usage
  • Added Python bindings for the SCR library (beta)
    • Implemented in an scr.py module (import scr)
    • Uses C Foreign Function Interface (CFFI) to wrap C functions in libscr
    • Supports Python 2 and 3
  • Refactored code base to use ECP-VeloC components https://github.com/ecp-veloc/
    • Improves code modularity and reuse
    • Improved testing
    • scr-top package (https://github.com/llnl/scr-top) includes source for SCR and its ECP-VeloC components to simplify direct installs, e.g.,
      tar -xzf scr-top-v3.0rc1.tgz
      cd scr-top-v3.0rc1
      mkdir build install
      cd build
      cmake -DCMAKE_INSTALL_PREFIX=../install -DSCR_RESOURCE_MANAGER=SLURM ../
      make install
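
The environment variable interpolation shown in the .scrconf example above can be sketched like this. It only illustrates the idea and is not SCR's configuration parser: the function name, the treatment of # as a comment marker, and the /mnt/bb/user path are assumptions.

```python
from string import Template


def parse_conf(text, env):
    """Parse NAME=value lines and expand $VAR references using the env mapping."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and '#' comments are ignored
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that are not NAME=value
        # safe_substitute leaves unknown variables untouched instead of raising.
        params[name.strip()] = Template(value.strip()).safe_substitute(env)
    return params


conf_text = "SCR_CACHE_BASE=$BBPATH\n"
print(parse_conf(conf_text, {"BBPATH": "/mnt/bb/user"}))
```

Passing the environment as an explicit mapping keeps the sketch easy to test; a real run would read the process environment instead.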
      

SCR v2.0.0

29 Mar 00:11

🎉 SCR Version 2.0 🎉

This release marks a milestone in SCR's long history of bringing dependable, scalable file set management to multiple HPC platforms.

Some highlights include:

  • Support for multiple platform-specific hardware technologies, including Cray DataWarp
  • Portability across many HPC centers via scheduler integration
  • Scalable checkpoint resilience and restart capabilities

SCR v1.2.2

15 Oct 23:26
e32a324

Updates the SCR API function SCR_Route_file so that it always succeeds. When SCR_Route_file is called outside of a start/complete pair or when SCR is disabled, the original file path is simply copied to the return string.
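
The fallback behavior can be modeled in a few lines. This is a hypothetical sketch rather than SCR's implementation: the function name, its parameters, and the /dev/shm/scr-cache directory are assumptions used only to show when the original path is returned unchanged.

```python
import posixpath


def route_file(path, scr_enabled, in_phase, cache_dir="/dev/shm/scr-cache"):
    """Model of the routing decision; cache_dir is a hypothetical location.

    When SCR is disabled, or the call happens outside a start/complete pair,
    the original path is returned unchanged.
    """
    if not scr_enabled or not in_phase:
        return path
    # Otherwise the file is directed into the (hypothetical) cache directory.
    return posixpath.join(cache_dir, posixpath.basename(path))


print(route_file("ckpt/rank_0.dat", scr_enabled=False, in_phase=True))
print(route_file("ckpt/rank_0.dat", scr_enabled=True, in_phase=False))
print(route_file("ckpt/rank_0.dat", scr_enabled=True, in_phase=True))
```

Because the call always yields a usable path, an application can call it unconditionally and open whatever path comes back.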

SCR v1.2.1

02 Feb 14:44

This release includes a refresh of the SCR documentation which accompanies version 1.2.0:

  • NEW: SCR user documentation is now live and always up-to-date at scr.rtfd.io.
  • We've updated the various in-repo references to and copies of the user manual.

SCR v1.2.0

28 Nov 01:16
ece798a

This release includes many new features for SCR. Details can be found in the latest user manual: SCRv1.2-User-Manual.pdf.

New API Features:

  • Added char* SCR_Get_version(void) to query the SCR version. SCR's version information also appears in the scr.h header file.
  • We now have support for arbitrary file set outputs (not just checkpoints). Users can call int SCR_Start_output(const char* name, int flags) and int SCR_Complete_output(int valid) to wrap both checkpoint and arbitrary output write phases of their applications. The flags parameter should be used to describe the file set: SCR_FLAG_NONE, SCR_FLAG_CHECKPOINT, SCR_FLAG_OUTPUT. These flags can be combined with bitwise OR (|) for a single file set.
  • SCR has added functions for marking a restart phase of an application. Users can call int SCR_Have_restart(int* flag, char* name) to check if a checkpoint is available for application restart. Users can then use int SCR_Start_restart(char* name) and int SCR_Complete_restart(int valid) to mark the restart phase of the application.
  • SCR now allows for user-defined directories. That is, users are able to dictate the layout of their files on the PFS; SCR no longer requires that all files for a checkpoint exist in the same directory.
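
Combining the output flags with bitwise OR can be sketched as follows. The ScrFlag class is a hypothetical model, and the numeric values are assumptions chosen so that each flag occupies its own bit; the authoritative definitions are in scr.h.

```python
from enum import IntFlag


class ScrFlag(IntFlag):
    """Hypothetical model of the SCR output flags; values are assumed."""
    NONE = 0
    CHECKPOINT = 1
    OUTPUT = 2


# A file set that serves as both a checkpoint and user output.
flags = ScrFlag.CHECKPOINT | ScrFlag.OUTPUT

# Each property can be tested independently with bitwise AND.
is_checkpoint = bool(flags & ScrFlag.CHECKPOINT)
is_output = bool(flags & ScrFlag.OUTPUT)
print(int(flags), is_checkpoint, is_output)
```

Giving each flag its own bit is what allows a single integer argument to describe a file set with several properties at once.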

Other Changes:

  • We have upgraded our build system to CMake. This includes some preliminary testing available via make test.
  • SCR now supports Cray DataWarp burst buffer architectures! Users can trigger static linking (default on Cray systems) via the CMake option -DSCR_LINK_STATIC=ON.
  • The SCR Spack package has been updated to support all options and configurations, including some smart defaults for Cray machines.
  • We now support building SCR on a Mac.
  • We now have initial implementations of the SCR command line interface for interacting with the LSF and PMIx resource managers.
  • We've changed the behavior of SCR during a restart if SCR_Finalize() was called.

SCR v1.2.0 Release Candidate 1

27 Oct 21:40
Pre-release

Pre-release of Version 1.2.0.