============================== (Pending) Release Notes: v1.00 ==============================
Support for new training algorithms:
Support for new network structures:
Support for new layers:
Python front-end:
Performance optimizations:
Model portability & usability:
Internal features:
I/O & data readers:
Build system:
Bug fixes:
Retired features:
============================== Release Notes: v0.99 ==============================
Support for new training algorithms:
- Improvements to LTFB infrastructure (including transfer of SGD and Adam hyperparameters)
Support for new network structures:
- Support for Wide ResNets
Support for new layers:
Python front-end:
 - Python front-end for generating neural network architectures (lbann
   namespace), including layers, objective functions, callbacks, metrics,
   and optimizers (see the sketch after this list)
- Python interface for launching (SLURM or LSF) jobs on HPC systems
- Support for running LBANN experiments and capturing experimental output
- Network templates for AlexNet, LeNet, arbitrary ResNet models, and Wide ResNet models
 - Python scripts for LeNet, AlexNet, and (Wide) ResNets in the model zoo
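
   For illustration, a minimal sketch of this front-end is shown below. It
   assumes an LBANN installation whose Python package is on PYTHONPATH; the
   layer, model, and serialization signatures here follow a later LBANN
   release and may differ from v0.99, so treat this as the shape of the API
   rather than a drop-in script.

     # Sketch of the lbann Python front-end (names are real lbann
     # classes, but exact signatures vary between releases).
     import lbann
     import lbann.proto

     # Describe the network as a graph of layer objects.
     images = lbann.Input(data_field='samples')
     labels = lbann.Input(data_field='labels')
     x = lbann.FullyConnected(images, num_neurons=500, has_bias=True)
     x = lbann.Relu(x)
     x = lbann.FullyConnected(x, num_neurons=10, has_bias=True)
     probs = lbann.Softmax(x)

     # Objective function, metric, and callbacks attach to the same graph.
     loss = lbann.CrossEntropy(probs, labels)
     acc = lbann.CategoricalAccuracy(probs, labels)
     model = lbann.Model(
         epochs=10,
         layers=lbann.traverse_layer_graph([images, labels]),
         objective_function=loss,
         metrics=[lbann.Metric(acc, name='accuracy')],
         callbacks=[lbann.CallbackPrint(), lbann.CallbackTimer()])

     # Serialize to prototext; lbann.contrib.launcher.run(...) can
     # instead submit the experiment as a SLURM or LSF job.
     opt = lbann.SGD(learn_rate=0.01, momentum=0.9)
     lbann.proto.save_prototext('model.prototext',
                                model=model, optimizer=opt)
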
Performance optimizations:
- GPU implementation of RMSprop optimizer.
- cuDNN convolution algorithms are determined by empirically measuring
performance rather than using heuristics.
- Avoid setting up unused bias weights.
- Perform gradient accumulations in-place when possible.
Model portability & usability:
Internal features:
- Weight gradient allreduces are in-place rather than on a staging buffer.
- Fully connected and convolution layers only create bias weights when
needed.
- Optimizer exposes gradient buffers so they can be updated in-place.
 - Added callback support to explicitly save a model
- Min-max metric for reporting on multiple LTFB trainers
- Cleanup of Hydrogen interface to match Hydrogen v1.2.0
- Added type-erased matrix class for internal refactoring
 - Make CUB always log performance-critical events
I/O & data readers:
 - Python data reader that interacts with an embedded Python session (see
   the sketch after this list)
 - Optimized data store to provide a preload option
 - Extended data store to operate with the Cosmoflow-numpy data reader
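
   A minimal sketch of a user module for the embedded-Python data reader is
   shown below. The reader is pointed at functions like these from the data
   reader prototext (via fields such as module, sample_function,
   num_samples_function, and sample_dims_function); the exact field names
   are version-dependent, and the synthetic data here is purely
   illustrative.

     # Sketch of a user module consumed by LBANN's Python data reader.
     import numpy as np

     # Synthetic data set: 1000 random flattened 28x28 samples.
     _num_samples = 1000
     _dims = [28 * 28]
     _data = np.random.RandomState(0).rand(
         _num_samples, _dims[0]).astype(np.float32)

     def get_sample(index):
         """Return a single sample as a flat sequence of floats."""
         return _data[index]

     def num_samples():
         """Return the total number of samples in the data set."""
         return _num_samples

     def sample_dims():
         """Return the dimensions of a single sample."""
         return _dims
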
Build system:
- Added documentation for how users can use Spack to install LBANN
either directly or via environments.
- Conduit is a required dependency.
- Provided Spack environment for installing LBANN as a user
- Improved documentation on lbann.readthedocs.io
- CMake installs a module file in the installation directory that
sets up PATH and PYTHONPATH variables appropriately
Bug fixes:
- Models can now be copied or setup multiple times.
- Fixed incorrect weight initialization with multiple trainers.
 - Updated I/O random number generators to be C++ thread-safe (rather than OpenMP)
- Added an I/O random number generator for preprocessing that is independent
of the data sequence RNG.
 - Fixed initialization order of RNGs when using multiple models / trainers.
- General fixes for I/O and LTFB interaction.
Retired features:
- "Zero" layer (hack for early GAN implementation).
 - Removed data-reader-specific implementations of the data store (in favor of
   the Conduit-based data store)
============================== Release Notes: v0.98.1 ==============================
Bug Fixes:
- Added missing header
============================== Release Notes: v0.98 ==============================
Support for new training algorithms:
- Hyperparameter exploration with Adam optimizers
- LTFB can perform inter-trainer communication via checkpoint files
Support for new network structures:
 - Wasserstein autoencoder
Support for new layers:
- Squared difference
- Tessellate
- Clamp
Performance optimizations:
- Added support for node-local batch normalization
Model portability & usability:
 - Added a prototype Python front-end for generating model prototext files,
   inspired by PyTorch's interface
- Created Python library of networks and modules used for prototext
generation
- Support for exporting and importing models in ONNX format
- Output dumping callback exports in CSV, TSV, .npy, or .npz formats
- Added dedicated inference front end
Internal features:
- Expanded layer documentation
- Utility class for nicely formatted descriptions
- Switched to using ReadTheDocs for documentation which uses a
combination of doxygen, breathe, and sphinx
- Provided distinction between trainer and model objects
- Added a generic factory template
- Refactored front-end functionality into library class
I/O & data readers:
- Overhauled the I/O system to use an independent background thread
pool for fetching data
- Added support for data set metadata file that provides both schema
and normalization values unique to a given data set. Demonstrated
use in JAG Conduit data reader.
 - Added support for an index-list-based approach for describing the
   samples to use in training and testing. Note that this is
   currently only supported in the JAG Conduit data reader
 - Created a general-purpose data store that operates on generic
   Conduit node data structures. This should provide an extensible
   and generic approach for holding and exchanging data between
   epochs. Note that this is currently only supported in the JAG
   Conduit data reader.
Build system:
 - Support for using the Spack environments feature when building
Retired features:
- Removed deprecated objective functions and target layer
 - Removed the distributed I/O buffer layer; it has been superseded by the
   background I/O threads
============================== Release Notes: v0.97.1 ==============================
Bug Fixes:
- Removed deprecated header file include
============================== Release Notes: v0.97 ==============================
Support for new layers:
- Mean absolute error and L1 norm
- GPU implementation for activation layers
- Log sigmoid and softsign
- Channel-wise mean (temporary kludge)
Model portability & usability:
- Hints for layer output dimensions
- Confusion matrix callback
- Metric checking callback
Internal features:
- Removed target-layer-based features from model zoo
- Layer unit tests check for expected output values
Retired features:
- Smooth ReLU, bent identity, and swish layers
- Target-layer-based metrics
- Target-layer-based models (sequential, greedy layer-wise autoencoder, Siamese)
============================== Release Notes: v0.96 ==============================
Support for new layers:
- Log softmax
- Basic math functions
- Weights layer, which outputs a weights tensor
- L2 norm squared
- Binary cross entropy loss and sigmoid binary cross entropy loss
- Boolean accuracy, Boolean false negative rate, Boolean false positive rate
- Bilinear resize
- Variance and covariance
- Dilated and grouped convolution (GPU only)
Performance optimizations:
- Optimized GPU model-parallel softmax layer
Model portability & usability:
- Option for weight initialization with user-provided list of values
- Callback to save any layer output as an image
Internal features:
 - Provide a compile-time option to selectively disable OpenMP for the data-fetching loop
- Thrust calls no longer involve the default CUDA stream
I/O & data readers:
- Reworked jag_conduit data reader:
- Support the updated JAG simulation data output format
- Use direct HDF5 I/O for on-demand data loading with Conduit
- Ingest a unique set of data files per instance
- Allow exclusive data partitioning among multiple trainers
- Multi-channel images
- Normalization of JAG data
- Interface to select images of specific views and time indices
- Interface to describe how to slice JAG data
- Avoid redundant fetching and incoherent random number pulls in the group of local data readers
- Improved threading performance by preallocating scratch space for loading samples
Build system:
- Support cross-compilation configurations in superbuild and SetupProtobuf
============================== Release Notes: v0.95 ==============================
Support for new training algorithms:
- Generative Adversarial Networks (GAN)
Support for new network structures:
- Variational Autoencoders
- GAN
- CycleGAN
- Combined Autoencoders with CycleGAN
- Deep Recurrent Attention Model (DRAM), Ba et al. (2015)
- Video Recurrent Attention Model (VRAM)
Support for new layers:
- Optimized Top-K accuracy (CPU, GPU)
- Crop (CPU, GPU)
 - Sort (CPU, GPU), in both ascending and descending order
- Absolute value (CPU, GPU)
- Mean-squared (CPU, GPU)
- Top-K categorical accuracy (CPU, GPU)
- Cross-entropy (CPU, GPU)
- Stop gradient (CPU, GPU)
Performance optimizations:
 - Use pinned memory for CPU activation matrices
- Non-blocking GPU computation of objective functions and metrics
- Refactored weight matrices and weight initialization
- Manage GPU workspace buffers with memory pool
 - Slice and concatenation layers emit matrix views when possible
 - Use more fine-grained asynchronous calls with the Aluminum library
- Minimized GPU stream synchronization events per call
- Improved / minimized synchronization events when using a single GPU
- Fixed GPU workspace size
- GPU implementation of Adagrad optimizer
- GPU model-parallel softmax
- Optimized local CUDA kernel implementations
- Support for distributed matrices with arbitrary alignment
Model portability & usability:
- Keras to LBANN prototext conversion tool
Internal features:
- Support for multiple objective functions and metrics per network with arbitrary placement
- Objective functions represented as layers
- Metrics represented as layers
- Introduced evaluation layer construct
- Ability to freeze specific layers for pre-training / fine-tuning
 - Refactored tensor setup in setup, forward prop, and back prop
- Layers store matrices in private smart pointers
- Model automatically inserts evaluation layers where needed
- Copy Layer activations between models
- Annotated GPU profiling output with training phases
- Fixed initialization of Comm object and Grid objects when using multiple models
- General code cleanup, refactoring, and various bug fixes.
- All layers overwrite error signal matrices
 - NCCL backend is now implemented via the Aluminum library
- MPI calls are routed through the LBANN Comm object into Hydrogen or Aluminum
- Provide runtime statistics summary from every rank
- Reworked LBANN to use Hydrogen to manage GPU memory
- GPU allocations now via CUB memory pool
 - Fixed Spack build interaction with the Hydrogen library
I/O & data readers:
- Support for Conduit objects with HDF5 formatting
- In-memory and locally offloaded data store
 - Data store can hold the entire training set in memory (or node-local storage)
 - Data store will shuffle data samples between epochs and present them to the input layer
- Updated synthetic data reader
- Modified data readers to handle bad samples in JAG conduit data
- Reworked the I/O layers (input and target) so that the input layer produces both the
sample and label / response if necessary.
- Target layer is being deprecated
- Updated image data reader to use cv::imdecode to accelerate image load times
- Allow users to specify an array of data sources for the independent/dependent
variables via prototext
============================== Release Notes: v0.94 ==============================
Support for new training algorithms:
- Back-Propagation Through Time (BPTT)
-- Recurrent Neural Networks (RNN)
 -- Long Short-Term Memory (LSTM)
- Generative Adversarial Networks (GAN)
- Variational autoencoders
- Convolutional autoencoders
- Fine tuning of pretrained networks
-- Flexible weight freezing
- Context-prediction network (Siamese network)
- Livermore Tournament Fast Batch learning (LTFB)
- Variable mini-batch sizes
Support for new network structures:
- Directed Acyclic Graph (DAG) networks
- Residual networks
- Modular and composable objective functions
- Multiple metrics
- Shared weight matrices
 - (BETA) New evaluation layer that can be attached to any point of the DAG
- Motifs (compound, reused network patterns)
Support for new layers:
- Learning:
- Deconvolution
- Metrics:
 -- Top-K categorical accuracy, Pearson correlation, Mean absolute deviation
- Loss Functions:
-- Cross Entropy with Uncertainty, Geometric negative log likelihood
-- Poisson Negative log likelihood, Polya Negative Log Likelihood
- Optimizers:
-- Hypergradient Adam
 - Transform Layers:
 -- Concatenation, Noise, Unpooling, Pooling, Reshape, Slice, Split, Sum
 - Regularizer:
 -- Batch Normalization, SELU Dropout, Local Response Normalization (LRN)
 - Activations:
 -- Leaky ReLU, Smooth ReLU, ELU, Scaled ELU, Softplus, Atan,
 -- Bent Identity, Exponential
Performance optimizations:
- GPU acceleration for most layers
- NCCL 2.X
- Optimized communication patterns
- Asynchronous weight updates
- Asynchronous metric and objective function updates
 - Batch normalization (global and local)
- L2 normalization
- Adaptive Quantization (inter-model)
Model portability & usability:
- Portable checkpoints / recovery
- Distributed checkpoint / recovery
- Network visualization
- Export LBANN to TensorFlow format
Internal features:
- Gradient checking
- Network representation using tensor dimensions
- Bamboo continuous integration (CI)
- Improved data processing pipeline
New data readers:
- Numpy
- CSV
- Methods for merging multiple features and samples across files
- CANDLE Pilot 2
- CANDLE Pilot 1 Combo
- ICF JAG
Integration with Hydrogen, an optimized distributed, dense linear algebra
library. Hydrogen is a fork of the Elemental library that optimizes
distributed matrices with elemental and block distributions, BLAS and
LAPACK operations, and distributed and local matrix management.
Integration with Aluminum, an optimized all-reduce communication library.
Aluminum provides custom reduction patterns, customized CUDA reduction
kernels, and asynchronous communication operators. It uses MPI, MPI with
GPUDirect, or NCCL as back-end libraries. Aluminum enables LBANN to
effectively use non-blocking all-reduces during backprop and optimization.
Additionally, we have added support for an online, distributed data store.
When enabled, LBANN ingests the entire training data set in a distributed
fashion across all ranks. Each data store then serves its portion of a
mini-batch, dynamically moving data to the necessary ranks in the model
(based on the mini-batch data distribution).
============================== Release Notes: v0.93 ==============================
This release contains a major refactoring / overhaul of the code base.
Key highlights include:
- Moving layer design into smaller simpler layers that have a single
compute behavior per layer. Specifically, linear combination of the
inputs, non-linear activations, and regularizers now exist as their
own layers.
- Layers now have a template parameter that specifies the data layout
for the distributed matrices.
- Prototext interface for specifying neural network models and data
readers is nearly fully functional.
- Code now adheres to internal coding style as outlined in
README_coding_style.txt
- Dead-code has been eliminated and layer file hierarchy has been
cleaned up.
============================== Release Notes: v0.92 ==============================
New features include (but are not limited to):
- Full support for convolutional and pooling layers
- GPU acceleration of local Elemental GEMM operations
- Improved network and data reader support
 -- AlexNet
-- VGG
-- CIFAR-10
- Added a suite of regularizers, objective functions, and metrics, including:
-- Batch normalization
-- Drop-out
-- L2
 - Dramatically improved the performance of inter-model communication
 - Added a suite of image preprocessing routines
============================== Release Notes: v0.91 ==============================
Incorporates a number of changes throughout the LBANN code base. In
particular, there is a new build system that tries to have LBANN
download all of its dependencies into its build tree and compile them
locally. Additional improvements include optimizations in the
data-parallel, multiple-model training framework, support for
convolutional layers, and general bug fixes.
============================== Release Notes: v0.90 ==============================
Initial release of the LBANN toolkit.