Nvidia-DL-Framework-Inspect

This tool helps debug convergence issues and test new algorithms and recipes for training LLMs with Nvidia libraries: Transformer Engine, Megatron-LM, and NeMo. It is compatible with any PyTorch model and with any combination of our libraries (e.g., Megatron-LM or NeMo together with Transformer Engine, or Transformer Engine, Megatron-LM, or NeMo standalone).

Configuration files let users enable or disable specific features for layers selected by name. This keeps debugging targeted: users can add framework-specific features and load only the features they need, rather than all debug features.

Additionally, APIs are provided for processing tensors and collecting tensor statistics. These APIs can be easily customized based on user specifications, providing a flexible and adaptable debugging solution.

Getting Started

Installation

Please check the installation doc

Initialization

To initialize the tool, call the initialize API in a global context. For multi-GPU runs, call it once on every rank.

# import the tool
import nvdlfw_inspect.api as debug_api

# Initialize the debug API
debug_api.initialize()

Configuration Files

The config file contains all the features that will be enabled in the run. The config is structured as follows:

config_name: # Config name. Used to differentiate between configs in the same file.
  enabled: True
  layers: # This field is used for layer selection.
    layer_name_regex_pattern: .*(fc1|qkv) # In this config, all layers ending with fc1 or qkv are selected.
  LogTensorStats: # Enabling a feature class called `LogTensorStats`.
    enabled: True
    type: [mean, std, l1_norm] # Type of statistics to log
    tensors: [activation, weight, gradient] # statistics collected from these tensors
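
To make the layer selection concrete, the sketch below applies the pattern from the config above to a few made-up layer names using Python's standard re module. It assumes the pattern has to match the whole layer name (re.fullmatch); the tool's actual matching rule may differ, and the layer names and the select_layers helper are hypothetical.

```python
import re

# Pattern copied from the config above.
PATTERN = re.compile(r".*(fc1|qkv)")


def select_layers(layer_names):
    """Return the names the pattern matches in full (assumed semantics)."""
    return [name for name in layer_names if PATTERN.fullmatch(name)]


# Hypothetical layer names, for illustration only.
names = [
    "decoder.layers.0.self_attention.qkv",
    "decoder.layers.0.mlp.fc1",
    "decoder.layers.0.mlp.fc2",
    "decoder.layers.0.self_attention.proj",
]

selected = select_layers(names)
print(selected)
```

Under this assumption only the names ending in fc1 or qkv are kept, which agrees with the comment in the config.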

To load a configuration file, pass the config_file argument when initializing the debug API.

debug_api.initialize(config_file="debug_config.yaml")
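
As a rough end-to-end sketch, the snippet below writes the example config to a file from Python and then passes its path to initialize. The config name log_stats_example, the write_config helper, and the temporary directory are hypothetical choices made for illustration; the import is guarded so the file-writing part still runs where the library is not installed.

```python
import tempfile
from pathlib import Path

# Same structure as the example config above; the config name is made up.
CONFIG_TEXT = """\
log_stats_example:
  enabled: True
  layers:
    layer_name_regex_pattern: .*(fc1|qkv)
  LogTensorStats:
    enabled: True
    type: [mean, std, l1_norm]
    tensors: [activation, weight, gradient]
"""


def write_config(directory):
    """Write the config text to debug_config.yaml and return its path."""
    path = Path(directory) / "debug_config.yaml"
    path.write_text(CONFIG_TEXT)
    return path


config_path = write_config(tempfile.mkdtemp())

try:
    import nvdlfw_inspect.api as debug_api
except ImportError:
    debug_api = None  # library not installed; only the file is written

if debug_api is not None:
    # Call shape taken from the line above.
    debug_api.initialize(config_file=str(config_path))
```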

To learn more about the structure of the config file, check out docs/config_file.md or look through some example configs in the examples/configs/ directory.

Loading Specific Features

Specific debug features can be loaded by passing the feature_dirs argument when initializing the debug API. This registers all features and framework APIs found under the given directories. By default, the debug tool loads only the generic features found in debug_features/.

# Example to load transformer_engine specific features
debug_api.initialize(config_file="debug_config.yaml", feature_dirs=["/path/to/transformer_engine/debug/features"])

Available Features

To list the available features, run the following after initialization. This gives the list of features that can be enabled through the configuration file.

debug_api.list_features()

To learn more about each of these features, run the following after initialization.

debug_api.explain_features(features)

The argument can be a string or a list of feature names. To print all the features, pass the string 'all'.

API Usage

All APIs are defined under a namespace. Generic, framework-agnostic features provide generic APIs that can be accessed as:

debug_api.log_tensor_stats(layer_name, tensor=weight, tensor_name="weight")

Framework-specific features, in contrast, provide framework-specific APIs that can be accessed as:

# Even though the API name is the same, it uses transformer_engine specific API to log transformer_engine specific tensor statistics.
debug_api.transformer_engine.log_tensor_stats(layer_name, tensor=weight, tensor_name="weight")
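
The sketch below shows one way such calls might be issued once per layer. It uses only the call shape shown above (layer_name, tensor, tensor_name) and assumes that shape is complete. RecordingAPI is a hypothetical stand-in that records calls so the example runs without the library, and log_weights and the layer names are likewise made up.

```python
class RecordingAPI:
    """Hypothetical stand-in that records calls instead of logging statistics."""

    def __init__(self):
        self.calls = []

    def log_tensor_stats(self, layer_name, tensor, tensor_name):
        self.calls.append((layer_name, tensor_name))


def log_weights(api, weights_by_layer):
    """Call log_tensor_stats once per layer, using the call shape shown above."""
    for layer_name, weight in weights_by_layer.items():
        api.log_tensor_stats(layer_name, tensor=weight, tensor_name="weight")


api = RecordingAPI()
# Plain lists stand in for weight tensors here.
log_weights(api, {"mlp.fc1": [0.1, -0.2], "attn.qkv": [0.3, 0.4]})
print(api.calls)
```

With the library installed, passing debug_api or debug_api.transformer_engine in place of the stand-in would presumably route the same calls to the generic or the framework-specific namespace.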

Documentation

Please read the following docs for more information on adding new features, namespaces and APIs.

Getting Started: docs/getting_started.md
Logging: docs/logging.md
Config: docs/config_file.md
Debug Features: docs/features.md
Namespaces: docs/namespaces.md
