From 0933e1ba236c014f7df362a2110b7f09873bc1f5 Mon Sep 17 00:00:00 2001
From: Jacek Bieniusiewicz
Date: Tue, 17 Dec 2024 18:33:56 +0100
Subject: [PATCH] updated FT and Straggler Det. docs; added version on the main page

---
 docs/source/fault_tolerance/api/callback.rst  |  1 -
 docs/source/fault_tolerance/api/client.rst    |  1 -
 docs/source/fault_tolerance/api/config.rst    |  2 +-
 docs/source/fault_tolerance/api/server.rst    |  1 -
 docs/source/fault_tolerance/examples.rst      |  1 +
 .../fault_tolerance/examples/train_ddp.rst    |  6 ++
 docs/source/fault_tolerance/usage_guide.rst   | 68 ++++++++-----------
 docs/source/index.rst                         |  2 +
 docs/source/straggler_det/api/callback.rst    |  3 +-
 docs/source/straggler_det/api/reporting.rst   |  2 +-
 docs/source/straggler_det/api/statistics.rst  |  2 +-
 docs/source/straggler_det/api/straggler.rst   |  1 -
 12 files changed, 43 insertions(+), 47 deletions(-)
 create mode 100644 docs/source/fault_tolerance/examples/train_ddp.rst

diff --git a/docs/source/fault_tolerance/api/callback.rst b/docs/source/fault_tolerance/api/callback.rst
index b051a79..f811606 100644
--- a/docs/source/fault_tolerance/api/callback.rst
+++ b/docs/source/fault_tolerance/api/callback.rst
@@ -3,5 +3,4 @@ Callback
 
 .. automodule:: nvidia_resiliency_ext.ptl_resiliency.fault_tolerance_callback
     :members:
-    :undoc-members:
     :show-inheritance:
diff --git a/docs/source/fault_tolerance/api/client.rst b/docs/source/fault_tolerance/api/client.rst
index 69196ef..708cdf1 100644
--- a/docs/source/fault_tolerance/api/client.rst
+++ b/docs/source/fault_tolerance/api/client.rst
@@ -3,5 +3,4 @@ Client
 
 .. automodule:: nvidia_resiliency_ext.fault_tolerance.rank_monitor_client
     :members:
-    :undoc-members:
     :show-inheritance:
\ No newline at end of file
diff --git a/docs/source/fault_tolerance/api/config.rst b/docs/source/fault_tolerance/api/config.rst
index 00e8d10..3c6529d 100644
--- a/docs/source/fault_tolerance/api/config.rst
+++ b/docs/source/fault_tolerance/api/config.rst
@@ -3,5 +3,5 @@ Config
 
 .. automodule:: nvidia_resiliency_ext.fault_tolerance.config
     :members:
-    :undoc-members:
     :show-inheritance:
+    :no-index:
\ No newline at end of file
diff --git a/docs/source/fault_tolerance/api/server.rst b/docs/source/fault_tolerance/api/server.rst
index c40289e..ec8e6e6 100644
--- a/docs/source/fault_tolerance/api/server.rst
+++ b/docs/source/fault_tolerance/api/server.rst
@@ -3,5 +3,4 @@ Server
 
 .. automodule:: nvidia_resiliency_ext.fault_tolerance.rank_monitor_server
     :members:
-    :undoc-members:
     :show-inheritance:
diff --git a/docs/source/fault_tolerance/examples.rst b/docs/source/fault_tolerance/examples.rst
index 04d55b4..18e582c 100644
--- a/docs/source/fault_tolerance/examples.rst
+++ b/docs/source/fault_tolerance/examples.rst
@@ -6,3 +6,4 @@ Examples
     :caption: Examples
 
     examples/basic_example.rst
+    examples/train_ddp.rst
\ No newline at end of file
diff --git a/docs/source/fault_tolerance/examples/train_ddp.rst b/docs/source/fault_tolerance/examples/train_ddp.rst
new file mode 100644
index 0000000..66299ef
--- /dev/null
+++ b/docs/source/fault_tolerance/examples/train_ddp.rst
@@ -0,0 +1,6 @@
+DDP usage example
+==================
+
+.. literalinclude:: ../../../../examples/fault_tolerance/train_ddp.py
+    :language: python
+    :linenos:
\ No newline at end of file
diff --git a/docs/source/fault_tolerance/usage_guide.rst b/docs/source/fault_tolerance/usage_guide.rst
index 5eedd9b..267b3ec 100644
--- a/docs/source/fault_tolerance/usage_guide.rst
+++ b/docs/source/fault_tolerance/usage_guide.rst
@@ -41,21 +41,41 @@ FT Package Design Overview
 FT Integration Guide for PyTorch
 ********************************
 
-Prerequisite:
-=============
+1. Prerequisites:
+=================
 Run ranks using ``ft_launcher``. The command line is mostly compatible with ``torchrun``.
-FT configuration is passed to ``ft_launcher`` via YAML file ``--fault-tol-cfg-path`` or CLI arguments (``--ft-param-...``).
+
+.. note::
+    Some clusters (e.g. SLURM) use SIGTERM as a default method of requesting a graceful workload shutdown.
+    It is recommended to implement appropriate signal handling in a fault-tolerant workload.
+    To avoid deadlocks and other unintended side effects, signal handling should be synchronized across all ranks.
+    Please refer to the :doc:`train_ddp.py example ` for a basic signal handling implementation.
+
+
+2. FT configuration:
+====================
+
+FT configuration is passed to ``ft_launcher``
+via YAML file ``--fault-tol-cfg-path`` or CLI arguments ``--ft-param-...``,
+from where it's propagated to other FT components.
+
 Timeouts for fault detection need to be adjusted for a given workload:
     * ``initial_rank_heartbeat_timeout`` should be long enough to allow for workload initialization.
     * ``rank_heartbeat_timeout`` should be at least as long as the longest possible interval between steps.
 
-Integration with a workload:
-============================
-1. Initialize a ``RankMonitorClient`` instance on each rank with ``RankMonitorClient.init_workload_monitoring()``.
-2. *(Optional)* Restore the state of ``RankMonitorClient`` instances using ``RankMonitorClient.load_state_dict()``.
+**Importantly, heartbeats are not sent during checkpoint loading and saving**, so time for checkpointing-related operations should be taken into account.
+
+Summary of all FT configuration items:
+
+.. autoclass:: nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig
+
+
+3. Integration with a PyTorch workload:
+=======================================
+1. Initialize a ``RankMonitorClient`` instance on each rank with ``RankMonitorClient.init_workload_monitoring()``.
+2. *(Optional)* Restore the state of ``RankMonitorClient`` instances using ``RankMonitorClient.load_state_dict()``.
 3. Periodically send heartbeats from ranks using ``RankMonitorClient.send_heartbeat()``.
 4. *(Optional)* After a sufficient range of heartbeat intervals has been observed, call ``RankMonitorClient.calculate_and_set_timeouts()`` to estimate timeouts.
-   **Note:** Operations such as checkpoint loading and saving might result in longer intervals between heartbeats.
 5. *(Optional)* Save the ``RankMonitorClient`` instance's ``state_dict()`` to a file so that computed timeouts can be reused in the next run.
 6. Shut down ``RankMonitorClient`` instances using ``RankMonitorClient.shutdown_workload_monitoring()``.
 
@@ -68,12 +88,9 @@ This section describes Fault Tolerance integration with a PTL-based workload (i.
 1. Use ``ft_launcher`` to start the workload
 ============================================
 
-Fault tolerance relies on a special launcher (``ft_launcher``), which is a modified ``torchrun``. The FT launcher runs background processes called rank monitors.
-**You need to use ft_launcher to start your workload if you are using FT**.
-For example, the `NeMo-Framework-Launcher `_ can be used to generate SLURM batch scripts with FT support.
+Fault tolerance relies on a special launcher (``ft_launcher``), which is a modified ``torchrun``.
+If you are using NeMo, the `NeMo-Framework-Launcher `_ can be used to generate SLURM batch scripts with the FT support.
 
-``ft_launcher`` is similar to ``torchrun`` but it starts a rank monitor for each started rank. ``ft_launcher`` takes the FT configuration in a YAML file (``--fault-tol-cfg-path``) or via CLI args (``--ft-param-...``).
-FT configuration items are described in the :class:`FaultToleranceConfig ` docstring.
 
 2. Add FT callback to the PTL trainer
 =====================================
@@ -128,28 +145,3 @@ The following mechanism can be used to implement an auto-resuming launcher scrip
 
     * If ``FAULT_TOL_FINISHED_FLAG_FILE`` exists, the auto-resume loop can be broken, as the training is completed.
     * If ``FAULT_TOL_FINISHED_FLAG_FILE`` does not exist, the continuation job can be issued (other conditions can be checked, e.g., if the maximum number of failures is not reached).
-
-4. FT configuration
-===================
-
-FT configuration is passed to ``ft_launcher`` via a YAML file or CLI args, from where it's propagated to other FT components.
-
-Timeouts for fault detection need to be adjusted for a given workload:
-    * ``initial_rank_heartbeat_timeout`` should be long enough to allow for workload initialization.
-    * ``rank_heartbeat_timeout`` should be at least as long as the longest possible interval between steps.
-
-**Importantly, heartbeats are not sent during checkpoint loading and saving**, so time for checkpointing-related operations should be taken into account.
-
-If ``calculate_timeouts: True``, timeouts will be automatically estimated based on observed intervals. Estimated timeouts take precedence over timeouts defined in the config file.
-**Timeouts are estimated at the end of a training run when checkpoint loading and saving were observed**. Hence, in a multi-part training started from scratch,
-estimated timeouts won't be available during the initial two runs. Estimated timeouts are stored in a separate JSON file.
-
-``max_subsequent_job_failures`` allows for the automatic continuation of training on a SLURM cluster. This feature requires the SLURM job to be scheduled with ``NeMo-Framework-Launcher`` or other compatible launcher framework.
-If ``max_subsequent_job_failures`` value is `>0`, a continuation job is prescheduled. It will continue the work until ``max_subsequent_job_failures`` subsequent jobs fail (SLURM job exit code is `!= 0`)
-or the training is completed successfully ("end of training" marker file is produced by the ``FaultToleranceCallback``, i.e., due to iterations or time limit reached).
-
-Summary of all FT configuration items:
-
-.. autoclass:: nvidia_resiliency_ext.fault_tolerance.config.FaultToleranceConfig
-    :members:
-    :exclude-members: from_args, from_kwargs, from_yaml_file, to_yaml_file
\ No newline at end of file
diff --git a/docs/source/index.rst b/docs/source/index.rst
index e0b8e0f..d5ba702 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -3,6 +3,8 @@ nvidia-resiliency-ext
 
 **nvidia-resiliency-ext** is a set of tools developed by NVIDIA to improve large-scale distributed training resiliency.
 
+**Documentation for version 0.2.0**
+
 Features
 --------
 
diff --git a/docs/source/straggler_det/api/callback.rst b/docs/source/straggler_det/api/callback.rst
index a6e47fa..632c0b4 100644
--- a/docs/source/straggler_det/api/callback.rst
+++ b/docs/source/straggler_det/api/callback.rst
@@ -3,5 +3,4 @@ Callback
 
 .. automodule:: nvidia_resiliency_ext.ptl_resiliency.straggler_det_callback
     :members:
-    :undoc-members:
-    :show-inheritance:
\ No newline at end of file
+    :show-inheritance:
diff --git a/docs/source/straggler_det/api/reporting.rst b/docs/source/straggler_det/api/reporting.rst
index 074de3c..53191e4 100644
--- a/docs/source/straggler_det/api/reporting.rst
+++ b/docs/source/straggler_det/api/reporting.rst
@@ -3,6 +3,6 @@ Reporting
 
 .. automodule:: nvidia_resiliency_ext.straggler.reporting
     :members:
-    :undoc-members:
     :show-inheritance:
 
+
diff --git a/docs/source/straggler_det/api/statistics.rst b/docs/source/straggler_det/api/statistics.rst
index c75ba87..54f6d77 100644
--- a/docs/source/straggler_det/api/statistics.rst
+++ b/docs/source/straggler_det/api/statistics.rst
@@ -3,6 +3,6 @@ Statistics
 
 .. automodule:: nvidia_resiliency_ext.straggler.statistics
     :members:
-    :undoc-members:
     :show-inheritance:
 
+
diff --git a/docs/source/straggler_det/api/straggler.rst b/docs/source/straggler_det/api/straggler.rst
index 3899057..49caab1 100644
--- a/docs/source/straggler_det/api/straggler.rst
+++ b/docs/source/straggler_det/api/straggler.rst
@@ -3,5 +3,4 @@ Straggler
 
 .. automodule:: nvidia_resiliency_ext.straggler.straggler
     :members:
-    :undoc-members:
     :show-inheritance:
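
Reviewer note: the six-step list that this patch renames to "3. Integration with a PyTorch workload" could be illustrated with a short call-order sketch. The one below is only a sketch under stated assumptions. ``StubRankMonitorClient`` is a hypothetical stand-in that mirrors the method names quoted in the usage guide; it is not the ``nvidia_resiliency_ext`` class, its signatures are guesses, and the "longest interval times a safety factor" rule is an illustrative assumption, not the library's estimator.

```python
import time


class StubRankMonitorClient:
    """Hypothetical stand-in mirroring the method names from the usage guide.

    NOT the nvidia_resiliency_ext implementation; signatures are assumptions.
    """

    def __init__(self):
        self.calls = []        # records the order of lifecycle calls
        self.timeouts = None   # set by calculate_and_set_timeouts()
        self._last_beat = None
        self._intervals = []

    def init_workload_monitoring(self):
        self.calls.append("init")

    def load_state_dict(self, state):
        self.calls.append("load_state")
        self.timeouts = state.get("timeouts")

    def send_heartbeat(self):
        now = time.perf_counter()
        if self._last_beat is not None:
            self._intervals.append(now - self._last_beat)
        self._last_beat = now
        self.calls.append("heartbeat")

    def calculate_and_set_timeouts(self):
        if not self._intervals:
            raise ValueError("need at least two heartbeats to estimate timeouts")
        # Assumed rule, for illustration only: longest observed interval x 2.
        self.timeouts = {"rank_heartbeat_timeout": max(self._intervals) * 2.0}
        self.calls.append("calc_timeouts")

    def state_dict(self):
        return {"timeouts": self.timeouts}

    def shutdown_workload_monitoring(self):
        self.calls.append("shutdown")


def train(client, num_steps, saved_state=None):
    """Walk through steps 1-6 of the integration list in order."""
    client.init_workload_monitoring()            # step 1
    if saved_state is not None:
        client.load_state_dict(saved_state)      # step 2 (optional)
    try:
        for _ in range(num_steps):
            time.sleep(0.001)                    # placeholder for one training step
            client.send_heartbeat()              # step 3
        client.calculate_and_set_timeouts()      # step 4 (optional)
        state = client.state_dict()              # step 5 (optional): persist this
    finally:
        client.shutdown_workload_monitoring()    # step 6, also runs on failure
    return state
```

Putting the shutdown call in ``finally`` is a design choice of this sketch, so that monitoring is torn down even when a step raises; the guide itself does not prescribe it.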