This directory contains advanced examples demonstrating various fault tolerance features and training approaches in TorchFT beyond the basic `train_ddp.py` example in the [README](../README.md).
Each directory contains a README with more detailed instructions, as well as extensive documentation on the feature being showcased and how to interpret the outputs.
## List of Examples
- [DDP with proactive failure recovery](./ddp_proactive/README.md): Demonstrates DDP with proactive failure recovery mode. When you run it, you should observe that the process with replica group id 1 exits early while the process with replica group id 0 quickly resumes training. If the same script is run after setting `export TORCHFT_PROACTIVE_RECOVERY=0`, the process with replica group id 1 will instead hang for dozens of seconds before continuing (see the sketch after this list).
- [DiLoCo](./diloco/README.md): Demonstrates DiLoCo (Distributed Low-Communication) training
- [LocalSGD](./localsgd/README.md): Demonstrates Local SGD with periodic synchronization
- [Live Checkpoint Recovery](./live_checkpoint_recovery/README.md): Demonstrates live checkpoint recovery
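
As a minimal sketch of the proactive recovery comparison described above: the commands below assume the launch procedure from the ddp_proactive README, so treat `torchx run` here as a stand-in for whatever launch command that README actually specifies.

```sh
# Proactive recovery is controlled via an environment variable; run with it
# enabled first, then again disabled, and compare how long replica group 1
# takes to recover.
cd ddp_proactive
export TORCHFT_PROACTIVE_RECOVERY=1
torchx run  # stand-in for the launch command in ./ddp_proactive/README.md

# With proactive recovery off, replica group 1 should hang for dozens of
# seconds before continuing.
export TORCHFT_PROACTIVE_RECOVERY=0
torchx run
```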
Setting the `QUICK_RUN` environment variable runs the examples for far fewer steps and uses a synthetic dataset rather than a downloaded one. It is useful for testing the examples quickly.
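
For example, a quick smoke test of an example might look like the following (again assuming `torchx run` stands in for the example's documented launch command):

```sh
# QUICK_RUN trims the step count and swaps in a synthetic dataset,
# so no dataset download is needed.
export QUICK_RUN=1
torchx run  # stand-in for the example's launch command
```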
See the `.torchxconfig` file in each example directory for configuration details, and see [torchx.py](../torchft/torchx.py) and the [TorchX documentation](https://pytorch.org/torchx/latest/) to understand how DDP is run.
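
Assuming each `.torchxconfig` sets the component and its defaults (as the file's presence in every example directory suggests), launching from inside an example directory can be as simple as:

```sh
# Run from inside an example directory; torchx picks up the component,
# scheduler, and default arguments from that directory's .torchxconfig.
cd ddp_proactive
torchx run

# Inspect the run options supported by your scheduler:
torchx runopts
```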