update 2024.12.2
yuxiaoba committed Dec 2, 2024
1 parent 628443e commit 91747e7
Showing 2 changed files with 9 additions and 3 deletions.
7 changes: 6 additions & 1 deletion LLM Training/README.md
@@ -4,6 +4,7 @@
- 24_NSDI_Characterization of Large Language Model Development in the Datacenter [[paper]](https://www.usenix.org/system/files/nsdi24-hu.pdf)
- 24_Revisiting Reliability in Large-Scale Machine Learning Research Clusters [[paper]](https://arxiv.org/abs/2410.21680) [Meta]
- 24_ATC_SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation [[paper]](https://arxiv.org/abs/2402.06194) [Microsoft] [Best Paper]
- 24_ATC_Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field [[paper]](https://www.usenix.org/conference/atc24/presentation/wu-ronglong)
- 24_NSDI_MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs [[paper]](https://www.usenix.org/system/files/nsdi24-jiang-ziheng.pdf)
- 24_The Llama 3 Herd of Models [[paper]](https://ai.meta.com/research/publications/the-llama-3-herd-of-models/) [Meta]
> During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions.
@@ -13,4 +14,8 @@
## Checkpoint

- 24_Eurosys_Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures [[paper]](https://dl.acm.org/doi/pdf/10.1145/3627703.3650085)
> Most errors during training occur due to failures of a single GPU or network device (either hardware or transient errors), while host/CPU and simultaneous multi-node failures are extremely rare
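
The quoted observation is what makes just-in-time recovery cheap: if failures are almost always confined to a single GPU or NIC, the surviving data-parallel replicas still hold the full training state, so a checkpoint can be taken on the error signal itself instead of at a fixed interval. Below is a minimal single-process PyTorch sketch of that trigger-on-failure idea; every name here is hypothetical, and the paper's actual system checkpoints from healthy replicas rather than from the failing process.

```python
# Sketch only: checkpoint when an error surfaces, not every N steps.
# Assumes `model(batch)` returns a per-sample loss; all names hypothetical.
import torch

def train(model, optimizer, data_loader, ckpt_path="jit_ckpt.pt"):
    step = 0
    try:
        for batch in data_loader:
            loss = model(batch).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
    except RuntimeError:
        # CUDA/NCCL failures surface as RuntimeError in PyTorch. Writing the
        # checkpoint here loses at most the current mini-batch, rather than
        # all work since the last periodic checkpoint.
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, ckpt_path)
        raise
```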

## Schedule

- 24_SC_PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters [[paper]](https://arxiv.org/abs/2408.11919)
5 changes: 3 additions & 2 deletions Root_cause_analysis/README.md
@@ -32,11 +32,12 @@
- 20_VLDB_Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases [[paper]](http://www.vldb.org/pvldb/vol13/p1176-ma.pdf) [[code]](https://github.com/NetManAIOps/DejaVu/blob/master/iSQUAD/iSQ.py)
- 20_IWQoS_Localizing Failure Root Causes in a Microservice through Causality Inference [[paper]](https://ieeexplore.ieee.org/document/9213058)
- 20_NOMS_MicroRCA: Root Cause Localization of Performance Issues in Microservices [[paper]](https://ieeexplore.ieee.org/document/9110353) [[code]](https://github.com/elastisys/MicroRCA)
- 20_ICSE_Debugging Crashes using Continuous Contrast Set Mining [[paper]](https://arxiv.org/abs/1911.04768)
- 19_WWW_ϵ-Diagnosis: Unsupervised and Real-time Diagnosis of Small-window Long-tail Latency in Large-scale Microservice Platforms [[paper]](https://monadyn.github.io/Papers/p3215-shan.pdf) [[code]](https://github.com/salesforce/PyRCA)
- 18_ICSOC_Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments [[paper]](https://link.springer.com/chapter/10.1007/978-3-030-03596-9_1)
- 18_OSDI_Orca: Differential Bug Localization in Large-Scale Services [[paper]](https://www.usenix.org/conference/osdi18/presentation/bhagwan) [Best Paper]
- 14_NSDI_Adtributor: Revenue Debugging in Advertising Systems [[paper]](https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-bhagwan.pdf)
- 16_ICSE_iDice: Problem Identification for Emerging Issues [[paper]](http://hongyujohn.github.io/iDice.pdf)

## Log

