From 91747e7b8965067d1beba7e372f0c8c0a9b5c020 Mon Sep 17 00:00:00 2001
From: yuxiaoba <444723257@qq.com>
Date: Mon, 2 Dec 2024 09:55:49 +0800
Subject: [PATCH] update 2024.12.2

---
 LLM Training/README.md        | 7 ++++++-
 Root_cause_analysis/README.md | 5 +++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/LLM Training/README.md b/LLM Training/README.md
index 654cfc3..3309fe7 100644
--- a/LLM Training/README.md
+++ b/LLM Training/README.md
@@ -4,6 +4,7 @@
 - 24_NSDI_Characterization of Large Language Model Development in the Datacenter [[paper]](https://www.usenix.org/system/files/nsdi24-hu.pdf)
 - 24_Revisiting Reliability in Large-Scale Machine Learning Research Clusters [[paper]](Revisiting Reliability in Large-Scale Machine Learning Research Clusters)
 - 24_ATC_SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation [[paper]](https://arxiv.org/abs/2402.06194) [Microsoft] [Best Paper]
+- 24_ATC_Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field [[paper]](https://www.usenix.org/conference/atc24/presentation/wu-ronglong)
 - 24_NSDI_MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs [[paper]](https://www.usenix.org/system/files/nsdi24-jiang-ziheng.pdf)
 - 24_The Llama 3 Herd of Models [[paper]](https://ai.meta.com/research/publications/the-llama-3-herd-of-models/) [Meta]
 > During a 54-day snapshot period of pre-training, we experienced a total of 466 job interruptions.
@@ -13,4 +14,8 @@

 ## Checkpoint
 - 24_Eurosys_Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures [[paper]](https://dl.acm.org/doi/pdf/10.1145/3627703.3650085)
- > Most errors during training occur due to failures of a single GPU or network device (either hardware or transient errors), while host/CPU and simultaneous multi-node failures are extremely rare
\ No newline at end of file
+ > Most errors during training occur due to failures of a single GPU or network device (either hardware or transient errors), while host/CPU and simultaneous multi-node failures are extremely rare
+
+
+## Schedule
+24_SC_PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters [[paper]](https://arxiv.org/abs/2408.11919)
\ No newline at end of file
diff --git a/Root_cause_analysis/README.md b/Root_cause_analysis/README.md
index 9cb62a8..669f183 100644
--- a/Root_cause_analysis/README.md
+++ b/Root_cause_analysis/README.md
@@ -32,11 +32,12 @@
 - 20_VLDB_VLDB_Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases [[paper]](http://www.vldb.org/pvldb/vol13/p1176-ma.pdf) [[code]](https://github.com/NetManAIOps/DejaVu/blob/master/iSQUAD/iSQ.py)
 - 20_IWQoS_Localizing Failure Root Causes in a Microservice through Causality Inference [[paper]](https://ieeexplore.ieee.org/document/9213058)
 - 20_NOMS_MicroRCA: Root Cause Localization of Performance Issues in Microservices [[paper]](https://ieeexplore.ieee.org/document/9110353) [[code]](https://github.com/elastisys/MicroRCA)
-- 20_ICSE_Debugging Crashes using Continuous Contrast Set Mining
-- 16_ICSE_iDice: Problem Identification for Emerging Issues [[paper]](http://hongyujohn.github.io/iDice.pdf)
+- 20_ICSE_Debugging Crashes using Continuous Contrast Set Mining [[paper]](https://arxiv.org/abs/1911.04768)
+- 19_WWW_ϵ-Diagnosis: Unsupervised and Real-time Diagnosis of Smallwindow Long-tail Latency in Large-scale Microservice Platforms [[paper]](https://monadyn.github.io/Papers/p3215-shan.pdf) [[code]](https://github.com/salesforce/PyRCA)
 - 18_ICSOC_Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments [[paper]](https://link.springer.com/chapter/10.1007/978-3-030-03596-9_1)
 - 18_OSDI_Orca: Differential Bug Localization in Large-Scale Services [[paper]](https://www.usenix.org/conference/osdi18/presentation/bhagwan) Awarded Best Paper! 👍
 - 14_NSDI_Adtributor: Revenue Debugging in Advertising Systems [[paper]](https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-bhagwan.pdf)
+- 16_ICSE_iDice: Problem Identification for Emerging Issues [[paper]](http://hongyujohn.github.io/iDice.pdf)

 ## Log