Fault Tolerance

Content

Fault Tolerance
- Cloud System
- AI System
  - Failure Analysis
    - Mixed layer
    - Service
    - Model
    - Framework
    - Toolkits
    - Platform
    - Infrastructure
  - Fault Injection
    - Service
    - Model
    - Framework
    - Toolkits
    - Platform
    - Infrastructure

Cloud System

Failure Analysis

24_DSN_Mutiny! How does Kubernetes fail, and what can we do about it? [paper] [code]

23_EuroSys_Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems [paper] [code]
22_ICSE-SEIP_An Empirical Study on Change-induced Incidents of Online Service Systems [paper]
22_SoCC_How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service [paper] Awarded Best Paper! 👍
22_ISSRE_Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling [paper] [code]
21_SOSP_Understanding and Detecting Software Upgrade Failures in Distributed Systems [paper]
21_DSN_Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes [paper] [data]
20_FSE_Towards Intelligent Incident Management: Why We Need It and How We Make It [paper]
19_ICSE-SEIP_An Empirical Investigation of Incident Triage for Online Service Systems [paper]
18_OSDI_An Analysis of Network-Partitioning Failures in Cloud Systems [paper]
18_FAST_Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems [paper]
17_TPDS_Failure Diagnosis for Distributed Systems Using Targeted Fault Injection [paper]
16_SIGCOMM_Taking the Blame Game out of Data Centers Operations with NetPoirot [paper]
13_SoCC_Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems [paper]

Fault Injection

24_Neural Fault Injection: Generating Software Faults from Natural Language [paper]
24_S&P_Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay [paper] [code]
24_TDSC_MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications [paper]
23_SOSP_Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management [paper] [code]
23_CCS_Phoenix: Detect and Locate Resilience Issues in Blockchain via Context-Sensitive Chaos [paper]
23_Failure Identification Using Model-Implemented Fault Injection with Domain Knowledge-Guided Reinforcement Learning [paper]
23_ICSE_Coverage Guided Fault Injection for Cloud Systems [paper] [code]
23_NSDI_Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker [paper] [code]
23_EuroSys_Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems [paper] [code]
22_OSDI_Automatic Reliability Testing for Cluster Management Controllers [paper] [code]
22_FSE_IBIR: Bug Report driven Fault Injection [paper] [code]
22_ISSRE_SlowCoach: Mutating Code to Simulate Performance Bugs [paper]
22_SBES_Towards a Fault Taxonomy for Microservices-Based Applications [paper]
21_PPoPP_Understanding a Program’s Resiliency Through Error Propagation [paper]
20_ASE_CoFI: Consistency-Guided Fault Injection for Cloud Systems [paper] [code]
20_ISSRE_How Far Have We Come in Detecting Anomalies in Distributed Systems? An Empirical Study with a Statement-level Fault Injection Method [paper]
20_DSN_ProFIPy: Programmable Software Fault Injection as-a-Service [paper]
20_ICWS_Fitness-guided Resilience Testing of Microservice-based Applications [paper]
19_HotCloud_Co-evolving Tracing and Fault Injection with Box of Pain [paper]
19_ChaosRCA_Observability and Chaos Engineering on System [paper]
19_Chaos_TRIPLE AGENT- Monitoring, Perturbation and Failure-obliviousness for Automated Resilience Improvement in Java Applications [paper]
18_Chaos_A Program-Aware Fault-Injection Method for Dependability Evaluation Against Soft-Error Using Genetic Algorithm [paper]
18_TDSC_Faultprog: Testing the Accuracy of Binary-Level Software Fault Injection [paper]
16_SoCC_Automating Failure Testing Research at Internet Scale [paper]
16_Survey_Assessing Dependability with Software Fault Injection: A Survey [paper]
15_SIGMOD_Lineage-driven Fault Injection [paper]

Fault Recovery

24_SoCC_Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure [paper]
24_EuroSys_Atlas: Hybrid Cloud Migration Advisor for Interactive Microservices [paper]
22_KDD_NENYA: Cascade Reinforcement Learning for Cost-Aware Failure Mitigation at Microsoft 365 [paper]
22_SoCC_Method Overloading the Circuit [paper] [video]
21_DSN_FIRestarter: Practical Software Crash Recovery with Targeted Library-level Fault Injection [paper] [code]
20_OSDI_Narya: Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions [paper]

AI System

Failure Analysis

Mixed layer

23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]
23_TOSEM_Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective [paper]
22_FSE_Understanding Performance Problems in Deep Learning Systems [paper] [code]
20_ICSE_An Empirical Study on Program Failures of Deep Learning Jobs [paper]
19_ATC_Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads [paper]

Service

21_EDBT_JENGA - A Framework to Study the Impact of Data Errors on the Predictions of Machine Learning Models [paper]
23_QRS_Online Data Drift Detection for Anomaly Detection Services based on Deep Learning towards Multivariate Time Series [paper]
22_EMNLP_On the Impact of Temporal Concept Drift on Model Explanations [paper]
19_IOP_An Overview of Overfitting and its Solutions [paper]
23_How is ChatGPT’s behavior changing over time? [paper]
22_ISSRE_LLTFI: Framework Agnostic Fault Injection for Machine Learning Applications (Tools and Artifact Track) [paper] [code]
23_MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications [paper]
22_Systematic literature review on software quality for AI-based software [paper]
21_SWQD_Software Quality for AI: Where We Are Now? [paper]
21_ICSE_Are Machine Learning Cloud APIs Used Correctly? [paper]
23_AI for Cybersecurity: A Study on Machine Learning and DoS Attacks AI Robustness and Bypassing Detection Methods [paper]
23_Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks [paper]
21_Advances in Adversarial Attacks and Defenses in Computer Vision: A Survey [paper]
24_AAAI_Visual Adversarial Examples Jailbreak Aligned Large Language Models [paper]

Model

22_Faults in deep reinforcement learning programs: a taxonomy and a detection approach [paper]
23_ICSE_Data Quality for Software Vulnerability Datasets [paper]
22_Author Correction: Advances, challenges and opportunities in creating data for trustworthy AI [paper]
22_A review: Data pre-processing and data augmentation techniques [paper]
23_VLDB_Data collection and quality challenges in deep learning: a data-centric AI perspective [paper]
20_Impact of fully connected layers on performance of convolutional neural networks for image classification [paper]
20_INMIC_Effects of hidden layers on the efficiency of neural networks [paper]
24_ShortGPT: Layers in Large Language Models are More Redundant Than You Expect [paper]
23_Finding Neurons in a Haystack: Case Studies with Sparse Probing [paper]
19_NeurIPS_Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence [paper]
23_Exploring the Relationship Between Learning Rate, Batch Size, and Epochs in Deep Learning: An Experimental Study [paper]
23_A Comprehensive Overview and Comparative Analysis on Deep Learning Models: CNN, RNN, LSTM, GRU [paper]
22_Activation functions in deep learning: A comprehensive survey and benchmark [paper]
22_A comprehensive survey on regularization strategies in machine learning [paper]
21_Comparison of optimization techniques based on gradient descent algorithm: A review [paper]
20_A comprehensive survey of loss functions in machine learning [paper]
22_Ideal dataset splitting ratios in machine learning algorithms: general concerns for data scientists and data analysts [paper]

Framework

24_ESE_Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow [paper] [code]
23_ICPC_Understanding Bugs in Multi-Language Deep Learning Frameworks [paper]
23_TOSEM_Toward Understanding Deep Learning Framework Bugs [paper]
22_ICSME_An Empirical Study on Performance Bugs in Deep Learning Frameworks [paper]
20_ICSE_Taxonomy of Real Faults in Deep Learning Systems [paper]
19_FSE_A Comprehensive Study on Deep Learning Bug Characteristics [paper]
18_FSE_An Empirical Study on TensorFlow Program Bugs [paper]
24_NSDI_Characterization of Large Language Model Development in the Datacenter [paper]
22_INFSOF_A comprehensive empirical study on bug characteristics of deep learning frameworks [paper]
22_ASE_Towards Understanding the Faults of JavaScript-Based Deep Learning Systems [paper]
20_ICSE_An empirical study on program failures of deep learning jobs [paper]
12_ISSRE_An Empirical Study of Bugs in Machine Learning Systems [paper]
23_TCAD_Statistical Modeling of Soft Error Influence on Neural Networks [paper]
23_Towards efficient generative large language model serving: A survey from algorithms to systems [paper]
20_DFT_A Pipelined Multi-Level Fault Injector for Deep Neural Networks [paper]
23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]

Toolkits

13_ESOP_Interleaving and lock-step semantics for analysis and verification of GPU kernels [paper]
19_ASE_Automating CUDA Synchronization via Program Transformation [paper]
18_ISSRE_Bugaroo: Exposing Memory Model Bugs in Many-Core Systems [paper]
19_SP_SoK: Sanitizing for Security [paper]
23_PLDI_cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications [paper]
23_FSE_Demystifying Dependency Bugs in Deep Learning Stack [paper]
23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]
24_NSDI_Characterization of Large Language Model Development in the Datacenter [paper]
19_ATC_Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads [paper]
14_PLDI_Accurate application progress analysis for large-scale parallel debugging [paper]
04_SC_Assessing Fault Sensitivity in MPI Applications [paper]
16_TECS_CUDA Leaks: A Detailed Hack for CUDA and a (Partial) Fix [paper]
18_PLDI_CURD: a dynamic CUDA race detector [paper]
17_JSA_Evaluation of transient errors in GPGPUs for safety critical applications: An effective simulation-based fault injection environment [paper]
14_DAC_Exploring the Heterogeneous Design Space for both Performance and Reliability [paper]
14_LATW_Implementation and experimental evaluation of a CUDA core under single event effects [paper]
17_PVM_What does fault tolerant deep learning need from MPI? [paper]

Platform

21_ACCESS_Diaspore: Diagnosing Performance Interference in Apache Spark [paper]
15_HPCC-CSS-ICESS_Performance Prediction for Apache Spark Platform [paper]
17_ICWS_Log-based Abnormal Task Detection and Root Cause Analysis for Spark [paper]
23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]

Infrastructure

21_DSN_Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes [paper]
20_SC_GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability [paper]
17_SC_Failures in Large Scale Systems: Long-Term Measurement, Analysis, and Implications [paper]
15_SC_Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility [paper]
16_HPCA_A large-scale study of soft-errors on GPUs in the field [paper]
17_MASCOTS_Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities [paper]
15_HPCA_Understanding GPU errors on large-scale HPC systems and the implications for system design and operation [paper]
14_DSN_Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters [paper]
20_ICSE_An empirical study on program failures of deep learning jobs [paper]
23_SYSTOR_Predicting GPU Failures With High Precision Under Deep Learning Workloads [paper]
14_LISAT_Reliability and fault tolerance analysis of FPGA platforms [paper]
16_J_NET_Field Programmable Gate Array Reliability Analysis Using the Dynamic Flowgraph Methodology [paper]
10_TII_Component-Based Safety Analysis of FPGAs [paper]
21_TVLSI_Reliability Evaluation and Analysis of FPGA-Based Neural Network Acceleration System [paper]
24_TNS_Tensor Processing Unit Reliability Dependence on Temperature and Radiation Source [paper]
22_RADECS_Sensitivity of Google’s Tensor Processing Units to High-Energy, Mono-Energetic, and Thermal Neutrons [paper]
22_DATE_Reliability of Google's Tensor Processing Units for Embedded Applications [paper]

Fault Injection

22_SANER_How Do Injected Bugs Affect Deep Learning? [paper] [code]

Service

24_ICSE_Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [paper]
15_ICLR_Explaining and Harnessing Adversarial Examples [paper] [code]
18_ICLR_Towards Deep Learning Models Resistant to Adversarial Attacks [paper] [code]
22_DI-AA: An interpretable white-box attack for fooling deep neural networks [paper]
20_SP_HopSkipJumpAttack: A Query-Efficient Decision-Based Attack [paper] [code]
21_ICML_PopSkipJump: Decision-Based Attack for Probabilistic Classifiers [paper] [code]
20_CVPR_GeoDA: A Geometric Framework for Black-Box Adversarial Attacks [paper] [code]
23_I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models [paper]
23_Promptaid: Prompt exploration, perturbation, testing and iteration using visual analytics for large language models [paper]
23_Prompt Injection attack against LLM-integrated Applications [paper]
24_Goal-guided Generative Prompt Injection Attack on Large Language Models [paper]
20_DSN_PyTorchFI: A Runtime Perturbation Tool for DNNs [paper] [code]
18_ISSRE_TensorFI: A Configurable Fault Injector for TensorFlow Applications [paper] [code]
21_DeepTest_TF-DM: Tool for Studying ML Model Resilience to Data Faults [paper] [code]
21_ISSREW_MindFI: A Fault Injection Tool for Reliability Assessment of MindSpore Applications [paper]
18_DAC_Ares: a framework for quantifying the resilience of deep neural networks [paper] [code]
19_SC_BinFI: an efficient fault injector for safety-critical machine learning systems [paper] [code]

Model

18_ISSRE_DeepMutation: Mutation Testing of Deep Learning Systems [paper] [code]
19_ASE_DeepMutation++: A Mutation Testing Framework for Deep Learning Systems [paper]
18_QRS_MuNN: Mutation Analysis of Neural Networks [paper]
21_ISSTA_DeepCrime: Mutation Testing of Deep Learning Systems Based on Real Faults [paper] [code]
22_Towards mutation testing of Reinforcement Learning systems [paper]
23_ICST_Mutation Testing of Deep Reinforcement Learning Based on Real Faults [paper]
22_SETTA_MTUL: Towards Mutation Testing of Unsupervised Learning Systems [paper]

Framework

20_ISSRE_TensorFI: Flexible Fault Injection Framework for TensorFlow Applications [paper] [code]
22_TDSC_Fault Injection for TensorFlow Applications [paper] [code]
20_Fault Injectors for TensorFlow: Evaluation of the Impact of Random Hardware Faults on Deep CNNs [paper] [code]
20_LASCAS_Reliability Evaluation of Compressed Deep Learning Models [paper] [code]
20_DSN_PyTorchFI: A Runtime Perturbation Tool for DNNs [paper] [code]
22_TR_SNIFF: Reverse Engineering of Neural Networks With Fault Attacks [paper]
21_ISSREW_MindFI: A Fault Injection Tool for Reliability Assessment of MindSpore Applications [paper]
22_IROS_enpheeph: A Fault Injection Framework for Spiking and Compressed Deep Neural Networks [paper] [code]
23_TC_Fast and Accurate Error Simulation for CNNs Against Soft Errors [paper]
17_ICCAD_Fault injection attack on deep neural network [paper]
22_ISSRE_LLTFI: Framework Agnostic Fault Injection for Machine Learning Applications (Tools and Artifact Track) [paper] [code]
23_ETS_SCI-FI: a Smart, aCcurate and unIntrusive Fault-Injector for Deep Neural Networks [paper]

Toolkits

15_Cluster_Fast Fault Injection and Sensitivity Analysis for Collective Communications [paper]
20_ICSE_Simulee: detecting CUDA synchronization bugs via memory-access modeling [paper] [code]
20_COMPSAC_CUDAsmith: A Fuzzer for CUDA Compilers [paper] [code]
15_SIGPLAN_Many-core compiler fuzzing [paper] [code]
00_IPDPS_FIMD-MPI: a tool for injecting faults into MPI application [paper]

Platform

22_STVR_TRANSMUT-Spark: Transformation mutation for Apache Spark [paper]
23_FSE_Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics [paper]

Infrastructure

17_ISPASS_SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation [paper]
14_ISPASS_GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications [paper]
21_DSN_NVBitFI: Dynamic Fault Injection for GPUs [paper]
16_SC_Understanding Error Propagation in GPGPU Applications [paper]
17_SC_Understanding error propagation in deep learning neural network (DNN) accelerators and applications [paper]
17_HPCA_Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators [paper]
18_SBAC-PAD_On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation [paper]
12_ISCA_A defect-tolerant accelerator for emerging high-performance applications [paper]
21_VTS_Combining Architectural Simulation and Software Fault Injection for a Fast and Accurate CNNs Reliability Evaluation on GPUs [paper]
02_DFT_Using run-time reconfiguration for fault injection in hardware prototypes [paper]
12_DFT_Fast single-FPGA fault injection platform [paper]
14_J.MICROEL_A fast, flexible, and easy-to-develop FPGA-based fault injection technique [paper]
01_ATS_FPGA-based fault injection for microprocessor systems [paper]
20_Design&Test_Enabling Timing Error Resilience for Low-Power Systolic-Array Based Deep Learning Accelerators [paper]
21_ATS_GPU-Accelerated Timing Simulation of Systolic-Array-Based AI Accelerators [paper]
Lightning: Leveraging DVFS-induced Transient Fault Injection to Attack Deep Learning Accelerator of GPUs [paper]
