Fault Tolerance

Content

Cloud System

Failure Analysis

  • 24_DSN_Mutiny! How does Kubernetes fail, and what can we do about it? [paper] [code]
  • 23_Eurosys_Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems [paper] [code]
  • 22_ICSE-SEIP_An Empirical Study on Change-induced Incidents of Online Service Systems [paper]
  • 22_SoCC_How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service [paper] Awarded Best Paper! 👍
  • 22_ISSRE_Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling [paper] [code]
  • 21_SOSP_Understanding and Detecting Software Upgrade Failures in Distributed Systems [paper]
  • 21_DSN_Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes [paper] [data]
  • 20_FSE_Towards Intelligent Incident Management: Why We Need It and How We Make It [paper]
  • 19_ICSE-SEIP_An Empirical Investigation of Incident Triage for Online Service Systems [paper]
  • 18_OSDI_An Analysis of Network-Partitioning Failures in Cloud Systems [paper]
  • 18_FAST_Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems [paper]
  • 17_TPDS_Failure Diagnosis for Distributed Systems Using Targeted Fault Injection [paper]
  • 16_SIGCOMM_Taking the Blame Game out of Data Centers Operations with NetPoirot [paper]
  • 13_SoCC_Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems [paper]

Fault Injection

  • 24_Neural Fault Injection: Generating Software Faults from Natural Language [paper]
  • 24_S&P_Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay [paper] [code]
  • 24_TDSC_MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications [paper]
  • 23_SOSP_Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management [paper] [code]
  • 23_CCS_Phoenix: Detect and Locate Resilience Issues in Blockchain via Context-Sensitive Chaos [paper]
  • 23_Failure Identification Using Model-Implemented Fault Injection with Domain Knowledge-Guided Reinforcement Learning [paper]
  • 23_ICSE_Coverage Guided Fault Injection for Cloud Systems [paper] [code]
  • 23_NSDI_Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker [paper] [code]
  • 23_Eurosys_Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems [paper] [code]
  • 22_OSDI_Automatic Reliability Testing for Cluster Management Controllers [paper] [code]
  • 22_FSE_IBIR: Bug Report driven Fault Injection [paper] [code]
  • 22_ISSRE_SlowCoach: Mutating Code to Simulate Performance Bugs [paper]
  • 22_SBES_Towards a Fault Taxonomy for Microservices-Based Applications [paper]
  • 21_PPoPP_Understanding a Program’s Resiliency Through Error Propagation [paper]
  • 20_ASE_CoFI: Consistency-Guided Fault Injection for Cloud Systems [paper] [code]
  • 20_ISSRE_How Far Have We Come in Detecting Anomalies in Distributed Systems? An Empirical Study with a Statement-level Fault Injection Method [paper]
  • 20_DSN_ProFIPy: Programmable Software Fault Injection as-a-Service [paper]
  • 20_ICWS_Fitness-guided Resilience Testing of Microservice-based Applications [paper]
  • 19_HotCloud_Co-evolving Tracing and Fault Injection with Box of Pain [paper]
  • 19_ChaosRCA_Observability and Chaos Engineering on System [paper]
  • 19_Chaos_TripleAgent: Monitoring, Perturbation and Failure-obliviousness for Automated Resilience Improvement in Java Applications [paper]
  • 18_Chaos_A Program-Aware Fault-Injection Method for Dependability Evaluation Against Soft-Error Using Genetic Algorithm [paper]
  • 18_TDSC_Faultprog: Testing the Accuracy of Binary-Level Software Fault Injection [paper]
  • 16_SoCC_Automating Failure Testing Research at Internet Scale [paper]
  • 16_Survey_Assessing Dependability with Software Fault Injection: A Survey [paper]
  • 15_SIGMOD_Lineage-driven Fault Injection [paper]

Fault Recovery

  • 24_SoCC_Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure [paper]
  • 24_Eurosys_Atlas: Hybrid Cloud Migration Advisor for Interactive Microservices [paper]
  • 22_KDD_NENYA: Cascade Reinforcement Learning for Cost-Aware Failure Mitigation at Microsoft 365 [paper]
  • 22_SoCC_Method Overloading the Circuit [paper] [video]
  • 21_DSN_FIRestarter: Practical Software Crash Recovery with Targeted Library-level Fault Injection [paper] [code]
  • 20_OSDI_Narya: Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions [paper]

AI System

Failure Analysis

Mixed layer

  • 23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]
  • 23_TOSEM_Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective [paper]
  • 22_FSE_Understanding Performance Problems in Deep Learning Systems [paper] [code]
  • 20_ICSE_An Empirical Study on Program Failures of Deep Learning Jobs [paper]
  • 19_Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads [paper]

Service

  • 21_EDBT_JENGA - A Framework to Study the Impact of Data Errors on the Predictions of Machine Learning Models [paper]
  • 23_QRS_Online Data Drift Detection for Anomaly Detection Services based on Deep Learning towards Multivariate Time Series [paper]
  • 22_EMNLP_On the Impact of Temporal Concept Drift on Model Explanations [paper]
  • 19_IOP_An Overview of Overfitting and its Solutions [paper]
  • 23_How is ChatGPT’s behavior changing over time? [paper]
  • 22_ISSRE_LLTFI: Framework Agnostic Fault Injection for Machine Learning Applications (Tools and Artifact Track) [paper] [code]
  • 23_MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications [paper]
  • 22_Systematic literature review on software quality for AI-based software [paper]
  • 21_SWQD_Software Quality for AI: Where We Are Now? [paper]
  • 21_ICSE_Are Machine Learning Cloud APIs Used Correctly? [paper]
  • 23_AI for Cybersecurity: A Study on Machine Learning and DoS Attacks AI Robustness and Bypassing Detection Methods [paper]
  • 23_Tricking LLMs into disobedience: Understanding, analyzing, and preventing jailbreaks [paper]
  • 21_Advances in Adversarial Attacks and Defenses in Computer Vision: A Survey [paper]
  • 24_AAAI_Visual Adversarial Examples Jailbreak Aligned Large Language Models [paper]

Model

  • 22_Faults in deep reinforcement learning programs: a taxonomy and a detection approach [paper]
  • 23_ICSE_Data Quality for Software Vulnerability Datasets [paper]
  • 22_Author Correction: Advances, challenges and opportunities in creating data for trustworthy AI [paper]
  • 22_A review: Data pre-processing and data augmentation techniques [paper]
  • 23_VLDB_Data collection and quality challenges in deep learning: a data-centric AI perspective [paper]
  • 20_Impact of fully connected layers on performance of convolutional neural networks for image classification [paper]
  • 20_INMIC_Effects of hidden layers on the efficiency of neural networks [paper]
  • 24_ShortGPT: Layers in Large Language Models are More Redundant Than You Expect [paper]
  • 23_Finding Neurons in a Haystack: Case Studies with Sparse Probing [paper]
  • 19_NeurIPS_Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence [paper]
  • 23_Exploring the Relationship Between Learning Rate, Batch Size, and Epochs in Deep Learning: An Experimental Study [paper]
  • 23_A Comprehensive Overview and Comparative Analysis on Deep Learning Models: CNN, RNN, LSTM, GRU [paper]
  • 22_Activation functions in deep learning: A comprehensive survey and benchmark [paper]
  • 22_A comprehensive survey on regularization strategies in machine learning [paper]
  • 21_Comparison of optimization techniques based on gradient descent algorithm: A review [paper]
  • 20_A comprehensive survey of loss functions in machine learning [paper]
  • 22_Ideal dataset splitting ratios in machine learning algorithms: general concerns for data scientists and data analysts [paper]

Framework

  • 24_ESE_Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow [paper] [code]
  • 23_ICPC_Understanding Bugs in Multi-Language Deep Learning Frameworks [paper]
  • 23_TOSEM_Toward Understanding Deep Learning Framework Bugs [paper]
  • 22_ICSME_An Empirical Study on Performance Bugs in Deep Learning Frameworks [paper]
  • 20_ICSE_Taxonomy of Real Faults in Deep Learning Systems [paper]
  • 19_FSE_A Comprehensive Study on Deep Learning Bug Characteristics [paper]
  • 18_FSE_An Empirical Study on TensorFlow Program Bugs [paper]
  • 24_NSDI_Characterization of Large Language Model Development in the Datacenter [paper]
  • 22_INFSOF_A comprehensive empirical study on bug characteristics of deep learning frameworks [paper]
  • 22_ASE_Towards Understanding the Faults of JavaScript-Based Deep Learning Systems [paper]
  • 20_ICSE_An empirical study on program failures of deep learning jobs [paper]
  • 12_ISSRE_An Empirical Study of Bugs in Machine Learning Systems [paper]
  • 23_TCAD_Statistical Modeling of Soft Error Influence on Neural Networks [paper]
  • 23_Towards efficient generative large language model serving: A survey from algorithms to systems [paper]
  • 20_DFT_A Pipelined Multi-Level Fault Injector for Deep Neural Networks [paper]
  • 23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]

Toolkits

  • 13_ESOP_Interleaving and lock-step semantics for analysis and verification of GPU kernels [paper]
  • 19_ASE_Automating CUDA Synchronization via Program Transformation [paper]
  • 18_ISSRE_Bugaroo: Exposing Memory Model Bugs in Many-Core Systems [paper]
  • 19_SP_SoK: Sanitizing for Security [paper]
  • 23_PLDI_cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications [paper]
  • 23_FSE_Demystifying Dependency Bugs in Deep Learning Stack [paper]
  • 23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]
  • 24_NSDI_Characterization of Large Language Model Development in the Datacenter [paper]
  • 19_ATC_Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads [paper]
  • 14_PLDI_Accurate application progress analysis for large-scale parallel debugging [paper]
  • 04_SC_Assessing Fault Sensitivity in MPI Applications [paper]
  • 16_TECS_CUDA Leaks: A Detailed Hack for CUDA and a (Partial) Fix [paper]
  • 18_PLDI_CURD: a dynamic CUDA race detector [paper]
  • 17_JSA_Evaluation of transient errors in GPGPUs for safety critical applications: An effective simulation-based fault injection environment [paper]
  • 14_DAC_Exploring the Heterogeneous Design Space for both Performance and Reliability [paper]
  • 14_LATW_Implementation and experimental evaluation of a CUDA core under single event effects [paper]
  • 17_PVM_What does fault tolerant deep learning need from MPI? [paper]

Platform

  • 21_ACCESS_Diaspore: Diagnosing Performance Interference in Apache Spark [paper]
  • 15_HPCC-CSS-ICESS_Performance Prediction for Apache Spark Platform [paper]
  • 17_ICWS_Log-based Abnormal Task Detection and Root Cause Analysis for Spark [paper]
  • 23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]

Infrastructure

  • 21_DSN_Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes [paper]
  • 20_SC_GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability [paper]
  • 17_SC_Failures in Large Scale Systems: Long-Term Measurement, Analysis, and Implications [paper]
  • 15_SC_Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility [paper]
  • 16_HPCA_A large-scale study of soft-errors on GPUs in the field [paper]
  • 17_MASCOTS_Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities [paper]
  • 15_HPCA_Understanding GPU errors on large-scale HPC systems and the implications for system design and operation [paper]
  • 14_DSN_Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters [paper]
  • 20_ICSE_An empirical study on program failures of deep learning jobs [paper]
  • 23_SYSTOR_Predicting GPU Failures With High Precision Under Deep Learning Workloads [paper]
  • 14_LISAT_Reliability and fault tolerance analysis of FPGA platforms [paper]
  • 16_J_NET_Field Programmable Gate Array Reliability Analysis Using the Dynamic Flowgraph Methodology [paper]
  • 10_TII_Component-Based Safety Analysis of FPGAs [paper]
  • 21_TVLSI_Reliability Evaluation and Analysis of FPGA-Based Neural Network Acceleration System [paper]
  • 24_TNS_Tensor Processing Unit Reliability Dependence on Temperature and Radiation Source [paper]
  • 22_RADECS_Sensitivity of Google’s Tensor Processing Units to High-Energy, Mono-Energetic, and Thermal Neutrons [paper]
  • 22_DATE_Reliability of Google's Tensor Processing Units for Embedded Applications [paper]

Fault Injection

  • 22_SANER_How Do Injected Bugs Affect Deep Learning? [paper] [code]

Service

  • 24_ICSE_Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [paper]
  • 15_ICLR_Explaining and Harnessing Adversarial Examples [paper] [code]
  • 18_ICLR_Towards Deep Learning Models Resistant to Adversarial Attacks [paper] [code]
  • 22_DI-AA: An interpretable white-box attack for fooling deep neural networks [paper]
  • 20_SP_HopSkipJumpAttack: A Query-Efficient Decision-Based Attack [paper] [code]
  • 21_ICML_PopSkipJump: Decision-Based Attack for Probabilistic Classifiers [paper] [code]
  • 20_CVPR_GeoDA: A Geometric Framework for Black-Box Adversarial Attacks [paper] [code]
  • 23_I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models [paper]
  • 23_Promptaid: Prompt exploration, perturbation, testing and iteration using visual analytics for large language models [paper]
  • 23_Prompt Injection attack against LLM-integrated Applications [paper]
  • 24_Goal-guided Generative Prompt Injection Attack on Large Language Models [paper]
  • 20_DSN_PyTorchFI: A Runtime Perturbation Tool for DNNs [paper] [code]
  • 18_ISSRE_TensorFI: A Configurable Fault Injector for TensorFlow Applications [paper] [code]
  • 21_DeepTest_TF-DM: Tool for Studying ML Model Resilience to Data Faults [paper] [code]
  • 21_ISSREW_MindFI: A Fault Injection Tool for Reliability Assessment of MindSpore Applications [paper]
  • 18_DAC_Ares: a framework for quantifying the resilience of deep neural networks [paper] [code]
  • 19_SC_BinFI: an efficient fault injector for safety-critical machine learning systems [paper] [code]

Model

  • 18_ISSRE_DeepMutation: Mutation Testing of Deep Learning Systems [paper] [code]
  • 19_ASE_DeepMutation++: A Mutation Testing Framework for Deep Learning Systems [paper]
  • 18_QRS_MuNN: Mutation Analysis of Neural Networks [paper]
  • 21_ISSTA_DeepCrime: Mutation Testing of Deep Learning Systems Based on Real Faults [paper] [code]
  • 22_Towards mutation testing of Reinforcement Learning systems [paper]
  • 23_ICST_Mutation Testing of Deep Reinforcement Learning Based on Real Faults [paper]
  • 22_SETTA_MTUL: Towards Mutation Testing of Unsupervised Learning Systems [paper]

Framework

Toolkits

  • 15_Cluster_Fast Fault Injection and Sensitivity Analysis for Collective Communications [paper]
  • 20_ICSE_Simulee: detecting CUDA synchronization bugs via memory-access modeling [paper] [code]
  • 20_COMPSAC_CUDAsmith: A Fuzzer for CUDA Compilers [paper] [code]
  • 15_SIGPLAN_Many-core compiler fuzzing [paper] [code]
  • 00_IPDPS_FIMD-MPI: a tool for injecting faults into MPI application [paper]

Platform

  • 22_STVR_TRANSMUT-Spark: Transformation mutation for Apache Spark [paper]
  • 23_FSE_Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics [paper]

Infrastructure

  • 17_ISPASS_SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation [paper]
  • 14_ISPASS_GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications [paper]
  • 21_DSN_NVBitFI: Dynamic Fault Injection for GPUs [paper]
  • 16_SC_Understanding Error Propagation in GPGPU Applications [paper]
  • 17_SC_Understanding error propagation in deep learning neural network (DNN) accelerators and applications [paper]
  • 17_HPCA_Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators [paper]
  • 18_SBAC-PAD_On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation [paper]
  • 12_ISCA_A defect-tolerant accelerator for emerging high-performance applications [paper]
  • 21_VTS_Combining Architectural Simulation and Software Fault Injection for a Fast and Accurate CNNs Reliability Evaluation on GPUs [paper]
  • 02_DFT_Using run-time reconfiguration for fault injection in hardware prototypes [paper]
  • 12_DFT_Fast single-FPGA fault injection platform [paper]
  • 14_J.MICROEL_A fast, flexible, and easy-to-develop FPGA-based fault injection technique [paper]
  • 01_ATS_FPGA-based fault injection for microprocessor systems [paper]
  • 20_Design&Test_Enabling Timing Error Resilience for Low-Power Systolic-Array Based Deep Learning Accelerators [paper]
  • 21_ATS_GPU-Accelerated Timing Simulation of Systolic-Array-Based AI Accelerators [paper]
  • Lightning: Leveraging DVFS-induced Transient Fault Injection to Attack Deep Learning Accelerator of GPUs [paper]