Content
- 24_DSN_Mutiny! How does Kubernetes fail, and what can we do about it? [paper] [code]
- 23_Eurosys_Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems [paper] [code]
- 22_ICSE-SEIP_An Empirical Study on Change-induced Incidents of Online Service Systems [paper]
- 22_SoCC_How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service [paper] Awarded Best Paper! 👍
- 22_ISSRE_Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling [paper] [code]
- 21_SOSP_Understanding and Detecting Software Upgrade Failures in Distributed Systems [paper]
- 21_DSN_Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes [paper] [data]
- 20_FSE_Towards Intelligent Incident Management: Why We Need It and How We Make It [paper]
- 19_ICSE-SEIP_An Empirical Investigation of Incident Triage for Online Service Systems [paper]
- 18_OSDI_An Analysis of Network-Partitioning Failures in Cloud Systems [paper]
- 18_FAST_Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems [paper]
- 17_TPDS_Failure Diagnosis for Distributed Systems Using Targeted Fault Injection [paper]
- 16_SIGCOMM_Taking the Blame Game out of Data Centers Operations with NetPoirot [paper]
- 13_SoCC_Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems [paper]
- 24_Neural Fault Injection: Generating Software Faults from Natural Language [paper]
- 24_S&P_Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay [paper] [code]
- 24_TDSC_MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications [paper]
- 23_SOSP_Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management [paper] [code]
- 23_CCS_Phoenix: Detect and Locate Resilience Issues in Blockchain via Context-Sensitive Chaos [paper]
- 23_Failure Identification Using Model-Implemented Fault Injection with Domain Knowledge-Guided Reinforcement Learning [paper]
- 23_ICSE_Coverage Guided Fault Injection for Cloud Systems [paper] [code]
- 23_NSDI_Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker [paper] [code]
- 23_Eurosys_Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems [paper] [code]
- 22_OSDI_Automatic Reliability Testing for Cluster Management Controllers [paper] [code]
- 22_FSE_IBIR: Bug Report driven Fault Injection [paper] [code]
- 22_ISSRE_SlowCoach: Mutating Code to Simulate Performance Bugs [paper]
- 22_SBES_Towards a Fault Taxonomy for Microservices-Based Applications [paper]
- 21_PPoPP_Understanding a Program’s Resiliency Through Error Propagation [paper]
- 20_ASE_CoFI: Consistency-Guided Fault Injection for Cloud Systems [paper] [code]
- 20_ISSRE_How Far Have We Come in Detecting Anomalies in Distributed Systems? An Empirical Study with a Statement-level Fault Injection Method [paper]
- 20_DSN_ProFIPy: Programmable Software Fault Injection as-a-Service [paper]
- 20_ICWS_Fitness-guided Resilience Testing of Microservice-based Applications [paper]
- 19_HotCloud_Co-evolving Tracing and Fault Injection with Box of Pain [paper]
- 19_ChaosRCA_Observability and Chaos Engineering on System [paper]
- 19_Chaos_TripleAgent: Monitoring, Perturbation and Failure-obliviousness for Automated Resilience Improvement in Java Applications [paper]
- 18_Chaos_A Program-Aware Fault-Injection Method for Dependability Evaluation Against Soft-Error Using Genetic Algorithm [paper]
- 18_TDSC_Faultprog: Testing the Accuracy of Binary-Level Software Fault Injection [paper]
- 16_SoCC_Automating Failure Testing Research at Internet Scale [paper]
- 16_Survey_Assessing Dependability with Software Fault Injection: A Survey [paper]
- 15_SIGMOD_Lineage-driven Fault Injection [paper]
- 24_SoCC_Deoxys: A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure [paper]
- 24_Eurosys_Atlas: Hybrid Cloud Migration Advisor for Interactive Microservices [paper]
- 22_KDD_NENYA: Cascade Reinforcement Learning for Cost-Aware Failure Mitigation at Microsoft 365 [paper]
- 22_SoCC_Method Overloading the Circuit [paper] [video]
- 21_DSN_FIRestarter: Practical Software Crash Recovery with Targeted Library-level Fault Injection [paper] [code]
- 20_OSDI_Narya: Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions [paper]
- 23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]
- 23_TOSEM_Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective [paper]
- 22_FSE_Understanding Performance Problems in Deep Learning Systems [paper] [code]
- 20_ICSE_An Empirical Study on Program Failures of Deep Learning Jobs [paper]
- 19_ATC_Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads [paper]
- 21_EDBT_JENGA - A Framework to Study the Impact of Data Errors on the Predictions of Machine Learning Models [paper]
- 23_QRS_Online Data Drift Detection for Anomaly Detection Services based on Deep Learning towards Multivariate Time Series [paper]
- 22_EMNLP_On the Impact of Temporal Concept Drift on Model Explanations [paper]
- 19_IOP_An Overview of Overfitting and its Solutions [paper]
- 23_How is ChatGPT’s behavior changing over time? [paper]
- 22_ISSRE_LLTFI: Framework Agnostic Fault Injection for Machine Learning Applications (Tools and Artifact Track) [paper] [code]
- 23_MPGemmFI: A Fault Injection Technique for Mixed Precision GEMM in ML Applications [paper]
- 22_Systematic literature review on software quality for AI-based software [paper]
- 21_SWQD_Software Quality for AI: Where We Are Now? [paper]
- 21_ICSE_Are Machine Learning Cloud APIs Used Correctly? [paper]
- 23_AI for Cybersecurity: A Study on Machine Learning and DoS Attacks AI Robustness and Bypassing Detection Methods [paper]
- 23_Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks [paper]
- 21_Advances in Adversarial Attacks and Defenses in Computer Vision: A Survey [paper]
- 24_AAAI_Visual Adversarial Examples Jailbreak Aligned Large Language Models [paper]
- 22_Faults in deep reinforcement learning programs: a taxonomy and a detection approach [paper]
- 23_ICSE_Data Quality for Software Vulnerability Datasets [paper]
- 22_Author Correction: Advances, challenges and opportunities in creating data for trustworthy AI [paper]
- 22_A review: Data pre-processing and data augmentation techniques [paper]
- 23_VLDB_Data collection and quality challenges in deep learning: a data-centric AI perspective [paper]
- 20_Impact of fully connected layers on performance of convolutional neural networks for image classification [paper]
- 20_INMIC_Effects of hidden layers on the efficiency of neural networks [paper]
- 24_ShortGPT: Layers in Large Language Models are More Redundant Than You Expect [paper]
- 23_Finding Neurons in a Haystack: Case Studies with Sparse Probing [paper]
- 19_NeurIPS_Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence [paper]
- 23_Exploring the Relationship Between Learning Rate, Batch Size, and Epochs in Deep Learning: An Experimental Study [paper]
- 23_A Comprehensive Overview and Comparative Analysis on Deep Learning Models: CNN, RNN, LSTM, GRU [paper]
- 22_Activation functions in deep learning: A comprehensive survey and benchmark [paper]
- 22_A comprehensive survey on regularization strategies in machine learning [paper]
- 21_Comparison of optimization techniques based on gradient descent algorithm: A review [paper]
- 20_A comprehensive survey of loss functions in machine learning [paper]
- 22_Ideal dataset splitting ratios in machine learning algorithms: general concerns for data scientists and data analysts [paper]
- 24_ESE_Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and TensorFlow [paper] [code]
- 23_ICPC_Understanding Bugs in Multi-Language Deep Learning Frameworks [paper]
- 23_TOSEM_Toward Understanding Deep Learning Framework Bugs [paper]
- 22_ICSME_An Empirical Study on Performance Bugs in Deep Learning Frameworks [paper]
- 20_ICSE_Taxonomy of Real Faults in Deep Learning Systems [paper]
- 19_FSE_A Comprehensive Study on Deep Learning Bug Characteristics [paper]
- 18_FSE_An Empirical Study on TensorFlow Program Bugs [paper]
- 24_NSDI_Characterization of Large Language Model Development in the Datacenter [paper]
- 22_INFSOF_A comprehensive empirical study on bug characteristics of deep learning frameworks [paper]
- 22_ASE_Towards Understanding the Faults of JavaScript-Based Deep Learning Systems [paper]
- 20_ICSE_An empirical study on program failures of deep learning jobs [paper]
- 12_ISSRE_An Empirical Study of Bugs in Machine Learning Systems [paper]
- 23_TCAD_Statistical Modeling of Soft Error Influence on Neural Networks [paper]
- 23_Towards efficient generative large language model serving: A survey from algorithms to systems [paper]
- 20_DFT_A Pipelined Multi-Level Fault Injector for Deep Neural Networks [paper]
- 20_ICSE_An empirical study on program failures of deep learning jobs [paper]
- 23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]
- 13_ESOP_Interleaving and lock-step semantics for analysis and verification of GPU kernels [paper]
- 19_ASE_Automating CUDA Synchronization via Program Transformation [paper]
- 18_ISSRE_Bugaroo: Exposing Memory Model Bugs in Many-Core Systems [paper]
- 19_SP_SoK: Sanitizing for Security [paper]
- 23_PLDI_cuCatch: A Debugging Tool for Efficiently Catching Memory Safety Violations in CUDA Applications [paper]
- 23_FSE_Demystifying Dependency Bugs in Deep Learning Stack [paper]
- 23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]
- 24_NSDI_Characterization of Large Language Model Development in the Datacenter [paper]
- 19_ATC_Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads [paper]
- 14_PLDI_Accurate application progress analysis for large-scale parallel debugging [paper]
- 04_SC_Assessing Fault Sensitivity in MPI Applications [paper]
- 16_TECS_CUDA Leaks: A Detailed Hack for CUDA and a (Partial) Fix [paper]
- 18_PLDI_CURD: a dynamic CUDA race detector [paper]
- 17_JSA_Evaluation of transient errors in GPGPUs for safety critical applications: An effective simulation-based fault injection environment [paper]
- 14_DAC_Exploring the Heterogeneous Design Space for both Performance and Reliability [paper]
- 14_LATW_Implementation and experimental evaluation of a CUDA core under single event effects [paper]
- 17_PVM_What does fault tolerant deep learning need from MPI? [paper]
- 21_ACCESS_Diaspore: Diagnosing Performance Interference in Apache Spark [paper]
- 15_HPCC-CSS-ICESS_Performance Prediction for Apache Spark Platform [paper]
- 17_ICWS_Log-based Abnormal Task Detection and Root Cause Analysis for Spark [paper]
- 23_ICSE-SEIP_An Empirical Study on Quality Issues of Deep Learning Platform [paper]
- 21_DSN_Examining Failures and Repairs on Supercomputers with Multi-GPU Compute Nodes [paper]
- 20_SC_GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability [paper]
- 17_SC_Failures in Large Scale Systems: Long-Term Measurement, Analysis, and Implications [paper]
- 15_SC_Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility [paper]
- 16_HPCA_A large-scale study of soft-errors on GPUs in the field [paper]
- 17_MASCOTS_Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities [paper]
- 15_HPCA_Understanding GPU errors on large-scale HPC systems and the implications for system design and operation [paper]
- 14_DSN_Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters [paper]
- 20_ICSE_An empirical study on program failures of deep learning jobs [paper]
- 23_SYSTOR_Predicting GPU Failures With High Precision Under Deep Learning Workloads [paper]
- 14_LISAT_Reliability and fault tolerance analysis of FPGA platforms [paper]
- 16_J_NET_Field Programmable Gate Array Reliability Analysis Using the Dynamic Flowgraph Methodology [paper]
- 10_TII_Component-Based Safety Analysis of FPGAs [paper]
- 21_TVLSI_Reliability Evaluation and Analysis of FPGA-Based Neural Network Acceleration System [paper]
- 24_TNS_Tensor Processing Unit Reliability Dependence on Temperature and Radiation Source [paper]
- 22_RADECS_Sensitivity of Google’s Tensor Processing Units to High-Energy, Mono-Energetic, and Thermal Neutrons [paper]
- 22_DATE_Reliability of Google's Tensor Processing Units for Embedded Applications [paper]
- 24_ICSE_Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection [paper]
- 15_ICLR_Explaining and Harnessing Adversarial Examples [paper] [code]
- 18_ICLR_Towards Deep Learning Models Resistant to Adversarial Attacks [paper] [code]
- 22_DI-AA: An interpretable white-box attack for fooling deep neural networks [paper]
- 20_SP_HopSkipJumpAttack: A Query-Efficient Decision-Based Attack [paper] [code]
- 21_ICML_PopSkipJump: Decision-Based Attack for Probabilistic Classifiers [paper] [code]
- 20_CVPR_GeoDA: A Geometric Framework for Black-Box Adversarial Attacks [paper] [code]
- 23_I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models [paper]
- 23_PromptAid: Prompt Exploration, Perturbation, Testing and Iteration using Visual Analytics for Large Language Models [paper]
- 23_Prompt Injection attack against LLM-integrated Applications [paper]
- 24_Goal-guided Generative Prompt Injection Attack on Large Language Models [paper]
- 20_DSN_PyTorchFI: A Runtime Perturbation Tool for DNNs [paper] [code]
- 18_ISSRE_TensorFI: A Configurable Fault Injector for TensorFlow Applications [paper] [code]
- 21_DeepTest_TF-DM: Tool for Studying ML Model Resilience to Data Faults [paper] [code]
- 21_ISSREW_MindFI: A Fault Injection Tool for Reliability Assessment of MindSpore Applications [paper]
- 18_DAC_Ares: a framework for quantifying the resilience of deep neural networks [paper] [code]
- 19_SC_BinFI: an efficient fault injector for safety-critical machine learning systems [paper] [code]
- 18_ISSRE_DeepMutation: Mutation Testing of Deep Learning Systems [paper] [code]
- 19_ASE_DeepMutation++: A Mutation Testing Framework for Deep Learning Systems [paper]
- 18_QRS_MuNN: Mutation Analysis of Neural Networks [paper]
- 21_ISSTA_DeepCrime: Mutation Testing of Deep Learning Systems Based on Real Faults [paper] [code]
- 22_Towards mutation testing of Reinforcement Learning systems [paper]
- 23_ICST_Mutation Testing of Deep Reinforcement Learning Based on Real Faults [paper]
- 22_SETTA_MTUL: Towards Mutation Testing of Unsupervised Learning Systems [paper]
- 20_ISSRE_TensorFI: Flexible Fault Injection Framework for TensorFlow Applications [paper] [code]
- 22_TDSC_Fault Injection for TensorFlow Applications [paper] [code]
- 20_Fault Injectors for TensorFlow: Evaluation of the Impact of Random Hardware Faults on Deep CNNs [paper] [code]
- 20_LASCAS_Reliability Evaluation of Compressed Deep Learning Models [paper] [code]
- 20_DSN_PyTorchFI: A Runtime Perturbation Tool for DNNs [paper] [code]
- 22_TR_SNIFF: Reverse Engineering of Neural Networks With Fault Attacks [paper]
- 21_ISSREW_MindFI: A Fault Injection Tool for Reliability Assessment of MindSpore Applications [paper]
- 22_IROS_enpheeph: A Fault Injection Framework for Spiking and Compressed Deep Neural Networks [paper] [code]
- 23_TC_Fast and Accurate Error Simulation for CNNs Against Soft Errors [paper]
- 17_ICCAD_Fault injection attack on deep neural network [paper]
- 22_ISSRE_LLTFI: Framework Agnostic Fault Injection for Machine Learning Applications (Tools and Artifact Track) [paper] [code]
- 23_ETS_SCI-FI: a Smart, aCcurate and unIntrusive Fault-Injector for Deep Neural Networks [paper]
- 15_Cluster_Fast Fault Injection and Sensitivity Analysis for Collective Communications [paper]
- 20_ICSE_Simulee: detecting CUDA synchronization bugs via memory-access modeling [paper] [code]
- 20_COMPSAC_CUDAsmith: A Fuzzer for CUDA Compilers [paper] [code]
- 15_SIGPLAN_Many-core compiler fuzzing [paper] [code]
- 00_IPDPS_FIMD-MPI: a tool for injecting faults into MPI application [paper]
- 22_STVR_TRANSMUT-Spark: Transformation mutation for Apache Spark [paper]
- 23_FSE_Co-dependence Aware Fuzzing for Dataflow-Based Big Data Analytics [paper]
- 17_ISPASS_SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation [paper]
- 14_ISPASS_GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications [paper]
- 21_DSN_NVBitFI: Dynamic Fault Injection for GPUs [paper]
- 16_SC_Understanding Error Propagation in GPGPU Applications [paper]
- 17_SC_Understanding error propagation in deep learning neural network (DNN) accelerators and applications [paper]
- 17_HPCA_Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators [paper]
- 18_SBAC-PAD_On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation [paper]
- 12_ISCA_A defect-tolerant accelerator for emerging high-performance applications [paper]
- 21_VTS_Combining Architectural Simulation and Software Fault Injection for a Fast and Accurate CNNs Reliability Evaluation on GPUs [paper]
- 02_DFT_Using run-time reconfiguration for fault injection in hardware prototypes [paper]
- 12_DFT_Fast single-FPGA fault injection platform [paper]
- 14_J.MICROEL_A fast, flexible, and easy-to-develop FPGA-based fault injection technique [paper]
- 01_ATS_FPGA-based fault injection for microprocessor systems [paper]
- 20_Design&Test_Enabling Timing Error Resilience for Low-Power Systolic-Array Based Deep Learning Accelerators [paper]
- 21_ATS_GPU-Accelerated Timing Simulation of Systolic-Array-Based AI Accelerators [paper]
- Lightning: Leveraging DVFS-induced Transient Fault Injection to Attack Deep Learning Accelerator of GPUs [paper]