Root cause analysis in distributed systems

Content

Metric

  • 23_ASPLOS_ShapleyIQ: Influence Quantification by Shapley Values for Performance Debugging of Microservices [paper]
  • 23_ASE_PERFCE: Performance Debugging on Databases with Chaos Engineering-Enhanced Causality Analysis [paper] [code]
  • 23_ICSE-SEIP_CONAN: Diagnosing Batch Failures for Cloud Systems [paper]
  • 23_WWW_CMDiagnostor: An Ambiguity-Aware Root Cause Localization Approach Based on Call Metric Data [paper] [code]
  • 23_WWW_CausIL: Causal Graph for Instance Level Microservice Data [paper] [code]
  • 22_Cloud_MicroLens: A Performance Analysis Framework for Microservices Using Hidden Metrics With BPF [paper]
  • 22_DSN_RAPMiner: A Generic Anomaly Localization Mechanism for CDN System with Multi-dimensional KPIs [paper] [code]
  • 22_ASE_Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems [paper] [code]
  • 22_FSE_Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems [paper] [code]
  • 21_ICSE Workshop_MicroDiag: Fine-grained Performance Diagnosis for Microservice Systems [paper]
  • 21_ICSE Workshop_MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems [paper]
  • 21_ISSRE_Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems [paper]
  • 20_ASE_ImpAPTr: A Tool For Identifying The Clues To Online Service Anomalies [paper] [code]
  • 20_VLDB_Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases [paper] [code]
  • 20_IWQoS_Localizing Failure Root Causes in a Microservice through Causality Inference [paper]
  • 20_NOMS_MicroRCA: Root Cause Localization of Performance Issues in Microservices [paper] [code]
  • 20_ICSE_Debugging Crashes using Continuous Contrast Set Mining [paper]
  • 19_WWW_ϵ-Diagnosis: Unsupervised and Real-time Diagnosis of Small-window Long-tail Latency in Large-scale Microservice Platforms [paper] [code]
  • 18_ICSOC_Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments [paper]
  • 18_OSDI_Orca: Differential Bug Localization in Large-Scale Services [paper] Awarded Best Paper! 👍
  • 16_ICSE_iDice: Problem Identification for Emerging Issues [paper]
  • 14_NSDI_Adtributor: Revenue Debugging in Advertising Systems [paper]

Log

  • 21_ISSTA_Faster, Deeper, Easier: Crowdsourcing Diagnosis of Microservice Kernel Failure from User Space [paper] [code]
  • 20_ICSE-SEIP_DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services [paper] [code]
  • 19_ICSE_Mining Historical Test Logs to Predict Bugs and Localize Faults in the Test Logs [paper]
  • 18_FSE_Identifying Impactful Service System Problems via Log Analysis [paper] [code]
  • 18_ESEM_Spectrum-Based Log Diagnosis [paper]
  • 14_Cluster_Digging Deeper into Cluster System Logs for Failure Prediction and Root Cause Diagnosis [paper]

Trace

  • 23_FSE_TraceDiag: Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems [paper]
  • 23_DATE_ImpactTracer: Root Cause Localization in Microservices Based on Fault Propagation Modeling [paper]
  • 22_Cloud_Localizing and Explaining Faults in Microservices Using Distributed Tracing [paper]
  • 22_ICSOC_MicroSketch: Lightweight and Adaptive Sketch based Performance Issue Detection and Localization in Microservice Systems [paper]
  • 22_ESE_Enjoy your observability: an industrial survey of microservice tracing and analysis [paper]
  • 21_IWQoS_Practical Root Cause Localization for Microservice Systems via Trace Analysis [paper] [code]
  • 21_Usenix ATC_Argus: Debugging Performance Issues in Modern Desktop Applications with Annotated Causal Tracing [paper] [code] Awarded Best Paper! 👍
  • 21_WWW_MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments [paper] [code]
  • 21_ICSE_Scalable Statistical Root Cause Analysis on App Telemetry [paper]
  • 21_JSEP_TraceRank: Abnormal service localization with dis-aggregated end-to-end tracing data in cloud native systems [paper]
  • 20_FSE_Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis [paper]
  • 20_CCGrid_T-Rank: A Lightweight Spectrum based Fault Localization Approach for Microservice Systems [paper]
  • 19_SOSP_The Inflection Point Hypothesis: A Principled Debugging Approach for Locating the Root Cause of a Failure [paper]
  • 18_ASPLOS_FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems [paper]
  • 18_ATC_Troubleshooting Transiently-Recurring Problems in Production Systems with Blame-Proportional Logging [paper]
  • 16_TPDS_Failure Diagnosis for Distributed Systems using Targeted Fault Injection [paper]
  • 14_OSDI_The Mystery Machine: End-to-end performance analysis of large-scale Internet services [paper]
  • 04_OSDI_Using Magpie for request extraction and workload modelling [paper]
  • 04_NSDI_Path-Based Failure and Evolution Management [paper]
  • 02_DSN_Pinpoint: Problem Determination in Large, Dynamic Internet Services [paper]

Metric and Log

  • 21_CIKM_CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms [paper]

Metric and Trace

  • 21_ASPLOS_Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices [paper]

Metric, Log and Trace

  • 23_TSC_Robust Failure Diagnosis of Microservice System through Multimodal Data [paper] [data]
  • 23_FSE_Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data [paper] [code]
  • 23_ICSE_Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data [paper]
  • 22_ICCBR_MicroCBR: Case-Based Reasoning on Spatio-temporal Fault Knowledge Graph for Microservices Troubleshooting [paper] [code]
  • 22_TSE_TrinityRCL: Multi-Granular and Code-Level Root Cause Localization Using Multiple Types of Telemetry Data in Microservice Systems [paper]
  • 22_ICSE_PerfSig: Extracting Performance Bug Signatures via Multi-modality Causal Analysis [paper] [code]
  • 21_IPDSP_Diagnosing Performance Issues in Microservices with Heterogeneous Data Source [paper]
  • 20_ESOCC_Multi-source Distributed System Data for AI-Powered Analytics [paper] [data] [code]

Network

  • 23_ICPP_MARS: Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry [paper]
  • 20_SIGCOMM_Microscope: Queue-based Performance Diagnosis for Network Functions [paper]
  • 16_SIGCOMM_The Good, the Bad, and the Differences: Better Network Diagnostics with Differential Provenance [paper]

Alert

  • 22_Online Summarizing Alerts through Semantic and Behavior Information [paper]
  • 20_ICSE_Understanding and handling alert storm for online service systems [paper]
  • 20_INFOCOM_Automatically and Adaptively Identifying Severe Alerts for Online Service Systems [paper]
  • 15_KDD_Unveiling clusters of events for alert and incident management in large-scale enterprise IT [paper]

Knowledge Graph

  • 20_Applied Sciences_A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications [paper]

Change

  • 24_FSE_ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems [paper]
  • 18_OSDI_Orca: Differential Bug Localization in Large-Scale Services [paper]
  • 07_SOSP_Staged Deployment in Mirage, an Integrated Software Upgrade Testing and Distribution System [paper]

Profile

  • 22_NSDI_How to diagnose nanosecond network latencies in rich end-host stacks [paper]
  • 18_HPDC_Profiling Distributed Systems in Lightweight Virtualized Environments with Logs and Resource Metrics [paper]
  • 16_OSDI_Non-Intrusive Performance Profiling for Entire Software Stacks Based on the Flow Reconstruction Principle [paper]
  • 06_OSDI_Operating System Profiling via Latency Analysis [paper]