This is an ongoing attempt to consolidate interesting efforts in the area of understanding / interpreting / explaining / visualizing a pre-trained ML model.
- DeepVis: Deep Visualization Toolbox. Yosinski et al. ICML 2015 code | pdf
- SWAP: Generate adversarial poses of objects in a 3D space. Alcorn et al. CVPR 2019 code | pdf
- AllenNLP: Query online NLP models with user-provided inputs and observe explanations (Gradient, Integrated Gradient, SmoothGrad). Last accessed 03/2020 demo
- CNN visualizations (feature visualization, PyTorch)
- iNNvestigate (attribution, Keras)
- DeepExplain (attribution, Keras)
- Lucid (feature visualization, attribution, Tensorflow)
- TorchRay (attribution, PyTorch)
- Captum (attribution, PyTorch; a minimal usage sketch follows this list)
- InterpretML (attribution, Python)
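Most of the toolboxes above follow the same pattern: wrap a trained model, then call an attribution method on a concrete input and target. Below is a minimal sketch with Captum; the ResNet-18 model, random input, and target class 243 are placeholder choices, not anything the library prescribes.

```python
# Minimal Captum sketch: Integrated Gradients on a torchvision classifier.
# The model, input, and target class are illustrative placeholders.
import torch
from torchvision import models
from captum.attr import IntegratedGradients

model = models.resnet18(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)  # stand-in for a real, normalized image

ig = IntegratedGradients(model)
# Attribute the class-243 logit to input pixels by integrating gradients
# along a straight path from an all-zeros baseline to the input.
attributions = ig.attribute(x, baselines=torch.zeros_like(x), target=243, n_steps=50)
print(attributions.shape)  # torch.Size([1, 3, 224, 224]), same shape as the input
```

The other libraries (TorchRay, iNNvestigate, DeepExplain) expose many of the same methods behind their own, slightly different wrappers.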
- Methods for Interpreting and Understanding Deep Neural Networks. Montavon et al. 2017 pdf
- Visualizations of Deep Neural Networks in Computer Vision: A Survey. Seifert et al. 2017 pdf
- How convolutional neural networks see the world - A survey of convolutional neural network visualization methods. Qin et al. 2018 pdf
- A brief survey of visualization methods for deep learning models from the perspective of Explainable AI. Chalkiadakis 2018 pdf
- A Survey Of Methods For Explaining Black Box Models. Guidotti et al. 2018 pdf
- Understanding Neural Networks via Feature Visualization: A survey. Nguyen et al. 2019 pdf
- Explaining Explanations: An Overview of Interpretability of Machine Learning. Gilpin et al. 2019 pdf
- DARPA updates on the XAI program pdf
- Explainable Artificial Intelligence: a Systematic Review. Vilone et al. 2020 pdf
- Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Rudin. Nature Machine Intelligence 2019 pdf
- Interpretable Machine Learning: Fundamental Principles and 10 Grand Challenges. Rudin et al. 2021 pdf
- The Mythos of Model Interpretability. Lipton 2016 pdf
- Towards A Rigorous Science of Interpretable Machine Learning. Doshi-Velez & Kim. 2017 pdf
- Interpretable machine learning: definitions, methods, and applications. Murdoch et al. 2019 pdf
- Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Molnar 2019 pdf
- AM: Visualizing higher-layer features of a deep network. Erhan et al. 2009 pdf (the gradient-ascent recipe is sketched in code after this block)
- Deep inside convolutional networks: Visualising image classification models and saliency maps. Simonyan et al. 2013 pdf
- DeepVis: Understanding Neural Networks through Deep Visualization. Yosinski et al. ICML workshop 2015 pdf | url
- MFV: Multifaceted Feature Visualization: Uncovering the different types of features learned by each neuron in deep neural networks. Nguyen et al. ICML workshop 2016 pdf | code
- DGN-AM: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Nguyen et al. NIPS 2016 pdf | code
- PPGN: Plug and Play Generative Networks. Nguyen et al. CVPR 2017 pdf | code
- Feature Visualization. Olah et al. 2017 url
- Diverse feature visualizations reveal invariances in early layers of deep neural networks. Cadena et al. 2018 pdf
- Computer Vision with a Single (Robust) Classifier. Santurkar et al. NeurIPS 2019 pdf | blog | code
- BigGAN-AM: Improving sample diversity of a pre-trained, class-conditional GAN by changing its class embeddings. Li et al. 2019 pdf
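The common core of the AM papers above is gradient ascent in input space: start from noise and update the image to maximize a chosen unit or class logit, with some regularizer (or, in DGN-AM / PPGN, a learned generator prior) keeping the result recognizable. A bare-bones sketch in the spirit of Erhan et al. 2009 and Simonyan et al. 2013; the model, class, learning rate, and L2 weight are illustrative choices, not values from any paper.

```python
# Bare-bones activation maximization: ascend the gradient of one class logit
# w.r.t. the input image. Hyperparameters are illustrative, not from any paper.
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the image is optimized

img = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([img], lr=0.05)
target_class = 130  # ImageNet "flamingo"

for _ in range(200):
    optimizer.zero_grad()
    logit = model(img)[0, target_class]
    loss = -logit + 1e-4 * img.norm()  # maximize the logit, keep pixels bounded
    loss.backward()
    optimizer.step()
# `img` now weakly resembles the class; stronger priors (jitter, blurring, or
# a deep generator as in DGN-AM / PPGN) yield far more recognizable images.
```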
- Visualizing and Understanding Recurrent Networks. Karpathy et al. ICLR workshop 2016 pdf
- Object Detectors Emerge in Deep Scene CNNs. Zhou et al. ICLR 2015 pdf
- Understanding Deep Architectures by Interpretable Visual Summaries. Godi et al. BMVC 2019 pdf
- Understanding Deep Image Representations by Inverting Them. Mahendran & Vedaldi. CVPR 2015 pdf (an inversion sketch follows this block)
- Inverting Visual Representations with Convolutional Networks. Dosovitskiy & Brox. CVPR 2016 pdf
- Neural network inversion beyond gradient descent. Wong & Kolter. NIPS workshop 2017 pdf
- Image Processing Using Multi-Code GAN Prior. Gu et al. 2019 pdf
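Inversion turns the same optimization around: instead of maximizing an activation, match a target feature vector and see which images the representation treats as equivalent. A sketch in the spirit of Mahendran & Vedaldi above; the truncation point, step count, and total-variation weight are illustrative choices.

```python
# Feature inversion sketch: recover an image whose CNN features match those
# of a reference image. Layer choice and loss weights are illustrative.
import torch
from torchvision import models

cnn = models.vgg16(pretrained=True).features[:16].eval()  # up to relu3_3
for p in cnn.parameters():
    p.requires_grad_(False)

ref = torch.rand(1, 3, 224, 224)   # stand-in for a real image
target = cnn(ref).detach()         # the representation to invert
x = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)

def total_variation(t):
    # Simple natural-image prior: favor piecewise-smooth reconstructions.
    return (t[..., 1:, :] - t[..., :-1, :]).abs().mean() + \
           (t[..., :, 1:] - t[..., :, :-1]).abs().mean()

for _ in range(300):
    opt.zero_grad()
    loss = (cnn(x) - target).pow(2).mean() + 1e-2 * total_variation(x)
    loss.backward()
    opt.step()
```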
- Interpreting CNNs via Decision Trees. Zhang et al. CVPR 2019 pdf
- Distilling a Neural Network Into a Soft Decision Tree. Frosst & Hinton 2017 pdf (a distillation sketch follows this block)
- Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation. Tan et al. 2018 pdf
- Improving the Interpretability of Deep Neural Networks with Knowledge Distillation. Liu et al. 2018 pdf
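The distillation papers above share one move: use the black-box teacher's predictions as training labels for an interpretable student. A toy sketch with a scikit-learn decision tree as the student; the teacher, data, and tree depth are placeholders, and Frosst & Hinton's soft tree is a differentiable refinement of this basic recipe.

```python
# Distillation-to-tree sketch: fit a shallow decision tree to a black-box
# model's predictions (not the ground truth). All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.random.randn(1000, 10)                 # placeholder inputs
y = np.random.randint(0, 2, 1000)             # placeholder labels
teacher = RandomForestClassifier().fit(X, y)  # the "black box"

y_teacher = teacher.predict(X)                # teacher's labels, not ground truth
student = DecisionTreeClassifier(max_depth=3).fit(X, y_teacher)

print(export_text(student))                   # the tree itself is the explanation
fidelity = (student.predict(X) == y_teacher).mean()  # agreement with the teacher
print(f"fidelity to the teacher: {fidelity:.2f}")
```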
#### A4. Quantitatively characterizing hidden features
- TCAV: Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors. Kim et al. 2018 pdf | code (the CAV-fitting step is sketched after this block)
- Automating Interpretability: Discovering and Testing Visual Concepts Learned by Neural Networks. Ghorbani et al. 2019 pdf
- SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. Raghu et al. 2017 pdf | code
- A Peek Into the Hidden Layers of a Convolutional Neural Network Through a Factorization Lens. Saini et al. 2018 pdf
- Network Dissection: Quantifying Interpretability of Deep Visual Representations. Bau et al. CVPR 2017 url | pdf
- GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. Bau et al. ICLR 2019 pdf
- Net2Vec: Quantifying and Explaining how Concepts are Encoded by Filters in Deep Neural Networks. Fong & Vedaldi CVPR 2018 pdf
- Intriguing generalization and simplicity of adversarially trained neural networks. Agarwal, Chen, Nguyen 2020 pdf
- Understanding the Role of Individual Units in a Deep Neural Network. Bau et al. PNAS 2020 pdf
- How Important Is a Neuron? Dhamdhere et al. 2018 pdf
- NLIZE: A Perturbation-Driven Visual Interrogation Tool for Analyzing and Interpreting Natural Language Inference Models. Liu et al. 2018 pdf
- Feature Removal Is A Unifying Principle For Model Explanation Methods. Covert et al. 2020 pdf
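The first step of TCAV above is simple to state: gather a layer's activations for concept examples vs. random examples, fit a linear classifier, and take the unit-normalized normal of its decision boundary as the concept activation vector (CAV); the TCAV score is then the fraction of a class's inputs whose logit has a positive directional derivative along that vector. A sketch of the CAV-fitting step; the activations here are random placeholders standing in for real, pre-computed ones.

```python
# CAV fitting step from TCAV (Kim et al. 2018), sketched with scikit-learn.
# `acts_concept` / `acts_random` stand in for real activations at one layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

acts_concept = np.random.randn(100, 512)  # (n_examples, n_units) placeholders
acts_random = np.random.randn(100, 512)

X = np.concatenate([acts_concept, acts_random])
y = np.concatenate([np.ones(100), np.zeros(100)])  # concept = 1, random = 0

clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])  # unit normal = the CAV

# Conceptual sensitivity for one input with layer activation `a`:
#   S = dot(grad_a logit_k(a), cav)
# TCAV score = fraction of a class's inputs with S > 0.
```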
- A Taxonomy and Library for Visualizing Learned Features in Convolutional Neural Networks. Grün et al. 2016 pdf
- Deep inside convolutional networks: Visualising image classification models and saliency maps. Simonyan et al. 2013 pdf
- Deconvnet: Visualizing and understanding convolutional networks. Zeiler et al. 2014 pdf
- Guided-backprop: Striving for simplicity: The all convolutional net. Springenberg et al. 2015 pdf
- SmoothGrad: removing noise by adding noise. Smilkov et al. 2017 pdf (sketched together with vanilla gradients after this block)
- DeepLIFT: Learning important features through propagating activation differences. Shrikumar et al. 2017 pdf
- Integrated Gradients: Axiomatic Attribution for Deep Networks. Sundararajan et al. ICML 2017 pdf | code
- Expected Gradients: Learning Explainable Models Using Attribution Priors. Erion et al. 2019 pdf | code
- I-GOS: Visualizing Deep Networks by Optimizing with Integrated Gradients. Qi et al. 2019 pdf
- BlurIG: Attribution in Scale and Space. Xu et al. CVPR 2020 pdf | code
- XRAI: Better Attributions Through Regions. Kapishnikov et al. ICCV 2019 pdf | code
- LRP: Beyond saliency: understanding convolutional neural networks from saliency prediction on layer-wise relevance propagation pdf
- DTD: Explaining Nonlinear Classification Decisions With Deep Taylor Decomposition. Montavon et al. 2017 pdf
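Nearly every method in this block starts from the gradient of a class score with respect to the input (Simonyan et al. 2013) and differs in how that gradient is filtered or accumulated. SmoothGrad is the easiest to write down: average the gradient over noisy copies of the input. A sketch of both; the noise level and sample count follow the paper's ballpark suggestions rather than tuned values.

```python
# Vanilla gradient saliency (Simonyan et al. 2013) and SmoothGrad
# (Smilkov et al. 2017). Model, input, and target class are placeholders.
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)  # stand-in for a normalized image
target = 243

def grad_saliency(x, target):
    x = x.clone().requires_grad_(True)
    model(x)[0, target].backward()
    return x.grad.detach()

def smoothgrad(x, target, n=25, sigma=0.15):
    # Average vanilla gradients over n noisy copies of the input.
    grads = [grad_saliency(x + sigma * torch.randn_like(x), target)
             for _ in range(n)]
    return torch.stack(grads).mean(dim=0)

heatmap = smoothgrad(x, target).abs().max(dim=1)[0]  # collapse RGB to one map
```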
- CAM: Learning Deep Features for Discriminative Localization. Zhou et al. 2016 code | web
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. Selvaraju et al. 2017 pdf (sketched after this block)
- Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks. Chattopadhyay et al. 2017 pdf | code
- Smooth Grad-CAM++: An Enhanced Inference Level Visualization Technique for Deep Convolutional Neural Network Models. Omeiza et al. 2019 pdf
- NormGrad: There and Back Again: Revisiting Backpropagation Saliency Methods. Rebuffi et al. CVPR 2020 pdf | code
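The CAM family above localizes evidence with the last convolutional layer's activation maps; Grad-CAM weights each map by the global-average-pooled gradient of the class score and applies a ReLU. A compact sketch using hooks; hooking `layer4` of a ResNet-18 is a common practical choice, not the paper's original VGG setup.

```python
# Grad-CAM (Selvaraju et al. 2017) sketched with forward/backward hooks.
# Hooked layer, model, and target class are illustrative choices.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(pretrained=True).eval()
feats, grads = {}, {}

model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.rand(1, 3, 224, 224)  # stand-in for a normalized image
model(x)[0, 243].backward()     # backprop the target-class logit

weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # GAP over spatial dims
cam = F.relu((weights * feats["a"]).sum(dim=1))      # weighted sum, then ReLU
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],  # upsample to input
                    mode="bilinear", align_corners=False)
```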
- MP: Interpretable Explanations of Black Boxes by Meaningful Perturbation. Fong et al. 2017 pdf (the mask-optimization recipe is sketched after this block)
- FIDO: Explaining image classifiers by counterfactual generation. Chang et al. ICLR 2019 pdf
- FG-Vis: Interpretable and Fine-Grained Visual Explanations for Convolutional Neural Networks. Wagner et al. CVPR 2019 pdf
- Full-Gradient Representation for Neural Network Visualization. Srinivas et al. NeurIPS 2019 pdf
- Bias also matters: Bias attribution for deep neural network explanation. Wang et al. ICML 2019 pdf
- Visual explanation by interpretation: Improving visual feedback capabilities of deep neural networks. Oramas et al. 2019 pdf
- Regional Multi-scale Approach for Visually Pleasing Explanations of Deep Neural Networks. Seo et al. 2018 pdf
- Multimodal explanations: Justifying decisions and pointing to the evidence. Park et al. CVPR 2018 pdf
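MP above (and FIDO and FG-Vis after it) frames explanation as learning the perturbation itself: optimize a soft mask that, when used to degrade the image, destroys the prediction while staying as small as possible. A sketch after Fong & Vedaldi's idea; the blur stand-in, sparsity weight, and step count are illustrative, and the paper adds further regularizers omitted here.

```python
# Meaningful-perturbation sketch (after Fong et al. 2017): learn a mask that
# blurs away the evidence for a class. All settings are illustrative.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)

x = torch.rand(1, 3, 224, 224)                      # stand-in for a real image
blurred = F.avg_pool2d(x, 11, stride=1, padding=5)  # cheap stand-in for Gaussian blur
target = 243

m = torch.zeros(1, 1, 224, 224, requires_grad=True)  # mask logits
opt = torch.optim.Adam([m], lr=0.05)

for _ in range(150):
    opt.zero_grad()
    mask = torch.sigmoid(m)                     # in [0, 1]; 1 = perturb here
    perturbed = x * (1 - mask) + blurred * mask
    score = model(perturbed).softmax(dim=1)[0, target]
    loss = score + 1e-3 * mask.abs().mean()     # delete evidence with a small mask
    loss.backward()
    opt.step()
# High-mask regions are the evidence whose removal most hurts the prediction.
```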
- Attention is not Explanation. Jain & Wallace. NAACL 2019 pdf
- Attention is not not Explanation. Wiegreffe & Pinter. EMNLP 2019 pdf
- Learning to Deceive with Attention-Based Explanations. Pruthi et al. ACL 2020 pdf
- Sliding-Patch: Visualizing and understanding convolutional networks. Zeiler et al. 2014 pdf (sketched after this block)
- PDA: Visualizing deep neural network decisions: Prediction difference analysis. Zintgraf et al. ICLR 2017 pdf
- RISE: Randomized Input Sampling for Explanation of Black-box Models. Petsiuk et al. BMVC 2018 pdf
- LIME: Why should I trust you?: Explaining the predictions of any classifier. Ribeiro et al. 2016 pdf | blog
- SHAP: A Unified Approach to Interpreting Model Predictions. Lundberg et al. 2017 pdf | code
- OSFT: Interpreting Black Box Models via Hypothesis Testing. Burns et al. 2019 pdf
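The simplest member of this perturbation family is the sliding-patch test of Zeiler et al. above: occlude each image region in turn and record how much the target-class probability drops; RISE, LIME, and SHAP can be read as increasingly principled generalizations of the same idea. A direct sketch; the patch size, stride, and gray fill value are arbitrary choices.

```python
# Sliding-patch occlusion (Zeiler et al. 2014): gray out each region and
# measure the drop in the target-class probability. Settings are illustrative.
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)  # stand-in for a normalized image
target, patch, stride = 243, 32, 16

n = (224 - patch) // stride + 1
heatmap = torch.zeros(n, n)
with torch.no_grad():
    base = model(x).softmax(dim=1)[0, target]
    for i in range(0, 224 - patch + 1, stride):
        for j in range(0, 224 - patch + 1, stride):
            occluded = x.clone()
            occluded[..., i:i + patch, j:j + patch] = 0.5  # gray patch
            p = model(occluded).softmax(dim=1)[0, target]
            heatmap[i // stride, j // stride] = base - p   # big drop = important
```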
- The (Un)reliability of saliency methods. Kindermans et al. 2018 pdf
- ROAR: A Benchmark for Interpretability Methods in Deep Neural Networks. Hooker et al. NeurIPS 2019 pdf | code (a simpler deletion-curve variant is sketched after this block)
- Sanity Checks for Saliency Maps. Adebayo et al. 2018 pdf
- A Theoretical Explanation for Perplexing Behaviors of Backpropagation-based Visualizations. Nie et al. 2018 pdf
- BIM: Towards Quantitative Evaluation of Interpretability Methods with Ground Truth. Yang et al. 2019 pdf
- On the (In)fidelity and Sensitivity for Explanations. Yeh et al. 2019 pdf
- SAM: The Sensitivity of Attribution Methods to Hyperparameters. Bansal, Agarwal, Nguyen. CVPR 2020 pdf | code
- Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? Hase & Bansal. ACL 2020 pdf | code
- Teach Me to Explain: A Review of Datasets for Explainable NLP. Wiegreffe & Marasović 2021 pdf | web
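A recurring recipe in this evaluation block is deletion: remove the pixels an attribution map ranks highest and watch the prediction degrade, as in ROAR (which additionally retrains the model on the ablated data). A sketch of the simpler, no-retraining deletion curve; the deletion fractions, fill value, and random placeholder map are illustrative.

```python
# Deletion-curve sketch (no retraining, unlike ROAR): zero out the top-k%
# pixels ranked by an attribution map and track the target probability.
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)
target = 243
attribution = torch.rand(1, 1, 224, 224)  # placeholder for a real saliency map

with torch.no_grad():
    order = attribution.flatten().argsort(descending=True)  # most important first
    for frac in (0.0, 0.1, 0.3, 0.5):
        k = int(frac * order.numel())
        mask = torch.ones(order.numel())
        mask[order[:k]] = 0.0                       # delete the top-k pixels
        xk = x * mask.view(1, 1, 224, 224)          # broadcast over RGB channels
        p = model(xk).softmax(dim=1)[0, target].item()
        print(f"deleted {frac:.0%}: p(target) = {p:.3f}")  # faithful maps drop fast
```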
- Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. Ross et al. IJCAI 2017 pdf (the loss is sketched after this block)
- Learning Explainable Models Using Attribution Priors. Erion et al. 2019 pdf
- Interpretations are useful: penalizing explanations to align neural networks with prior knowledge. Rieger et al. 2019 pdf
- This Looks Like That: Deep Learning for Interpretable Image Recognition. Chen et al. NeurIPS 2019 pdf | code
- Learning how to explain neural networks: PatternNet and PatternAttribution. Kindermans et al. ICLR 2018 pdf
- Deep Learning for Case-Based Reasoning through Prototypes. Li et al. AAAI 2018 pdf
- Unsupervised Learning of Neural Networks to Explain Neural Networks pdf
- Automated Rationale Generation: A Technique for Explainable AI and its Effects on Human Perceptions pdf
- Rationalization: A Neural Machine Translation Approach to Generating Natural Language Explanations pdf
- Towards robust interpretability with self-explaining neural networks. Alvarez-Melis & Jaakkola. NeurIPS 2018 pdf
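Of the training-time approaches above, the "right for the right reasons" objective of Ross et al. is the most compact: ordinary cross-entropy plus a penalty on input gradients wherever an annotation mask marks the evidence as irrelevant. A sketch of that loss term; the mask convention (1 = irrelevant) and penalty weight are illustrative.

```python
# "Right for the right reasons"-style loss (after Ross et al. 2017): penalize
# input gradients of the log-probabilities in annotated-irrelevant regions.
import torch
import torch.nn.functional as F

def rrr_loss(model, x, y, irrelevant_mask, lam=10.0):
    x = x.clone().requires_grad_(True)
    log_probs = F.log_softmax(model(x), dim=1)
    ce = F.nll_loss(log_probs, y)
    # create_graph=True keeps the gradient differentiable so the penalty
    # itself can be trained through.
    grad = torch.autograd.grad(log_probs.sum(), x, create_graph=True)[0]
    penalty = ((irrelevant_mask * grad) ** 2).sum()
    return ce + lam * penalty
```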
- Counterfactual Explanations for Machine Learning: A Review. Verma et al. 2020 pdf (the basic optimization these methods share is sketched after this block)
- Interpreting Neural Network Judgments via Minimal, Stable, and Symbolic Corrections. Zhang et al. 2018 pdf
- Counterfactual Visual Explanations. Goyal et al. 2019 pdf
- Generative Counterfactual Introspection for Explainable Deep Learning. Liu et al. 2019 pdf
- Generative causal explanations of black-box classifiers. O’Shaughnessy et al. 2020 pdf
- Removing input features via a generative model to explain their attributions to classifier's decisions. Agarwal et al. 2019 pdf | code
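Underneath the generative machinery, most of the counterfactual methods above solve one optimization: find a small perturbation that flips the model to a chosen class while staying close to the original input (the basic recipe surveyed by Verma et al.). A gradient-descent sketch; the counterfactual class, distance weight, and step count are illustrative.

```python
# Minimal counterfactual search: optimize a perturbation until the model
# predicts the counterfactual class. All settings are illustrative.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)

x = torch.rand(1, 3, 224, 224)   # original input (placeholder)
cf_class = torch.tensor([207])   # desired counterfactual class
delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.01)

for _ in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(model(x + delta), cf_class) \
           + 0.1 * delta.abs().mean()  # stay close to the original input
    loss.backward()
    opt.step()
# (x + delta) is the counterfactual; the L1 penalty keeps the change sparse,
# and hence easier to interpret.
```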
- Explainable Artificial Intelligence via Bayesian Teaching. Yang & Shafto. NIPS 2017 pdf
- Explainable AI for Designers: A Human-Centered Perspective on Mixed-Initiative Co-Creation pdf
- ICADx: Interpretable computer aided diagnosis of breast masses. Kim et al. 2018 pdf
- Neural Network Interpretation via Fine Grained Textual Summarization. Guo et al. 2018 pdf
- LS-Tree: Model Interpretation When the Data Are Linguistic. Chen et al. 2019 pdf