From 69ecefe70a8f588a2a6330d7b9462596121c6841 Mon Sep 17 00:00:00 2001
From: "Daniel C. Elton"
Date: Sun, 9 Aug 2020 07:33:22 -0500
Subject: [PATCH 1/3] Update interpretability discussion

Derived from https://github.com/greenelab/deep-review/pull/988 and split into a new branch
---
 content/06.discussion.md | 50 +++++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/content/06.discussion.md b/content/06.discussion.md
index 5baa3a37..422010d8 100644
--- a/content/06.discussion.md
+++ b/content/06.discussion.md
@@ -3,7 +3,7 @@ Despite the disparate types of data and scientific goals in the learning tasks
 covered above, several challenges are broadly important for deep learning in the biomedical domain.
 Here we examine these factors that may impede further progress, ask what steps have already been taken to overcome them, and suggest future research directions.
 
-### Customizing deep learning models reflects a tradeoff between bias and variance
+### Preventing overfitting and hyperparameter tuning
 
 Some of the challenges in applying deep learning are shared with other machine learning methods.
 In particular, many problem-specific optimizations described in this review reflect a recurring universal tradeoff---controlling the flexibility of a model in order to maximize predictivity.
@@ -12,8 +12,14 @@ One way of understanding such model optimizations is that they incorporate exter
 This balance is formally described as a tradeoff between "bias and variance"
 [@tag:goodfellow2016deep].
 
-Although the bias-variance tradeoff is common to all machine learning applications, recent empirical and theoretical observations suggest that deep learning models may have uniquely advantageous generalization properties [@tag:Zhang2017_generalization; @tag:Lin2017_why_dl_works].
-Nevertheless, additional advances will be needed to establish a coherent theoretical foundation that enables practitioners to better reason about their models from first principles.
+Although the bias-variance trade-off is important to take into account for many classical machine learning models, recent empirical and theoretical observations suggest that deep neural networks in particular do not exhibit the tradeoff as expected [@tag:Belkin2019_PNAS; @tag:Zhang2017_generalization; @tag:Lin2017_why_dl_works].
+It has been demonstrated that poor generalization (high test error) can often be remedied by adding more layers and increasing the number of free parameters, in conflict with classic bias-variance theory.
+This phenomenon, known as "double descent", indicates that deep neural networks achieve their best performance when they smoothly interpolate the training data, resulting in near-zero training error [@tag:Belkin2019_PNAS].
+
+To optimize neural networks, hyperparameters must be tuned to yield the network with the best test error.
+This is computationally expensive and often not done; however, it is important when making claims about the superiority of one machine learning method over another.
+Several examples have now been uncovered where a new method was claimed to be superior to a baseline method (such as an LSTM or a vanilla CNN), but the difference was later found to disappear after sufficient hyperparameter tuning [@tag:Sculley2018].
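+
+As a minimal sketch of what such tuning involves (the dataset, model, and search ranges below are illustrative placeholders rather than a recommended protocol), each method under comparison can be given the same search budget over its hyperparameters, with the test set touched only once at the end:
+
+```python
+from sklearn.datasets import make_classification
+from sklearn.model_selection import RandomizedSearchCV, train_test_split
+from sklearn.neural_network import MLPClassifier
+
+# Toy stand-in dataset; in practice this would be the biomedical dataset of interest.
+X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
+X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
+
+# Randomized search over architecture and regularization, scored by cross-validation
+# on the training data only.  The baseline method should receive the same budget (n_iter).
+search = RandomizedSearchCV(
+    MLPClassifier(max_iter=500, random_state=0),
+    param_distributions={
+        "hidden_layer_sizes": [(32,), (64,), (64, 64), (128, 128)],
+        "alpha": [1e-5, 1e-4, 1e-3, 1e-2],          # L2 penalty
+        "learning_rate_init": [1e-4, 1e-3, 1e-2],
+    },
+    n_iter=20,
+    cv=3,
+    random_state=0,
+)
+search.fit(X_train, y_train)
+
+# The held-out test set is used only once, after tuning is finished.
+print(search.best_params_, search.score(X_test, y_test))
+```
+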
+A related practice that should be more widely adopted is the "ablation study", in which parts of a network are removed and the network is retrained, as this helps clarify the importance of different components, including any novel ones [@tag:Sculley2018].
 
 #### Evaluation metrics for imbalanced classification
 
@@ -106,18 +112,35 @@ As a result, several opportunities for innovation arise: understanding the cause
 Unfortunately, uncertainty quantification techniques are underutilized in the computational biology communities and largely ignored in the current deep learning for biomedicine literature.
 Thus, the practical value of uncertainty quantification in biomedical domains is yet to be appreciated.
 
-### Interpretation
+### Interpretability
+
+As deep learning models achieve state-of-the-art performance in a variety of domains, there is a growing need to develop methods for interpreting how they function.
+There are several important reasons why one might be interested in interpretability, which is also called "explainability".
+
+Firstly, a model that achieves breakthrough performance may have identified patterns in the data that practitioners in the field would like to understand.
+For instance, interpreting a model for predicting chemical properties from molecular graphs may illuminate previously unknown structure-property relations.
+It is also useful to check whether a model is using known relationships; if not, this may suggest a way to improve the model.
+Finally, there is a chance that the model has learned relationships that are known to be wrong.
+This can result from improper training data or from overfitting on spurious correlations in the training data.
 
-As deep learning models achieve state-of-the-art performance in a variety of domains, there is a growing need to make the models more interpretable.
-Interpretability matters for two main reasons.
-First, a model that achieves breakthrough performance may have identified patterns in the data that practitioners in the field would like to understand.
-However, this would not be possible if the model is a black box.
-Second, interpretability is important for trust.
-If a model is making medical diagnoses, it is important to ensure the model is making decisions for reliable reasons and is not focusing on an artifact of the data.
+This is particularly important if a model is making medical diagnoses.
 A motivating example of this can be found in Caruana et al. [@tag:Caruana2015_intelligible], where a model trained to predict the likelihood of death from pneumonia assigned lower risk to patients with asthma, but only because such patients were treated as higher priority by the hospital.
-In the context of deep learning, understanding the basis of a model's output is particularly important as deep learning models are unusually susceptible to adversarial examples [@tag:Nguyen2014_adversarial] and can output confidence scores over 99.99% for samples that resemble pure noise.
-As the concept of interpretability is quite broad, many methods described as improving the interpretability of deep learning models take disparate and often complementary approaches.
+It has been shown that deep learning models are unusually susceptible to carefully crafted adversarial examples [@tag:Nguyen2014_adversarial] and can output confidence scores over 99.99% for samples that resemble pure noise.
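+
+The basic recipe for crafting such an adversarial example is strikingly simple; the sketch below uses an untrained stand-in network and a random "image" purely for illustration, perturbing the input in the direction of the sign of the loss gradient (the fast gradient sign method):
+
+```python
+import torch
+import torch.nn as nn
+
+# Untrained stand-in classifier; in practice this would be a trained model.
+model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
+loss_fn = nn.CrossEntropyLoss()
+
+x = torch.rand(1, 1, 28, 28, requires_grad=True)  # placeholder input "image"
+y = torch.tensor([3])                             # placeholder label
+
+loss = loss_fn(model(x), y)
+loss.backward()
+
+eps = 0.1  # perturbation budget; small enough to be nearly imperceptible for real images
+x_adv = (x + eps * x.grad.sign()).clamp(0, 1).detach()
+
+# For a trained model, the prediction on x_adv frequently differs from that on x.
+print(model(x).argmax(dim=1).item(), model(x_adv).argmax(dim=1).item())
+```
+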
+While adversarial vulnerability is still largely an unsolved problem, interpreting deep learning models may help us understand these failure modes and how to prevent them.
+
+Several different levels of interpretability can be distinguished.
+Consider a prototypical CNN used for image classification.
+At a high level, one can perform an occlusion or sensitivity analysis to determine which sections of an image are most important for making a classification, generating a "saliency" heatmap.
+Then, if one wishes to understand what is going on in the layers of the model, several tools have been developed for visualizing the learned feature maps, such as the deconvnet [@tag:Zeiler2013_visualizing].
+Finally, if one wishes to analyze the flow of information through a deep neural network, layer-wise relevance propagation can be performed to see how each layer contributes to different classifications [@tag:Montavon2018_visualization].
+
+A starting point for many discussions of interpretability is the interpretability-accuracy trade-off.
+This framing assumes that only simple models are interpretable, and a delineation is often made between "white box" models (linear regression, decision trees), which are assumed to be not very accurate, and "black box" models (neural networks, kernel SVMs), which are assumed to be more accurate.
+This view is becoming outmoded, however, with the development of sophisticated tools for interrogating and understanding deep neural networks [@tag:Montavon2018_visualization; @tag:Zeiler2013_visualizing] and new methods for creating highly accurate interpretable models [@tag:Rudin2019].
+Still, this trade-off motivates a common practice whereby an easy-to-interpret model is trained alongside a hard-to-interpret one, which is sometimes called "post-hoc interpretation".
+For instance, in the example from Caruana et al. mentioned earlier, a rule-based model was trained alongside a neural network on the same training data to understand the types of relations the neural network may have learned.
+Along similar lines, a method for "distilling" a neural network into a decision tree has been developed [@tag:Frosst2017_distilling].
 
 #### Assigning example-specific importance scores
 
@@ -219,7 +242,8 @@ Towards this end, Che et al. [@tag:Che2015_distill] used gradient boosted trees
 
 Finally, it is sometimes possible to train the model to provide justifications for its predictions.
 Lei et al. [@tag:Lei2016_rationalizing] used a generator to identify "rationales", which are short and coherent pieces of the input text that produce similar results to the whole input when passed through an encoder.
-The authors applied their approach to a sentiment analysis task and obtained substantially superior results compared to an attention-based method.
+Shen et al. [@tag:Shen2019] trained a CNN for lung nodule malignancy classification that also outputs a series of attributes for the nodule, which they argue help explain how the network functions.
+These are both simple examples of an emerging approach to engendering trust in AI systems that Elton calls "self-explaining AI" [@tag:Elton2020].
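+
+The surrogate or "mimic" strategy described above can be illustrated in a few lines (the dataset and models here are toy stand-ins): an interpretable student model is fit to the black-box model's predictions rather than to the true labels, and its usefulness is judged by its fidelity, i.e. how often it agrees with the black box on held-out data:
+
+```python
+import numpy as np
+from sklearn.datasets import make_classification
+from sklearn.model_selection import train_test_split
+from sklearn.neural_network import MLPClassifier
+from sklearn.tree import DecisionTreeClassifier
+
+X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
+X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
+
+# "Black box": a small neural network trained on the true labels.
+net = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
+net.fit(X_train, y_train)
+
+# Surrogate: a shallow decision tree trained to reproduce the network's predictions.
+tree = DecisionTreeClassifier(max_depth=4, random_state=0)
+tree.fit(X_train, net.predict(X_train))
+
+# Fidelity: how often the readable surrogate agrees with the black box on held-out data.
+fidelity = np.mean(tree.predict(X_test) == net.predict(X_test))
+print(f"surrogate fidelity: {fidelity:.2f}")
+```
+
+If fidelity is low, explanations read off the surrogate (for example, the tree's decision rules) should not be trusted as descriptions of the original network.
+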
 
 #### Future outlook
 

From 3e17861c4dacb86bf8d53ce4926ac9e2f7a04efe Mon Sep 17 00:00:00 2001
From: dan
Date: Wed, 12 Aug 2020 10:52:25 -0400
Subject: [PATCH 2/3] improve edits to 06.discussion.md on interpretability

---
 content/06.discussion.md | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/content/06.discussion.md b/content/06.discussion.md
index 422010d8..b45f8c0b 100644
--- a/content/06.discussion.md
+++ b/content/06.discussion.md
@@ -6,17 +6,20 @@ Here we examine these factors that may impede further progress, ask what steps h
 ### Preventing overfitting and hyperparameter tuning
 
 Some of the challenges in applying deep learning are shared with other machine learning methods.
-In particular, many problem-specific optimizations described in this review reflect a recurring universal tradeoff---controlling the flexibility of a model in order to maximize predictivity.
+In particular, many problem-specific optimizations described in this review reflect a recurring universal tradeoff---controlling the flexibility of a model in order to maximize its ability to generalize from the training set to the test set.
 Methods for adjusting the flexibility of deep learning models include dropout, reduced data projections, and transfer learning (described below).
 One way of understanding such model optimizations is that they incorporate external information to limit model flexibility and thereby improve predictions.
 This balance is formally described as a tradeoff between "bias and variance"
 [@tag:goodfellow2016deep].
 
-Although the bias-variance trade-off is important to take into account for many classical machine learning models, recent empirical and theoretical observations suggest that deep neural networks in particular do not exhibit the tradeoff as expected [@tag:Belkin2019_PNAS; @tag:Zhang2017_generalization; @tag:Lin2017_why_dl_works].
+Although the bias-variance trade-off is important to take into account for many classical statistical learning models with a small number of parameters, recent empirical and theoretical observations suggest that models with a large number of parameters do not exhibit the tradeoff as expected [@tag:Belkin2019_PNAS; @tag:Zhang2017_generalization; @tag:Lin2017_why_dl_works].
 It has been demonstrated that poor generalization (high test error) can often be remedied by adding more layers and increasing the number of free parameters, in conflict with classic bias-variance theory.
-This phenomenon, known as "double descent", indicates that deep neural networks achieve their best performance when they smoothly interpolate the training data, resulting in near-zero training error [@tag:Belkin2019_PNAS].
-To optimize neural networks, hyperparameters must be tuned to yield the network with the best test error.
+This phenomenon, known as "double descent", has been found in neural networks, decision trees, and a few other types of models [@arxiv:1903.07571]. It has been interpreted as indicating that deep neural networks achieve their best performance when they possess enough parameters to smoothly interpolate their training data, resulting in near-zero training error while avoiding the overshoot and undershoot typical of overfitting [@tag:Belkin2019_PNAS; @doi:10.1016/j.neuron.2019.12.002].
+Curiously, a double descent-like curve also appears as a function of the number of training epochs, although this phenomenon is hidden by the common practice of early stopping [@arxiv:1912.02292].
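+
+The model-size half of this curve can often be reproduced in a toy setting (the sketch below is only illustrative and is not taken from the cited studies): minimum-norm least-squares regression on random ReLU features tends to show a spike in test error near the interpolation threshold, where the number of features roughly equals the number of training points, followed by a second descent as the model keeps growing:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+n_train, n_test, d = 100, 1000, 10
+w_true = rng.normal(size=d)
+
+def make_data(n):
+    X = rng.normal(size=(n, d))
+    return X, X @ w_true + 0.5 * rng.normal(size=n)  # noisy linear target
+
+X_train, y_train = make_data(n_train)
+X_test, y_test = make_data(n_test)
+
+for n_features in [10, 50, 90, 100, 110, 200, 1000, 5000]:
+    W = rng.normal(size=(d, n_features)) / np.sqrt(d)  # fixed random ReLU features
+    Phi_train = np.maximum(X_train @ W, 0.0)
+    Phi_test = np.maximum(X_test @ W, 0.0)
+    # lstsq returns the minimum-norm solution once the model is able to interpolate.
+    beta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
+    train_mse = np.mean((Phi_train @ beta - y_train) ** 2)
+    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
+    print(f"{n_features:5d} features: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
+```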
+
+The double descent phenomenon suggests that when doing deep learning with large datasets, more parameters are generally better.
+Finding the optimal number of parameters is one of the goals of hyperparameter tuning, the process of choosing the architecture and the values of input parameters that are not varied during training.
 This is computationally expensive and often not done; however, it is important when making claims about the superiority of one machine learning method over another.
 Several examples have now been uncovered where a new method was claimed to be superior to a baseline method (such as an LSTM or a vanilla CNN), but the difference was later found to disappear after sufficient hyperparameter tuning [@tag:Sculley2018].
 A related practice that should be more widely adopted is the "ablation study", in which parts of a network are removed and the network is retrained, as this helps clarify the importance of different components, including any novel ones [@tag:Sculley2018].
@@ -187,8 +190,13 @@ Mordvintsev et al. [@tag:Mordvintsev2015_inceptionism] leveraged caricaturizatio
 Activation maximization can reveal patterns detected by an individual neuron in the network by generating images which maximally activate that neuron, subject to some regularizing constraints.
 This technique was first introduced in Ehran et al. [@tag:Ehran2009_visualizing] and applied in subsequent work [@tag:Simonyan2013_deep; @tag:Mordvintsev2015_inceptionism; @tag:Yosinksi2015_understanding; @tag:Mahendran2016_visualizing].
 Lanchantin et al. [@tag:Lanchantin2016_motif] applied class-based activation maximization to genomic sequence data.
+In the case of image data, naive activation maximization leads to images that look like noise.
+Thus, smoothness constraints or naturalness priors must be incorporated to obtain interpretable results [@doi:10.23915/distill.00007].
 One drawback of this approach is that neural networks often learn highly distributed representations where several neurons cooperatively describe a pattern of interest.
 Thus, visualizing patterns learned by individual neurons may not always be informative.
+The distributed nature of these representations is related to a curious finding by Szegedy et al., who showed that maximizing a random linear combination of units from a given layer (in effect, a random rotation or change of basis) instead of a single unit yields similar types of visualizations [@arxiv:1312.6199].
+Consequently, referring to a single neuron as an "X detector" on the basis of this technique may not accurately reflect the underlying mechanism of the network.
+
 
 #### RNN-specific approaches
 

From 9e6b23e80e1a371fa72e122cf3010d0a62e521b5 Mon Sep 17 00:00:00 2001
From: dan
Date: Wed, 12 Aug 2020 10:59:25 -0400
Subject: [PATCH 3/3] improve edits to 06.discussion.md on interpretability,
 tradeoff -> trade-off

---
 content/06.discussion.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/06.discussion.md b/content/06.discussion.md
index b45f8c0b..6e895aa1 100644
--- a/content/06.discussion.md
+++ b/content/06.discussion.md
@@ -6,13 +6,13 @@ Here we examine these factors that may impede further progress, ask what steps h
 ### Preventing overfitting and hyperparameter tuning
 
 Some of the challenges in applying deep learning are shared with other machine learning methods.
-In particular, many problem-specific optimizations described in this review reflect a recurring universal tradeoff---controlling the flexibility of a model in order to maximize its ability to generalize from the training set to the test set.
+In particular, many problem-specific optimizations described in this review reflect a recurring universal trade-off---controlling the flexibility of a model in order to maximize its ability to generalize from the training set to the test set.
 Methods for adjusting the flexibility of deep learning models include dropout, reduced data projections, and transfer learning (described below).
 One way of understanding such model optimizations is that they incorporate external information to limit model flexibility and thereby improve predictions.
-This balance is formally described as a tradeoff between "bias and variance"
+This balance is formally described as a trade-off between "bias and variance"
 [@tag:goodfellow2016deep].
 
-Although the bias-variance trade-off is important to take into account for many classical statistical learning models with a small number of parameters, recent empirical and theoretical observations suggest that models with a large number of parameters do not exhibit the tradeoff as expected [@tag:Belkin2019_PNAS; @tag:Zhang2017_generalization; @tag:Lin2017_why_dl_works].
+Although the bias-variance trade-off is important to take into account for many classical statistical learning models with a small number of parameters, recent empirical and theoretical observations suggest that models with a large number of parameters do not exhibit the trade-off as expected [@tag:Belkin2019_PNAS; @tag:Zhang2017_generalization; @tag:Lin2017_why_dl_works].
 It has been demonstrated that poor generalization (high test error) can often be remedied by adding more layers and increasing the number of free parameters, in conflict with classic bias-variance theory.
 This phenomenon, known as "double descent", has been found in neural networks, decision trees, and a few other types of models [@arxiv:1903.07571]. It has been interpreted as indicating that deep neural networks achieve their best performance when they possess enough parameters to smoothly interpolate their training data, resulting in near-zero training error while avoiding the overshoot and undershoot typical of overfitting [@tag:Belkin2019_PNAS; @doi:10.1016/j.neuron.2019.12.002].