# Update 06.know-your-problem.md #258

Merged: 7 commits, Oct 16, 2020
content/06.know-your-problem.md: 75 changes (34 additions & 41 deletions)
@@ -1,49 +1,42 @@
## Tip 4: Know your data and your question {#know-your-problem}

Having a well-defined scientific question and a clear analysis plan is crucial for carrying out a successful deep learning project.
-Just like it would be inadvisable to step foot in a laboratory and begin experiments without having a defined endpoint, a deep learning project should not be undertaken without preparation.
-Foremost, it is important to assess if a dataset exists that can answer the biological question of interest for the given deep learning model; obtaining said data and associated metadata and reviewing the study protocol should be pursued as early on in the project as possible.
-A publication or resource might purportedly offer a dataset that seems to be a good fit to test your hypothesis, but the act of obtaining it can reveal numerous problems.
-It may be unstructured when it is supposed to be structured, crucial metadata such as sample stratification are missing, or the usable sample size is different than what is reported.
-Data collection should be documented or a data collection protocol should be created and specified in the project documentation.
-Information such as the resource used, the date downloaded, and the version of the dataset, if any, will help minimize operational confusion and will allow for transparency during the publication process.
+Just like it would be inadvisable to set foot in a laboratory and begin experiments without having a defined endpoint, a deep learning project should not be undertaken without defined goals.
+Foremost, it is important to assess whether a dataset exists that can answer the biological question of interest using a deep learning-based approach.
+If so, obtaining this data (and associated metadata), and reviewing the study protocol, should be pursued as early on in the project as possible.
+This can help to ensure that the data are as expected and can prevent the wasted time and effort that occur when issues are discovered later on in the analytic process.
+For example, a publication or resource might purportedly offer an appropriate dataset that is found to be inadequate upon acquisition.
+The data may be unstructured when it is supposed to be structured, crucial metadata such as sample stratification might be missing, or the usable sample size may be different than expected.
+Any of these data issues might limit a researcher's ability to use DL to address the biological question at hand, or might otherwise require adjustment before DL can be used.
+Data collection should also be carefully documented, or a data collection protocol should be created and specified in the project documentation.
+Information about the resources used, download dates, and dataset versions is critical to preserve.
+Doing so will help to minimize operational confusion and will increase the reproducibility of the analysis.
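
One lightweight way to preserve this information is a small, machine-readable provenance record stored alongside the dataset. The sketch below is a minimal illustration only; the field names, URL, and file path are hypothetical rather than any standard schema:

```python
# Minimal, hypothetical provenance record for a downloaded dataset.
# Field names and values are illustrative, not a standard schema.
import json
from datetime import date

provenance = {
    "resource": "https://example.org/datasets/cohort-a",  # hypothetical source
    "dataset_version": "v2.1",
    "download_date": date.today().isoformat(),
    "checksum_md5": "<checksum of the downloaded archive>",
    "notes": "Sample stratification metadata missing; requested from authors.",
}

with open("data_provenance.json", "w") as handle:
    json.dump(provenance, handle, indent=2)
```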

-Once the dataset is obtained, it is easy to begin analysis without a good understanding of the study design, namely why the data was collected and how.
-Metadata has been standardized in many fields and can help with this (for example, see [@doi:10.1038/ng1201-365]), but if at all possible, seek out a subject matter expert who has experience with this type of data.
-Receiving first-hand knowledge of the “gotchas" of a dataset will minimize the amount of guesswork and increase the success rate of a deep learning project.
-For example, if the main reason why the data was collected was to test the impact of an intervention, then it may be the case that a randomized controlled trial was performed.
-However, it is not always possible to perform a randomized trial for ethical or practical reasons.
-Therefore, an observational study design is often considered, with the data either prospectively or retrospectively collected.
-In order to ensure similar distributions of important characteristics across study groups in the absence of randomization, individuals may be matched based on age, gender, or weight.
-Study designs will often have different assumptions and caveats, and these cannot be ignored during a data analysis.
-Many datasets are now passively collected or do not have a specific design, but even in this case it is important to know how individuals or samples were treated.
-Samples originating from the same study site, oversampling of ethnic groups or zip codes, and sample processing differences are all sources of variation that need to be accounted for.
-
-In all cases, investigators should consider the extent to which the outcome of interest is likely to be predictable from the input data and begin by thoroughly inspecting the input data.
-Data exploration with unsupervised learning and data visualization can reveal the biases and technical artifacts in these datasets, providing a critical first step to assessing data quality before any deep learning model is applied.
-In some cases, these analyses can identify biases from known technical artifacts or sample processing which can be corrected through preprocessing techniques to support more accurate application of deep leaning models for subsequent prediction or feature identification problems from those datasets.
-
-Systematic biases, which can be induced by confounding variables, for example, can lead to artifacts or so-called "batch effects."
-As a consequence, models may learn to rely on correlations that are irrelevant in the scientific context of the study and may result in misguided predictions and misleading conclusions [@doi:10.1038/nrg2825].
+Once the dataset is obtained, it is important to learn why and how the data were collected before beginning analysis.
+The standardized metadata that exist in many fields can help with this (for example, see [@doi:10.1038/ng1201-365]), but if at all possible, seek out a subject matter expert who has experience with the type of data you are using.
+Doing so will minimize guesswork and is likely to increase the success rate of a deep learning project.
+For example, one might presume that data collected to test the impact of an intervention derives from a randomized controlled trial.
+However, this is not always the case, as ethical or practical concerns often necessitate an observational study design that features prospectively or retrospectively collected data.
+In order to ensure similar distributions of important characteristics across study groups in the absence of randomization, such a study may have selected individuals in a fashion that best matches attributes such as age, gender, or weight.
+Passively collected datasets can have their own peculiarities, and other study designs can include samples that originate from the same study site, the oversampling of ethnic groups or zip codes, or sample processing differences.
+Such information is critical to accurate data analysis, and so it is imperative that practitioners learn about study design assumptions and data specificities prior to performing modeling.
Other study design considerations that should not be overlooked include knowing whether a study involves biological or technical replicates or both.
-For example, are some samples collected from the same individuals at different time points?
-Are those time points before and after some treatment?
-If one assumes that all the samples are independent but that is in fact not the case, a variety of issues may arise, including having a lower effective sample size than expected.
+For example, the existence in a dataset of samples collected from the same individuals at different time points can have significant effects on analyses that make assumptions about sample size and independence (that is, non-independence can lower the effective sample size).
+Another potential issue is the existence of systematic biases, which can be induced by confounding variables and can lead to artifacts or so-called "batch effects."
+As a consequence, models may learn to rely on the correlations that these systematic biases underpin, even though they are irrelevant to the scientific context of the study.
+This can lead to misguided predictions and misleading conclusions [@doi:10.1038/nrg2825].
+As described in [Tip 1](#concepts), unsupervised learning and other exploratory analyses can help to identify such biases in these datasets prior to applying a deep learning model.
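
As one concrete (and deliberately simplified) version of such a check, the sketch below projects a samples-by-features matrix onto its first two principal components and colors each sample by its batch; visible separation by batch hints at a batch effect worth investigating before modeling. The data matrix, batch labels, and injected shift are all placeholders:

```python
# Minimal sketch: use PCA to look for batch effects before any modeling.
# `X` and `batches` are synthetic stand-ins for a real dataset.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))                 # placeholder data matrix
batches = np.repeat(["site_A", "site_B"], 50)  # placeholder batch labels
X[50:, :5] += 2.0                              # inject a synthetic batch shift

pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
for batch in np.unique(batches):
    mask = batches == batch
    plt.scatter(pcs[mask, 0], pcs[mask, 1], label=batch, alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()  # separation of colors along a component suggests a batch effect
```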

-In general, deep learning has an increased tendency for overfitting compared to classical methods, due to the large number of parameters being estimated, making issues of adequate sample size even more important (see [Tip 7](#overfitting)).
-For a large dataset overfitting may not be a concern, but the modeling power of deep learning may lead to more spurious correlations and thus incorrect interpretation of results (see [Tip 9](#interpretation)).
-It is important to note that molecular or imaging datasets often require appropriate clinical or demographic data to support robust analyses; this must always be balanced with the need to protect patient privacy (see [Tip 10](#privacy)).
-Looking at these annotations can also clarify the study design (for example, by seeing if all the individuals are adolescents or women) or at least help the analyst employing deep learning to know what questions to ask.
+In general, deep learning has an increased tendency towards overfitting (that is, performing well on the training data but failing to generalize) compared to classical methods.
+This is often a result of the large number of parameters being estimated in a DL model, and has direct implications for the importance of working with adequate sample sizes (see [Tip 7](#overfitting)).
+For a large dataset, overfitting may not be a concern, but the modeling power of deep learning may lead to more spurious correlations, and thus incorrect interpretations or conclusions (see [Tip 9](#interpretation)).
+It is important to note that molecular or imaging datasets often require appropriate clinical or demographic data to support robust analyses, although this must always be balanced with the need to protect patient privacy (see [Tip 10](#privacy)).
+Looking at these annotations can also clarify the study design (for example, checking for imbalances in sample age or gender), which can help the practitioner identify which questions are most appropriate.
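
One simple way to see this failure mode is to compare training and held-out performance when the parameter count dwarfs the sample size. In the toy sketch below, the labels are random, so any training accuracy above chance reflects pure memorization; the model and data are placeholders, not a recommended configuration:

```python
# Toy sketch: overfitting appears as a large gap between training and
# validation scores. Random labels mean there is nothing real to learn.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))        # many features, few samples
y = rng.integers(0, 2, size=200)       # random labels

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000,
                      random_state=0).fit(X_tr, y_tr)

print("training accuracy:  ", model.score(X_tr, y_tr))    # typically near 1.0
print("validation accuracy:", model.score(X_val, y_val))  # typically near 0.5
```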

-Data simulation is a powerful approach to develop an understanding of how data and analytical methods interact.
-In data simulation, a model is used to learn the true distribution of a training set for the purpose of creating new data points.
-Often, researchers may perform simulations under some assumptions about the data generating process to identify useful model architectures and hyperparameters.
-Simulated datasets can be used to verify the correctness of a model’s implementation.
-To accurately test the performance of the model, it is important that simulated datasets be generated for a range of parameters.
-For example, varying the parameters to violate the model's assumptions can test the sensitivity of the model's performance.
-Parameter tuning the simulation can help researchers identify the key features that drive method performance.
-In other cases, neural networks can be used to simulate data to better understand how to structure analyses.
-For example, it is possible to study how analytical strategies cope with varying number of noise sources by using neural networks to simulate genome-wide data [@doi:10.1101/2020.05.03.066597].
-Simulating data from assumptions about the data generating distribution can help to debug or characterize deep learning models, and deep learning models can also simulate data in cases where it is hard to make reasonable assumptions from first principles.
+Finally, data simulation is a powerful approach for creating additional data with which to test models.
+In such a scenario, a model is used to estimate the distribution of a training set, and this estimated distribution is then used to create new data points.
+Simulated data can be used to verify the correctness of a model’s implementation, as well as to identify useful model architectures and hyperparameters.
+Therefore, simulations should be performed under reasonable assumptions, and across a wide range of parameters.
+For example, varying the parameters so widely as to violate the model's assumptions can allow for testing the model's robustness and sensitivity to these assumptions.
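
As a minimal illustration of this idea, the sketch below simulates data from a known generative process at several noise levels and reports held-out performance (a simple linear model stands in for a deep network here; the distributions and parameter grid are assumptions for demonstration):

```python
# Minimal sketch: simulate data across a range of noise levels and watch
# how held-out performance degrades as the signal is drowned out.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
true_coefs = rng.normal(size=20)      # known ground-truth generative model

for noise_sd in [0.1, 1.0, 5.0]:      # widen this grid to stress assumptions
    X = rng.normal(size=(500, 20))
    y = X @ true_coefs + rng.normal(scale=noise_sd, size=500)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(f"noise_sd={noise_sd}: held-out R^2 = "
          f"{r2_score(y_te, model.predict(X_te)):.2f}")
```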

-Basically, thoroughly study your data and ensure that you understand its context and peculiarities _before_ jumping into deep learning.
+Overall, practitioners should make sure to thoroughly study their data and understand its context and peculiarities _before_ moving on to performing deep learning.