From 07bc02d630b7add6615a5f0d108e71ca53b7f323 Mon Sep 17 00:00:00 2001 From: roblanf Date: Thu, 6 Dec 2018 14:40:05 +1100 Subject: [PATCH] update readme to reflect promethion changes --- README.md | 81 ++++++++++++++++++++++++++++++++++--------------------- 1 file changed, 51 insertions(+), 30 deletions(-) diff --git a/README.md b/README.md index 128b00b..353f129 100644 --- a/README.md +++ b/README.md @@ -1,13 +1,14 @@ -# Fast and effective quality control for MinION sequencing data +# Fast and effective quality control for MinION and PromethION sequencing data - [What and Why?](https://github.com/roblanf/minion_qc#what-and-why) - [Quick start](https://github.com/roblanf/minion_qc#quick-start) - [Commandline options](https://github.com/roblanf/minion_qc#commandline-options) - [Installation](https://github.com/roblanf/minion_qc#installation) - [Dependencies](https://github.com/roblanf/minion_qc#dependencies) -- [Output details](https://github.com/roblanf/minion_qc#output-details) +- [Output details for MinION](https://github.com/roblanf/minion_qc#output-details-for-minion) - [Analysing a single flowcell](https://github.com/roblanf/minion_qc#analysing-a-single-flowcell) - [Analysing multiple flowcells](https://github.com/roblanf/minion_qc#analysing-multiple-flowcells) +- [Output details for PromethION](https://github.com/roblanf/minion_qc#output-details-for-promethion) ## Citation @@ -18,7 +19,7 @@ https://doi.org/10.1093/bioinformatics/bty654 ## What and Why? -This script will give you a range of diagnostic plots and data for quality control of sequencing data from Oxford Nanopore's MinION sequencer. +MinIONQC gives you a range of diagnostic plots and data for quality control of sequencing data from Oxford Nanopore's MinION and PromethION sequencers. There are lots of tools that do related things, but they mostly focus on getting data out of the fastq or fast5 files, which is slow and computationally intensive. 
The benefit of MinIONQC is that it works directly with the `sequencing_summary.txt` files produced by ONT's Albacore or Guppy base callers. This makes `MinIONQC` a lot quicker than most other things out there, and crucially allows the quick-and-easy comparison of data from multiple flowcells. For example, it takes about a minute to analyse a 4GB flowcell using a single processor on my laptop. @@ -26,7 +27,7 @@ If you don't already have `sequencing_summary.txt` files for your data, you can ## Quick start -The input for the script is one or more `sequencing_summary.txt` files produced by ONT's Albacore or Guppy basecalles, based on data from one or more MinION flowcells. +The input for the script is one or more `sequencing_summary.txt` files produced by ONT's Albacore or Guppy basecallers, based on data from one or more MinION or PromethION flowcells. MinIONQC autodetects which kind of flowcell your data came from. To run it on one `sequencing_summary.txt` file, just point it to a single `sequencing_summary.txt` file like this: @@ -45,7 +46,7 @@ The script will simply look for all `sequencing_summary.txt` files recursively i * `MinIONQC.R`: path to this script * `path/to/parent_directory`: path to an input directory that contains one or more `sequencing_summary.txt` files in sub-directories -You'll see a series of plots in the output directory, and a YAML file that describes your output. These, and other command line options, are described below. +You'll see a series of plots in the output directory, and a YAML file that describes your output (you can open this in any text editor). These, and other command line options, are described below. Note: for direct RNA runs, any reads from the control RNA sequence (i.e. anything in your summary file labelled "YHR174W") are removed prior to analysis. 
@@ -116,19 +117,19 @@ install.packages(c("data.table", If you want to run the example input, one option is to change directories to the file containing the `MinonQC.R` script and type: ``` -Rscript MinIONQC.R -i example_input -o my_example_output -p 2 +Rscript MinIONQC.R -i example_input_minion -o my_example_output_minion -p 2 ``` -## Output details +## Output details for MinION The following output was created by running the script on the example input files, which contains data from two flowcells from our lab. ``` -Rscript MinIONQC.R -i example_input -o example_output -s TRUE -p 2 +Rscript MinIONQC.R -i example_input_minion -o example_output_minion -s TRUE -p 2 ``` This runs the analysis with two processors, and produces smaller plots suitable for presentations or papers (where possible). The defualt (i.e. removing the `-s` option above) is to produce larger plots designed for viewing on full size monitors. -Two kinds of output are produced. Output for each flowcell, and then additional output for the combined flowcells to allow for comparison. The script will produce 10 files to describe each flowcell, and 9 files to describe all flowcells combined (if you have analysed more than one flowcell). I explain each of these files below, with examples from the `example_output/RB7_A2/` folder for a single flowcell, and examples from the `example_output/combinedQC/` folder for multiple flowcells. +Two kinds of output are produced. Output for each flowcell, and then additional output for the combined flowcells to allow for comparison. The script will produce 10 files to describe each flowcell, and 9 files to describe all flowcells combined (if you have analysed more than one flowcell). I explain each of these files below, with examples from the `example_output_minion/RB7_A2/` folder for a single flowcell, and examples from the `example_output_minion/combinedQC/` folder for multiple flowcells. 
There are two main colour schemes used in the plots: @@ -204,86 +205,106 @@ notes: ultralong reads refers to the largest set of reads with N50>100KB #### length_histogram.png Read length on a log10 scale (x-axis) vs counts (y-axis). This is a standard plot for long-read sequencing. Although it's obviously useful, it still doesn't tell you how much data (i.e. your total yield) you have for reads above a given length though. For that, see the `yield_by_length` and `yield_over_time` plots. Of note in our data are the large number of very short reads. We don't think these are actually DNA fragments. Instead, we think they are contaminant molecules blocking pores (see below for more on this). In any case, it is exactly this kind of observation that led us to continue developing these QC tools. Knowing what's holding your performance back is key to getting better. -![length_histogram](example_output/RB7_A2/length_histogram.png) +![length_histogram](example_output_minion/RB7_A2/length_histogram.png) #### q_histogram.png Mean Q score for a read (x-axis) vs counts (y-axis). We frequently observe a collection of 'good' reads with Q scores greater than about 7, and a collection of 'bad' reads, which Q scores that cluster around 4. Typically, one might filter the 'bad' reads out before assembly, but there's good evidence in the literature that they contain useful information if you treat them right. -![q_histogram](example_output/RB7_A2/q_histogram.png) +![q_histogram](example_output_minion/RB7_A2/q_histogram.png) #### length_vs_q.png Read length on a log10 scale (x-axis) vs mean Q score (y-axis). Points are coloured by the events per base. 'Good' reads are ~1.5 events per base, and 'bad' reads are >>1.5 events per base. We often see a group of very short, 'bad', low-quality reads. We think this is something to do with our DNA extractions, becuase not everybody gets the same thing. 
In this plot, the point size, transperency, and plot size are always the same no matter the input data. This facilitates comparison of these plots among flowcells and labs - those with more reads will look darker because there will be more points. If you have a 1D2 run, there will be no colours on this plot, because Albacore doesn't report the number of events per read when it combines the two reads of a 1D2 run into a single read. -![length_vs_q](example_output/RB7_A2/length_vs_q.png) +![length_vs_q](example_output_minion/RB7_A2/length_vs_q.png) #### length_by_hour.png The mean read length (y-axis) over time (x-axis). This let's you see if you are running out of longer reads as the run progresses. Muxes, which occur every 8 hours, are shown as red dashed lines. -![length_by_hour](example_output/RB7_A2/length_by_hour.png) +![length_by_hour](example_output_minion/RB7_A2/length_by_hour.png) #### q_by_hour.png The mean Q score (y-axis) over time (x-axis). We often see that our Q scores drop noticably over time - presumably this is a result of the pores wearing out, or the DNA accumulating damage, or both. Muxes, which occur every 8 hours, are shown as red dashed lines -![q_by_hour](example_output/RB7_A2/q_by_hour.png) +![q_by_hour](example_output_minion/RB7_A2/q_by_hour.png) #### reads_per_hour.png The number of reads (y-axis) obtained in each hour (x-axis). Muxes (every 8 hours) are plotted as red dashed lines. You can typically see that each mux results in a noticable increase in the number of reads per hour. -![q_by_hour](example_output/RB7_A2/reads_per_hour.png) +![q_by_hour](example_output_minion/RB7_A2/reads_per_hour.png) #### yield_by_length.png The total yield in bases (y-axis) for any given minimum read length (x-axis). This is just like the 'reads' table in the `summary.yaml` output, but done across all read lengths up to the read length that includes 99% of the total yield. 
For example, to read off the amount of bases you have sequenced from reads of at least 25KB, just go up from 25KB on the x-axis to the line, then left to the y-axis, and you should get an answer of ~2.5GB. This can be particularly useful when your aim is to achieve a particular total yield of reads longer than some predefined length from a series of flowcells. This is often the case for genome sequencing projects. -![yield_by_length](example_output/RB7_A2/yield_by_length.png) +![yield_by_length](example_output_minion/RB7_A2/yield_by_length.png) #### yield_over_time.png The total yield (y-axis) over the time that the flowcell was run. This can help to identify any issues that occurred during the run of a particular flowcell. Muxes are shown as dashed red lines. This one looks fine, and shows the expected boosts from each mux. -![yield_over_time](example_output/RB7_A2/yield_over_time.png) +![yield_over_time](example_output_minion/RB7_A2/yield_over_time.png) #### channel_summary.png Histograms of total bases, total reads, mean read length, and median read length that show the variance across the 512 available channels. Repeated for all data and reads with Q>10. -![channel_summary](example_output/RB7_A2/channel_summary.png) +![channel_summary](example_output_minion/RB7_A2/channel_summary.png) + #### flowcell_overview.png The 512 channels are laid out as on the R9.5 flowcell. Each panel of the plot shows time on the x-axis, and read length on the y-axis. Points are coloured by the Q score. This gives a little insight into exactly what was going on in each of your channels over the course of the run. You'll notice that in the example output for `RB7_D3` (the second plot below) you can see clearly that there was a bubble on the right-hand-side of the flowcell. The other thing of note in these plots is the frequent (and sometimes extended) periods in which some pores produce only very short, very low quality 'reads'. 
Our current best guess is that this is due to residual contaminants in our DNA extractions blocking the pores. A blocked pore looks like a change in current. And if the blockage is persistent (e.g. a large molecule just sitting blocking the pore, occasionally letting some current through) this could produce exactly this kind of pattern. Hopefully you don't see this in your samples. We work with plants, so this is the best we've been able to do so far. -![flowcell_channels_epb](example_output/RB7_A2/flowcell_overview.png) -![flowcell_channels_epb](example_output/RB7_D3/flowcell_overview.png) +![flowcell_channels_epb](example_output_minion/RB7_A2/flowcell_overview.png) +![flowcell_channels_epb](example_output_minion/RB7_D3/flowcell_overview.png) + + +#### gb_per_channel_overview.png +This is really just a summary of the flowcell_overview plot. It shows the number of gigabases sequenced for each channel on the flowcell, with channels organised according to their physical distribution on the flowcell. The two panels show all reads (left) and all reads above your chosen Q score cutoff (right). +![gb_per_channel_overview.png](example_output_minion/RB7_A2/gb_per_channel_overview.png) ### Analysing multiple flowcells -9 files are produced that summarise the combined data across all flowcells. Examples are in the `example_output/combinedQC/` folder. +9 files are produced that summarise the combined data across all flowcells. Examples are in the `example_output_minion/combinedQC/` folder. #### summary.yaml As above, but for all data combined across flowcells. Useful for knowing where your project is up to so far. #### combined_length_histogram.png Read length, on a log10 scale, from the combined data on the x-axis, and read counts on the y-axis. 
-![combined_length_histogram](example_output/combinedQC/combined_length_histogram.png) +![combined_length_histogram](example_output_minion/combinedQC/combined_length_histogram.png) #### combined_q_histogram.png Mean Q score for a read on the x-axis, and counts on the y-axis. From the combined data across all flowcells. -![combined_q_histogram](example_output/combinedQC/combined_q_histogram.png) +![combined_q_histogram](example_output_minion/combinedQC/combined_q_histogram.png) #### combined_yield_by_length.png The total yield (y-axis) for any given minimum read length (x-axis), from all data combined. As above, the maximum read length in the plot is the one that includes 99% of the total yield. -![combined_yield_by_length](example_output/combinedQC/combined_yield_by_length.png) +![combined_yield_by_length](example_output_minion/combinedQC/combined_yield_by_length.png) #### length_distributions.png Read length on a log10 scale (x-axis) vs density (y-axis). One line per flowcell. This allows for comparison of read length distributions across flowcells, but it's hard to use these kinds of plots to compare yields, because the height of a plot depends on how much of the read distribution focussed in that area. To compare yields more directly, use the `yield_by_length` and `yield_over_time` plots. -![length_distributions](example_output/combinedQC/length_distributions.png) +![length_distributions](example_output_minion/combinedQC/length_distributions.png) #### q_distributions.png Mean Q score of a read (x-axis) vs density (y-axis). One line per flowcell. -![q_distributions](example_output/combinedQC/q_distributions.png) +![q_distributions](example_output_minion/combinedQC/q_distributions.png) #### length_by_hour.png The readlength (y-axis) over time (x-axis). 
Muxes, which occur every 8 hours, are shown as red dashed lines -![length_by_hour](example_output/combinedQC/length_by_hour.png) +![length_by_hour](example_output_minion/combinedQC/length_by_hour.png) #### q_by_hour.png The mean Q score accross reads (y-axis) over time (x-axis). Muxes, which occur every 8 hours, are shown as red dashed lines -![q_by_hour](example_output/combinedQC/q_by_hour.png) +![q_by_hour](example_output_minion/combinedQC/q_by_hour.png) #### yield_by_length.png The total yield (y-axis) for any given minimum read length (x-axis). Each flowcell has its own colour. All reads are in the top panel, and just the reads above your Q cutoff are in the bottom panel. This is just like the 'reads' table in the `summary.yaml` output, but done across all read lengths up to the read length that includes 99% of the total yield for the flowcell with the highest total yield. The comparison of the two flowcells below shows the effect of using a blue pippen for size selection, removing fragments <20KB. For the one in blue, we used a bead-based size selection which removes just the smallest fragments <~1KB. The result is that the two flowcells have very similar overall yields, but quite different profiles. -![yield_by_length](example_output/combinedQC/yield_by_length.png) +![yield_by_length](example_output_minion/combinedQC/yield_by_length.png) #### yield_over_time.png The total yield (y-axis) over the time that the flowcell was run (x-axis). This can help to identify any issues that occurred during the run of a particular flowcell. Muxes are shown as dashed red lines. This plot shows that something happened to flowcell RB7_D3 at ~17 hours, which stopped it from working until the next mux at 24 hours. 
-![yield_over_time](example_output/combinedQC/yield_over_time.png) +![yield_over_time](example_output_minion/combinedQC/yield_over_time.png) + + +## Output details for PromethION + +Most of the plots and files for a PromethION flowcell are the same as for a MinION flowcell. However, given the huge volumes of data produced by a single PromethION flowcell, there are a couple of important differences. Below I just show the plots that differ from those described above. One thing to note is that the flowcell overview plot will not be made - it's just not possible to put a point for each read on one plot in a way that is actually useful. + +#### gb_per_channel_overview.png +Number of gigabases sequenced for each channel on the flowcell, with channels organised according to their physical distribution on the flowcell. The two panels show all reads (top) and all reads above your chosen Q score cutoff (bottom). +![gb_per_channel_overview.png](example_output_promethion/gb_per_channel_overview.png) + +#### length_vs_q.png +Read length on a log10 scale (x-axis) vs mean Q score (y-axis). The colour shows the number of reads in each region of the plot. +![length_vs_q](example_output_promethion/length_vs_q.png) +