Analysis template for analysing, visualising and communicating the findings of Illumina small RNA sequencing data, with a focus on small non-coding RNAs that are sometimes forgotten (i.e. looking at and beyond miRNAs). It is set up to analyse human data, but can be adapted to other species with some tweaks to the code, namely the pipeline inputs. This template uses the count data and other QC outputs from both the smrnaseq and exceRpt pipelines, so I'm assuming you've run these pipelines on your data before using this template. See a live example github page here.
- smncRNA analysis template
- Table of contents
- What this template can do
- What this template can't do
- What's this template gonna do?
- Prerequisite software
- Testing
- How to use this template
- 1. Fork the template repo to a personal or lab account
- 2. Take this template to the data on your local machine
- 3. Format your input files
- 4. Run the template
- 5. Commit and push to your forked version of the github repo
- 6. Repeat step 5 each time you re-run the analysis with different parameters
- 7. Create a github page (optional)
- 8. Contribute back!
This template uses open source tools and includes several scripts for researchers to analyse, explore and communicate findings to other researchers through interactive tables and plots. The results of this template can be served as a github page that renders html files and provides links to RShiny apps hosted on shinyapps.io. This means a single weblink can be given to your collaborators to provide them with all your analysis code and results. Most importantly, this template puts you in good stead to ensure your analysis is reproducible!
What this template can't do:
- Tell you what analysis tools and parameters are appropriate for your data or research question. The assumption is that the tools this template uses are tools you've intentionally chosen to use, and that you will actively adapt this template for your use-case
- Account for different operating systems and compute infrastructures. This means some UNIX experience may be required to run the pipelines/scripts on your operating system or job scheduler. I won't tell you how to do this here, but the pipelines and tools used here are generally portable (i.e. able to be run on different operating systems) and I've used renv environments to make the R code more portable
Beyond the QC the pipelines undertake, additional QC summarises the read counts and mapping rates of the data.
The count datasets output by the pipelines are analysed in R in a differential expression analysis across all of these RNA species to find differentially expressed RNAs. Two methods are used for the differential expression analysis, namely limma/voom and DESeq2.
Beyond a traditional differential expression analysis, the data is prepared and presented in an interactive RShiny app that allows the user to explore RNA expression (both raw counts and counts per million).
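To make "counts per million" concrete, here is a minimal sketch in Python of the plain textbook definition: each raw count divided by the sample's total count, multiplied by one million. The template itself does this work in R and its exact normalisation isn't described here, so this is an illustration only; the sample and RNA names are made up.

```python
def counts_per_million(counts):
    """Convert raw counts to counts per million (CPM) within each sample.

    counts: dict mapping sample name -> dict mapping RNA name -> raw count
    """
    cpm = {}
    for sample, rna_counts in counts.items():
        library_size = sum(rna_counts.values())
        if library_size == 0:
            # No reads in this sample, so report zeros rather than dividing by zero
            cpm[sample] = {rna: 0.0 for rna in rna_counts}
            continue
        cpm[sample] = {
            rna: count / library_size * 1_000_000
            for rna, count in rna_counts.items()
        }
    return cpm


# Hypothetical raw counts for two samples and two RNAs
raw = {
    "sample_1": {"rna_a": 750, "rna_b": 250},
    "sample_2": {"rna_a": 10, "rna_b": 30},
}
cpm = counts_per_million(raw)
print(cpm["sample_1"]["rna_a"])
```

CPM makes samples with different sequencing depths comparable, which is why the app offers it alongside raw counts.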
Interactive MDS and PCA plots are also created to explore clusters of RNAs/samples in the data.
Lastly, the composition of the RNA species identified in each treatment group is compared.
Prerequisite software: python 3, cut, zgrep, GNU software, pandas, PyYAML, natsort, R, git
This template has been validated to work on:
- Pipeline outputs created using nextflow 21.04.0, singularity 3.7.2, smrnaseq version 1.1.0 and exceRpt version 4.3.2
- R version 4.0.5
- CentOS Linux 7
Test data is available in the test directory, including fastq data, associated metadata, and the smrnaseq and exceRpt pipeline outputs from running on this test fastq data
Clone the forked smncrna_analysis_template repo to the machine you'd like to analyse the data on
git clone https://github.com/leahkemp/smncrna_analysis_template.git
Fastq files, named in the format sample.fastq.gz:
- one fastq file per sample
- sample name matching the sample names in the metadata file
- ".fastq.gz" extension
For example see the test fastq files here
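The naming rules above can be checked before running anything. The following is a minimal sketch, not part of the template: the function name `check_fastq_names` and the demo file names are hypothetical, and it assumes all fastq files sit directly in one directory.

```python
import tempfile
from pathlib import Path

FASTQ_SUFFIX = ".fastq.gz"


def check_fastq_names(fastq_dir, metadata_samples):
    """Compare the files in fastq_dir against the sample names in the metadata.

    Returns three sorted lists: samples with no fastq file, fastq files
    with no matching sample, and files without the .fastq.gz extension.
    """
    files = [p.name for p in Path(fastq_dir).iterdir() if p.is_file()]
    bad_extension = sorted(f for f in files if not f.endswith(FASTQ_SUFFIX))
    # Strip the extension to recover the sample name from each valid file
    found = {f[: -len(FASTQ_SUFFIX)] for f in files if f.endswith(FASTQ_SUFFIX)}
    expected = set(metadata_samples)
    return sorted(expected - found), sorted(found - expected), bad_extension


# Demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_1.fastq.gz", "sample_2.fastq.gz", "sample_3.fq.gz"):
        (Path(tmp) / name).touch()
    missing, unexpected, bad_extension = check_fastq_names(
        tmp, ["sample_1", "sample_3"]
    )

print("samples without a fastq file:", missing)
print("fastq files not in the metadata:", unexpected)
print("files with the wrong extension:", bad_extension)
```

In the demo, `sample_3` is reported as missing because its file uses `.fq.gz` rather than `.fastq.gz`, and `sample_2` is reported as a fastq file that the metadata doesn't mention.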
Required columns in the metadata file:
- "sample"
  - must be titled "sample"
  - must contain a row with a unique sample name/id for each fastq file present in the directory of fastq files to be analysed
- "treatment"
  - must be titled "treatment"
Additional columns:
- you can include additional columns with categorical variables beyond the "treatment" column
- these columns will be used in the expression_plotting app so the user can explore raw counts and counts per million by "treatment" or by these additional variables
Other notes:
- you can't have any duplicate column names, including names that differ only by case, e.g. two columns named "sample" and "Sample"
- make sure every sample in the fastq directory to be analysed is included in the metadata file and is associated with a treatment group
For example see the test metadata file here
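The metadata rules above can be sketched as a small validation function. This is an illustration rather than anything shipped with the template: it assumes the metadata file is comma-separated (check the test metadata file for the actual format), and the function name `validate_metadata` and the "batch" column are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "treatment")


def validate_metadata(handle):
    """Return a list of human-readable problems found in a metadata table."""
    reader = csv.reader(handle)
    header = next(reader, [])
    rows = [dict(zip(header, row)) for row in reader]
    problems = []

    for column in REQUIRED_COLUMNS:
        if column not in header:
            problems.append(f'missing required column "{column}"')

    # Column names must be unique even when case is ignored
    lowered = [name.lower() for name in header]
    for name in sorted(set(lowered)):
        if lowered.count(name) > 1:
            problems.append(f'duplicate column name (ignoring case): "{name}"')

    if "sample" in header:
        samples = [row.get("sample", "") for row in rows]
        for sample in sorted(set(samples)):
            if samples.count(sample) > 1:
                problems.append(f'duplicate sample name: "{sample}"')

    if "sample" in header and "treatment" in header:
        for row in rows:
            if not row.get("treatment", "").strip():
                problems.append(f'sample "{row.get("sample", "")}" has no treatment')

    return problems


# Made-up metadata tables: one valid, one breaking several rules
good = "sample,treatment,batch\nsample_1,control,a\nsample_2,treated,b\n"
bad = "sample,Sample,treatment\nsample_1,x,control\nsample_1,y,\n"

good_problems = validate_metadata(io.StringIO(good))
bad_problems = validate_metadata(io.StringIO(bad))
for problem in bad_problems:
    print(problem)
```

To check a real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open(path, newline="")`.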
Set up ./config/config.yaml
Run/work through the master RMarkdown file. This will do the bulk of the analyses and generate several html files for data visualisation and csv files with processed data
To maintain reproducibility of your analysis, commit and push:
- All configuration files
- All run scripts
- All your documentation/notes
Push all the html/Rmd/txt files that you'd like to be rendered into a github page to github. Then create a github page. See this example live page and the underlying index.md file, which links all the html/Rmd/txt files included in the github page based on the files produced by this template.
- Raise issues in the issues page
- Create feature requests in the issues page
- Ask questions or start general discussion about this repo in the discussions page
- Contribute your code! Create your own branch from the development branch and create a pull request to the development branch once the code is on point!
Contributions and feedback are always welcome! 😊