Merge branch 'main' into use-tools-pkg

abcsFrederick · Aug 26, 2024 · b67417e · b67417e
2 parents 6dadb0a + 181a4bf
commit b67417e

Showing 5 changed files with 240 additions and 42 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -18,6 +18,7 @@
 - Add GUI instructions to the documentation website. (#38, @samarth8392)
 - The docs website now has a dropdown menu to select which version to view. The latest release is shown by default. (#150, @kelly-sovacool)
 - Show the name of the pipeline rather than the python script for CLI help messages. (#131, @kelly-sovacool)
+- Add an "Expected output" tab to the documentation website and update the FAQs. (#156, @samarth8392)
 
 ## RENEE 2.5.12
 

diff --git a/docs/RNA-seq/images/gui_nx_renee.png b/docs/RNA-seq/images/gui_nx_renee.png
diff --git a/docs/RNA-seq/output.md b/docs/RNA-seq/output.md
@@ -0,0 +1,207 @@
+After a successful `renee` run on multisample paired-end data, the following files and folders are created in the output folder.
+
+```bash
+renee_output/
+├── bams
+├── config 
+├── config.json # Contains the configuration and parameters used for this specific RENEE run
+├── DEG_ALL
+├── dryrun.{datetime}.log # Output from the dry-run of the pipeline
+├── FQscreen
+├── FQscreen2
+├── fusions
+├── kraken
+├── logfiles 
+├── nciccbr 
+├── preseq 
+├── QC
+├── QualiMap
+├── rawQC
+├── Reports
+├── resources
+├── RSeQC
+├── sample1.R1.fastq.gz -> /path/to/input/fastq/files/sample1.R1.fastq.gz
+├── sample1.R2.fastq.gz -> /path/to/input/fastq/files/sample1.R2.fastq.gz
+...
+..
+.
+├── sampleN.R1.fastq.gz -> /path/to/input/fastq/files/sampleN.R1.fastq.gz
+├── sampleN.R2.fastq.gz -> /path/to/input/fastq/files/sampleN.R2.fastq.gz
+├── STAR_files
+├── trim
+└── workflow
+```
+
+## Folder details and file descriptions
+
+### 1. `bams`
+
+Contains the STAR aligned reads for each sample analyzed in the run.
+
+```bash
+/bams/
+├── sample1.fwd.bw # forward strand bigwig files suitable for a genomic track viewer like IGV
+├── sample1.rev.bw # reverse strand bigwig files 
+├── sample1.p2.Aligned.toTranscriptome.out.bam # BAM alignments to transcriptome using STAR in two-pass mode
+├── sample1.star_rg_added.sorted.dmark.bam # Read groups added and duplicates marked genomic BAM file (using STAR in two-pass mode)
+├── sample1.star_rg_added.sorted.dmark.bam.bai
+...
+..
+.
+```
+
+### 2. `config`
+
+Contains config files for the pipeline.
+
+
+### 3. `DEG_ALL`
+
+Contains the RSEM output estimating gene and isoform expression levels for each sample, as well as combined data matrices across all samples.
+
+```bash
+/DEG_ALL/
+├── combined_TIN.tsv # RSeQC logfiles containing transcript integrity number information for all samples
+├── RSEM.genes.expected_count.all_samples.txt # Expected gene counts matrix for all samples (useful for downstream differential expression analysis)
+├── RSEM.genes.expected_counts.all_samples.reformatted.tsv # Expected gene counts matrix for all samples with reformatted gene symbols (format: ENSEMBLID | GeneName)
+├── RSEM.genes.FPKM.all_samples.txt # FPKM Normalized expected gene counts matrix for all samples 
+├── RSEM.genes.TPM.all_samples.txt # TPM Normalized expected gene counts matrix for all samples
+├── RSEM.isoforms.expected_count.all_samples.txt # File containing isoform level expression estimates for all samples.
+├── RSEM.isoforms.FPKM.all_samples.txt # FPKM Normalized expected isoform counts matrix for all samples 
+├── RSEM.isoforms.TPM.all_samples.txt # TPM Normalized expected isoform counts matrix for all samples
+├── sample1.RSEM.genes.results # Expected gene counts for sample 1
+├── sample1.RSEM.isoforms.results # Expected isoform counts for sample 1
+├── sample1.RSEM.stat # RSEM stats for sample 1
+│   ├── sample1.RSEM.cnt 
+│   ├── sample1.RSEM.model
+│   └── sample1.RSEM.theta
+├── sample1.RSEM.time # Run time log for sample 1
+...
+..
+.
+├── sampleN.RSEM.genes.results
+├── sampleN.RSEM.isoforms.results
+├── sampleN.RSEM.stat
+│   ├── sampleN.RSEM.cnt
+│   ├── sampleN.RSEM.model
+│   └── sampleN.RSEM.theta
+└── sampleN.RSEM.time
+
+```
+
+### 4. `FQscreen` and `FQscreen2`
+
+These folders contain results from a quality-control step that screens for different sources of contamination. FastQ Screen compares your sequencing data to a set of reference genomes to determine whether there is contamination, which lets you see whether the composition of your library matches what you expect. These results are plotted in the multiQC report.
+
+### 5. `fusions`
+
+Contains gene fusions output for each sample.
+
+```bash
+fusions/
+├── sample1_fusions.arriba.pdf
+├── sample1_fusions.discarded.tsv # Contains all events that Arriba classified as an artifact or that are also observed in healthy tissue. 
+├── sample1_fusions.tsv # Contains fusions for sample 1 which pass all of Arriba's filters. The predictions are listed from highest to lowest confidence. 
+├── sample1.p2.arriba.Aligned.sortedByCoord.out.bam # Sorted BAM file for Arriba's Visualization
+├── sample1.p2.arriba.Aligned.sortedByCoord.out.bam.bai
+├── sample1.p2.Log.final.out # STAR final log file
+├── sample1.p2.Log.out # STAR runtime log file
+├── sample1.p2.Log.progress.out # log files
+├── sample1.p2.Log.std.out # STAR runtime output log
+├── sample1.p2.SJ.out.tab #  Summarizes the high confidence splice junctions for sample 1
+├── sample1.p2._STARgenome # Extra files generated during STAR aligner 
+│   ├── exonGeTrInfo.tab
+│   ├── .
+│   ├── .
+│   └── transcriptInfo.tab 
+├── sample1.p2._STARpass1 # Extra files generated during STAR first pass 
+│   ├── .
+│   └── .
+...
+..
+.
+
+```
+
+### 6. `kraken`
+
+Contains per-sample Kraken output files. Kraken is run as a quality-control step to assess potential sources of microbial contamination. It is used in conjunction with Krona to produce interactive reports, which are stored in `.krona.html` files. These results are present in the multiQC report.
+
+### 7. `logfiles`
+
+Contains logfiles for the entire RENEE run, job error/output files for each individual job that was submitted to SLURM, and some other stats generated by different software. These files are important for diagnosing errors if the pipeline fails. The per-sample stats information is present in the multiQC report.
+
+```bash
+/logfiles/
+├── master.log # Logfile for the main (master) RENEE job
+├── mjobid.log # SLURM JOBID for the master RENEE job
+├── runtime_statistics.json # Runtime statistics for each rule in the RENEE run
+├── sample1.flagstat.concord.txt # sample mapping stats
+├── sample1.p2.Log.final.out # sample STAR alignment stats
+├── sample1.RnaSeqMetrics.txt # sample stats collected by Picard CollectRnaSeqMetrics
+├── sample1.star.duplic # Mark duplicate metrics
+...
+..
+.
+├── slurmfiles
+│   ├── {MASTER_JOBID}.{JOBID}.{rule}.{wildcards}.out
+│   ├── {MASTER_JOBID}.{JOBID}.{rule}.{wildcards}.err
+│   ...
+│   ..
+│   .
+├── snakemake.log # The snakemake log file which documents the entire pipeline log
+├── snakemake.log.jobby # Detailed summary report for each individual job. 
+└── snakemake.log.jobby.short # Short summary report for each individual job. 
+```
+
+### 8. `nciccbr`
+
+Contains Arriba resources for gene fusion estimation. These resources are manually curated and only exist for a few reference genomes (mm10, hg38, hg19).
+
+### 9. `preseq`
+
+Contains library complexity curves for each sample. These results are part of the multiQC report.
+
+### 10. `QC` and `rawQC`
+
+Contain per-sample output from FastQC for raw and adapter-trimmed FASTQ files, with insert size estimates. These results are part of the multiQC report.
+
+### 11. `QualiMap`
+
+Contains per-sample output from a quality-control step that assesses various post-alignment metrics and provides a secondary method to calculate insert size. These results are part of the multiQC report.
+
+### 12. `Reports`
+
+Contains the multiQC report (`multiqc_report.html`), which visually summarizes the quality-control metrics and other statistics for each sample. All the data tables used to generate the multiQC report are available in the `multiqc_data` folder. The `RNA_report.html` file is an interactive report that aggregates quality-control metrics across all samples. It allows users to identify problematic samples prior to downstream analysis, and it uses flowcell and lane information from the FastQ file.
+
+### 13. `resources`
+
+Contains resources necessary to run the RENEE pipeline.
+
+### 14. `RSeQC`
+
+Contains various QC metrics for each sample collected by RSeQC. These results are part of the multiQC report.
+
+### 15. `STAR_files`
+
+Contains log files, the splice junction file (`SJ.out.tab`), the `ReadsPerGene.out.tab` file, and various other output files generated by the STAR aligner for each sample.
+
+### 16. `trim`
+
+Contains the adapter-trimmed FASTQ files for each sample, which are used for all downstream analysis.
+
+```bash
+trim
+├── sample1.R1.trim.fastq.gz
+├── sample1.R2.trim.fastq.gz
+...
+..
+.
+├── sampleN.R1.trim.fastq.gz
+└── sampleN.R2.trim.fastq.gz
+
+```
+
+### 17. `workflow`
+
+Contains the RENEE pipeline workflow.
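
As a rough sanity check on the output layout documented in `docs/RNA-seq/output.md`, the sketch below reports which of the listed top-level folders are absent from an output directory. The function name `check_renee_output` is hypothetical and not part of RENEE. The folder names are copied from the tree listing, and the sketch assumes every folder in that listing is expected to exist for your run.

```bash
# Hypothetical helper (not part of RENEE): list documented top-level
# folders that are missing from a RENEE output directory.
check_renee_output() {
    outdir="$1"
    missing=0
    # Folder names copied from the tree listing in output.md.
    for name in bams config DEG_ALL FQscreen FQscreen2 fusions kraken \
                logfiles nciccbr preseq QC QualiMap rawQC Reports \
                resources RSeQC STAR_files trim workflow; do
        if [ ! -d "${outdir}/${name}" ]; then
            echo "missing: ${name}"
            missing=$((missing + 1))
        fi
    done
    echo "${missing} folder(s) missing"
    # Return success only when every documented folder is present.
    [ "${missing}" -eq 0 ]
}
```

For example, `check_renee_output renee_output/` prints one `missing:` line per absent folder followed by a count, and returns a non-zero status if anything is missing.
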
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
@@ -4,32 +4,21 @@ We have compiled this FAQ from the most common problems. If you are running into
 
 ## Job Status
 
-**Q. How do I know if RNA-seek finished running successfully?**
+**Q. How do I know if the RENEE pipeline finished running? How do I check the status of each job?**
 
-**A.** There are several different ways of checking the status of each job submitted to the cluster.  
-Here are a few suggestions:
+**A.** Once the pipeline has run to completion, you will receive an email with a header like
+
+`Slurm Job_id=xxxx Name=pl:renee Ended, Run time xx:xx:xx, COMPLETED, ExitCode 0`
+
+There are several ways to check the status of each individual job submitted to the cluster. Here are a few suggestions:
 
 !!! tldr "Check Job Status"
 
     === "Biowulf Dashboard"
 
         You can check the status of Biowulf jobs through your [user dashboard](https://hpc.nih.gov/dashboard/).
 
-        Each job that RNA-seek submits to the cluster starts with the `pl:` prefix.
-
-    === "Snakemake Log"
-
-        [Snakemake](https://snakemake.readthedocs.io/en/stable/) generates the following file, `Reports/snakemake.log`, in each pipeline's working directory. This file contains information about each job submitted to the job scheduler. If there are no problems, snakemake will report 100% steps done in those last few lines of the file.
-
-        You can take a peek of the end of the file by running the following command:
-        ```bash
-        tail -n30 Reports/snakemake.log
-        ```
-
-        Or more specifically, you can pull out the timestamps of the last few completed jobs like this:
-        ```bash
-        grep -A 1 done Reports/snakemake.log | tail
-        ```
+        Each job that RENEE submits to the cluster starts with the `pl:` prefix.
 
     === "Query Job Scheduler"
 
@@ -45,54 +34,54 @@ Here are a few suggestions:
         sjobs
         ```
 
-        Each job that RNA-seek submits to the cluster starts with the `pl:` prefix.
+        Each job that RENEE submits to the cluster starts with the `pl:` prefix.
 
-**Q. How do I identify failed jobs?**
 
-**A.** If there are errors, you'll need to identify which jobs failed and check its corresponding SLURM output file.
-The SLURM output file may contain a clue as to why the job failed.
+**Q. What if the pipeline finished running but I received a "FAILED" status? How do I identify failed jobs?**
 
-!!! tldr "Find Failed Jobs"
+**A.** If an error occurred during the run, the easiest way to diagnose the problem is to go to the `logfiles` folder within the RENEE output folder and look at the `snakemake.log.jobby.short` file. It contains three columns: jobname, state, and std_err. Jobs that completed successfully have the "COMPLETED" state, and jobs that failed have the "FAILED" state.
 
+!!! tldr "Find Failed Jobs"
     === "SLURM output files"
-
-        Quick and dirty method to search for failed jobs by looking through each job's output file:
+        
+        All failed jobs are listed with the absolute path to their error file (with the extension `.err`). Go through the error files (std_err) corresponding to the FAILED jobs to find out why each job failed.
 
         ```bash
-        grep -i 'fail' slurmfiles/slurm-*.out
-        ```
+        # Go to the logfiles folder within the renee output folder
+        cd renee_output/logfiles
 
-    === "Snakemake Log"
-
-        [Bash script]( https://github.com/CCBR/Tools/blob/master/Biowulf/get_slurm_file_with_error.sh) identify the SLURM ID of the first failed job and check if the output file exists.
+        # List the jobs that failed
+        grep "FAILED" snakemake.log.jobby.short | less
+        ```
+
 
-Many failures are caused by filesystem or network issues on Biowulf, and in such cases, simply re-starting the Pipeline should resolve the issue. Snakemake will dynamically determine which steps have been completed, and which steps still need to be run. If you are still running into problems after re-running the pipeline, there may be another issue. If that is the case, please feel free to [contact us](https://github.com/skchronicles/RNA-seek/issues).
+Many failures are caused by filesystem or network issues on Biowulf, and in such cases, simply re-starting the Pipeline should resolve the issue. Snakemake will dynamically determine which steps have been completed, and which steps still need to be run. If you are still running into problems after re-running the pipeline, there may be another issue. If that is the case, please feel free to [contact us](https://github.com/CCBR/RENEE/issues).
 
-**Q. How do I cancel ongoing RNA-seek jobs?**
+**Q. How do I cancel ongoing RENEE jobs?**
 
-**A.** Sometimes, you might need to manually stop a RNA-seek run prematurely, perhaps because the run was configured incorrectly or if a job is stalled. Although the walltime limits will eventually stop the workflow, this can take up to 5 or 10 days depending on the pipeline.
+**A.** Sometimes, you might need to manually stop a RENEE run prematurely, perhaps because the run was configured incorrectly or if a job is stalled. Although the walltime limits will eventually stop the workflow, this can take up to 5 or 10 days depending on the pipeline.
 
-To stop RNA-seek jobs that are currently running, you can follow these options.
+To stop RENEE jobs that are currently running, you can use one of these options.
 
 !!! tldr "Cancel running jobs"
 
     === "Master Job"
         You can use the `sjobs` tool [provided by Biowulf](https://hpc.nih.gov/docs/biowulf_tools.html#sjobs) to monitor ongoing jobs.
 
-        Examine the `NAME` column of the `sjobs` output, one of them should match `pl:rna-seek`. This is the "primary" job that orchestrates the submission of child jobs as the pipeline completes. Terminating this job will ensure that the pipeline is cancelled; however, you will likely need to unlock the working directory before re-running rna-seek again. Please see our instructions below in `Error: Directory cannot be locked` for how to unlock a working directory.
+        Examine the `NAME` column of the `sjobs` output; one of the entries should match `pl:renee`. This is the "primary" job that orchestrates the submission of child jobs as the pipeline completes. Terminating this job will ensure that the pipeline is cancelled; however, you will likely need to unlock the working directory before re-running renee. Please see our instructions below in `Error: Directory cannot be locked` for how to unlock a working directory.
 
         You can [manually cancel](https://hpc.nih.gov/docs/userguide.html#delete) the primary job using `scancel`.
 
         However, secondary jobs that are already running will continue to completion (or failure).  To stop them immediately, you will need to run `scancel` individually for each secondary job. See the next tab for a bash script that tries to automate this process.
 
     === "Child Jobs"
-        When there are lots of secondary jobs running, or if you have multiple RNA-seek runs ongoing simultaneously, it's not feasible to manually cancel jobs based on the `sjobs` output (see previous tab).
+        When there are lots of secondary jobs running, or if you have multiple RENEE runs ongoing simultaneously, it's not feasible to manually cancel jobs based on the `sjobs` output (see previous tab).
 
-        We provide [a script](https://github.com/CCBR/Tools/blob/master/Biowulf/cancel_snakemake_jobs.sh) that will parse the snakemake log file and cancel all jobs listed within.
+        We provide [a script](https://github.com/CCBR/Tools/blob/c3324fc0ad2f9858438c84bbb2f24927a8f3a220/scripts/cancel_snakemake_jobs.sh) that will parse the snakemake log file and cancel all jobs listed within.
 
         ```bash
         ## Download the script (to the current directory)
-        wget https://raw.githubusercontent.com/CCBR/Tools/master/Biowulf/cancel_snakemake_jobs.sh
+        wget https://raw.githubusercontent.com/CCBR/Tools/c3324fc0ad2f9858438c84bbb2f24927a8f3a220/scripts/cancel_snakemake_jobs.sh
 
         ## Run the script
         bash cancel_snakemake_jobs.sh /path/to/output/logfiles/snakemake.log
@@ -102,24 +91,24 @@ To stop RNA-seek jobs that are currently running, you can follow these options.
 
         This script will NOT cancel the primary job, which you will still have to identify and cancel manually, as described in the previous tab.
 
-Once you've ensured that all running jobs have been stopped, you need to unlock the working directory (see below), and re-run rna-seek to resume the pipeline.
+Once you've ensured that all running jobs have been stopped, you need to unlock the working directory (see below), and re-run RENEE to resume the pipeline.
 
 ## Job Errors
 
 **Q. Why am I getting `sbatch: command not found error`?**
 
-**A.** Are you running the `rna-seek` on `helix.nih.gov` by mistake. [Helix](https://hpc.nih.gov/systems/) does not have a job scheduler. One may be able to fire up the singularity module, initial working directory and perform dry-run on `helix`. But to submit jobs, you need to log into `biowulf` using `ssh -Y username@biowulf.nih.gov`.
+**A.** Are you running `renee` on `helix.nih.gov` by mistake? [Helix](https://hpc.nih.gov/systems/) does not have a job scheduler. You may be able to load the singularity module, initialize the working directory, and perform a dry-run on `helix`, but to submit jobs you need to log into `biowulf` using `ssh -Y username@biowulf.nih.gov`.
 
 **Q. Why am I getting a message saying `Error: Directory cannot be locked. ...` when I do the dry-run?**
 
-**A.** This is caused when a run is stopped prematurely, either accidentally or on purpose, or the pipeline is still running in your working directory. Snakemake will lock a working directory to prevent two concurrent pipelines from writing to the same location. This can be remedied easily by running `rna-seek unlock` sub command. Please check to see if the pipeline is still running prior to running the commands below. If you would like to cancel a submitted or running pipeline, please reference the instructions above.
+**A.** This happens when a run is stopped prematurely, either accidentally or on purpose, or when the pipeline is still running in your working directory. Snakemake locks a working directory to prevent two concurrent pipelines from writing to the same location. This can be remedied easily by running the `renee unlock` subcommand. Please check whether the pipeline is still running before running the commands below. If you would like to cancel a submitted or running pipeline, please refer to the instructions above.
 
 ```bash
 # Load Dependencies
 module load ccbrpipeliner
 
 # Unlock the working directory
-rna-seek unlock --output /path/to/working/dir
+renee unlock --output /path/to/working/dir
 ```
 
 **Q. Why am I getting a message saying `MissingInputException in line ...` when I do the dry-run?**