-
Notifications
You must be signed in to change notification settings - Fork 2
FAQ
sprokopec edited this page Jul 3, 2024
·
11 revisions
- Why did my jobs fail?
- How do I run variant calling on a single chromosome?
- I want to annotate my variants (SNV and INDELs) with OncoKB - how do I do this?
- How can I run MutSigCV on my variant calls?
- I want to filter my structural variants against known blacklists - how can I do this?
- I have single-end data; can pipeline-suite process this?
- How can I run germline (normal-only) samples?
- make sure you are not using --dry-run
- check the available disk space; if there is no space available, jobs cannot write to file and will then fail
- navigate to your PROJECTNAME/logs/run_DNA_pipeline_DATE/ and run the following:
joblist=""
for i in */slurm/s*out; do
jobid=$(basename $i | sed 's/slurm-//' | sed 's/.out//');
joblist="${joblist}${joblist:+,}$jobid";
done
sacct -j $joblist --format='JobID,State,JobName' -P | grep 'pughlab'
- from here, you can see which job specifically failed. Navigate to this tools' log directory (PROJECTNAME/TOOL/logs) and view the slurm_job_metrics.out file to see which job failed and why
- if any are OUT_OF_MEMORY or TIMEOUT, simply update the requested time/memory parameters in the dna_pipeline_config.yaml
- otherwise, if any are FAILED or CANCELLED, read the specific slurm output for that job (./*/slurm-JOBID.out)
- if this file is empty, jobs are either still running OR you have run out of disk space and the file couldn't be filled in
Other common reasons for failures:
- if no jobs ran successfully, there is probably an error in your YAML files (check the whitespace!)
- if some jobs ran successfully, it is possible that you've tried to run too many jobs at once (especially if running WGS with all possible tools) - you can only have 5000 jobs submitted at a time. Generally, you can confirm that this is the case if any tool doesn't contain an output_job_metrics_1/script.sh file in the logs directory.
- it's entirely possible that there is a bug in my code. Reach out and let me know!
- Some tools allow you to specify a set of desired chromosomes within the dna_pipeline_config.yaml file; indicate the desired chromosome, and set seq_type: wgs
- MuTect2, VarScan, GATK:CNV, NovoBreak (in development)
- For any other tool, the easiest approach would be to subset your BAMs to the desired chromosome and create a new data_config.yaml file. Again, be sure to set seq_type: wgs within the config file.
- The OncoKB annotator requires internet access which means it can't be run directory on H4H. However, you can mount H4H and run it locally or transfer your mutation data to Samwise and run it there.
- From H4H or Samwise, you can run the version of OncoKB located in /pughlab/bin.
- Or, download the most recent version from git: oncokb-annotator
- Obtain an OncoKB access key: apiAccess
# move from H4H to Samwise
ssh h4huhndata1
scp /path/to/PROJECTNAME/Report/plots/date_PROJECTNAME_ensemble_mutation_data.tsv xfer:/mnt/work1/users/home/username
exit
# from Samwise
module load python3
python3 /mnt/work1/users/pughlab/bin/oncokb-annotator/MafAnnotator.py \
-t ONCOTREE_CODE \
-b OncoKB key \
-r reference \
-i date_PROJECTNAME_ensemble_mutation_data.tsv \
-o date_PROJECTNAME_ensemble_oncokb_annotated.tsv
# move it back to H4H
ssh xfer
scp date_PROJECTNAME_ensemble_oncokb_annotated.tsv h4huhndata1:/cluster/home/username
exit
You can then pass this data into filter_ensemble_mutations.R to perform filtering and update the SNV landscape plot.
module load R/3.6.1
Rscript /path/to/pipeline-suite/scripts/report/filter_ensemble_mutations.R \
-p PROJECTNAME \
-o /path/to/PROJECTNAME/Report/plots \
-t exome (or wgs) \
-i /path/to/PROJECTNAME/Report/plots/date_PROJECTNAME_ensemble_oncokb_annotated.tsv \
-c /path/to/PROJECTNAME/Report/data/total_bases_covered.tsv \
-m /path/to/PROJECTNAME/Report/data/msi_estimates.tsv
- MutSigCV is a tool that identifies genes that are significantly mutated in cancer genomes, using a model with mutational covariates
- Input is a maf file containing SNVs and/or INDELs (ie, the ensemble_mutation_data.tsv file produced by the --summarize function)
- Below is a sample command using annotations for hg38:
# find significantly mutated genes using MutSigCV
module load MutSigCV/1.4
sh /cluster/tools/software/MutSigCV/1.4/run_MutSigCV.sh /cluster/tools/software/MCR/8.1/v81 \
/path/to/ensemble_mutation_data.tsv \
/cluster/projects/pughlab/references/MutSigCV/exome_full192.coverage.txt \
/cluster/projects/pughlab/references/MutSigCV/gene.covariates.txt \
/path/to/some/output/directory/OUTPUT_STEM \
/cluster/projects/pughlab/references/MutSigCV/chr_files_hg38
- FusionAnnotator uses the CTAT_HumanFusionLib source to annotate known fusions, as well as providing annotations for known blacklists. See here for more information.
- Below is a sample command using annotations for GRCh38 and filtering fusions from all probable blacklists.
# annotate fusions using FusionAnnotator
module load FusionAnnotator
FusionAnnotator \
--genome_lib_dir /cluster/projects/pughlab/references/STAR-Fusion/GRCh38_gencode_v31_CTAT_lib_Oct012019/ctat_genome_lib_build_dir/ \
--annotate /path/to/mavis/output/DATE_PROJECTNAME_sv_data_for_cbioportal.tsv \
--fusion_name_col 34 \
--full > /path/to/mavis/output/DATE_PROJECTNAME_sv_data_for_cbioportal_annotated.tsv
# remove entries annotated with known blacklists
cat DATE_PROJECTNAME_sv_data_for_cbioportal_annotated.tsv | grep -v -e 'Babiceanu' -e 'Greger' -e 'BodyMap' -e 'GTEx_recurrent_StarF2019' -e 'DGD_PARALOGS' -e 'HGNC_GENEFAM' -e 'ConjoinG' > DATE_PROJECTNAME_sv_data_for_cbioportal_annotated_filtered.tsv
- Not all tools included in this pipeline can accept single-end data (see below). To include fastqs with SE reads in the alignment step, see the instructions here: creating your fastq.yaml
- For variant calling, if you have a mix of SE and PE data, note that some tools will fail when encountering SE reads. You can either run as normal and deal with errors as they come, or exclude the SE input when running these tools. If you only have SE reads, set these tools to 'run: N' in the data config to avoid errors.
- Tools that will fail with SE data:
- Manta/Strelka (require paired-end reads)
- use pughlab_dnaseq_germline_pipeline.pl instead of pughlab_dnaseq_pipeline.pl
- preprocessing will still run BWA and GATK's indel realignment and BQSR
- qc will still run the coverage.pl and get_sequencing_metrics.pl steps
- variant_calling will still run where possible:
- HaplotypeCaller/GenotypeGVCFs/CPSR for germline variant detection
- MuTect, MuTect2 will run using artefactdetection (ie, used for PoN creation)
- Strelka, VarScan and VarDict will run in single-sample mode
- Manta and SViCT will run in single-sample mode
- Delly and GATK:CNV will run in germline-detection mode
- summarize and create report steps may not work (development in progress!)
module load perl
perl ~/git/pipeline-suite/pughlab_dnaseq_germline_pipeline.pl \
-t /path/to/dna_pipeline_config.yaml \
-d /path/to/dna_fastq_config.yaml \
--preprocessing \
--qc \
--variant_calling \
--summarize \
--create_report \
-c slurm \
--remove \
--dry-run { optional }