Tip: run R with R -q --vanilla --args <arguments> < script.R
instead of using Rscript
, so R will print out a full log of all
commands and messages in your script.
Required R packages: argparser, SeqArray, SeqVarTools, SNPRelate, GENESIS
From beluga1, use runRscript.sh. Use the -v
flag to
pass the name of the R script (after R=
) and any arguments to the
script (after args=
).
qsub -N jobname -v R="myscript.R",args="--arg1 arg1 --arg2 arg2" runRscript.sh
To run a script by chromosome, use the -t
flag to submit each
chromosome as a separate task in an array job.
qsub -N ld_pruning -v R="/acct/sdmorris/code/assoc_pipeline/ld_pruning.R",args="myfile.gds --maf 0.01 --missing 0.01" -t 1-22 runRscript.sh
To run a script in parallel, use the -l
flag to request multiple
processors on a node, in addition to the --num_cores
argument to the
R script.
qsub -N sample_qc -v R="/acct/sdmorris/code/assoc_pipeline/sample_qc.R",args="myfile.gds --num_cores 12" -l nodes=1:ppn=12 runRscript.sh
Use the SeqArray package to convert files.
If you have plink files (BED/BIM/FAM), use seqBED2GDS
. If you have VCF, use seqVCF2GDS
.
Example for VCF: http://bioconductor.org/packages/release/bioc/vignettes/GENESIS/inst/doc/assoc_test_seq.html#convert-vcf-to-gds
Script for BED: plink_to_gds.R
Outline of recommended QC steps for GWAS data: http://bioconductor.org/packages/release/bioc/vignettes/GWASTools/inst/doc/DataCleaning.pdf
The above document uses GWASTools, but SeqVarTools can be used instead.
Ideally one would use X and Y intensity values (array data) or X and Y read depth (sequencing data) to infer genetic sex. If these are not available, can use X chromosome heterozygosity and Y chromosome missingness to check sex.
When a dataset contains samples from multiple populations with different allele frequencies, it is recommended to run HWE in each population separately.
As an additional QC step, compare pairwise relatedness estimates from PC-Relate with expected values from the pedigree. Example code: https://github.com/UW-GAC/analysis_pipeline/blob/master/R/pedigree_check.R
This pipeline primarily uses R internally. It was built and tested on CentOS 7 and CentOS8 machines with R version 4.0.5, with the following package versions:
Biobase_2.50.0 BiocGenerics_0.36.1 GENESIS_2.20.1 gdsfmt_1.26.1 SeqArray_1.30.0 SeqVarTools_1.28.1 SNPRelate_1.24.0
(See here for Bioconductor installation instructions.)
argparser_0.7.1 coxmeg_1.0.13 dplyr_1.0.8 fastman_0.1.0 GGally_2.1.1 ggplot2_3.3.5 magrittr_2.0.2 RColorBrewer_1.1-2 Rcpp_1.0.8.2 tidyr_1.2.0
Additionally, this pipeline also has suggested dependencies PLINK v1.90b6.14 64-bit and RUTH (November 2020 release).
After installing the dependencies, on *NIX systems, git clone
this directory
to your system. The scripts should be executable, so run them like:
/path/to/pipeline/script.R arg1 arg2
Or, add the directory to your $PATH
environment variable and you can do
instead:
script.R arg1 arg2