Skip to content

Blue-Lab/assoc_pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Blue Lab pipeline

Tip: run R with R -q --vanilla --args <arguments> < script.R instead of using Rscript, so R will print out a full log of all commands and messages in your script.

Required R packages: argparser, SeqArray, SeqVarTools, SNPRelate, GENESIS

Running with the PBS job scheduler

From beluga1, use runRscript.sh. Use the -v flag to pass the name of the R script (after R=) and any arguments to the script (after args=).

qsub -N jobname -v R="myscript.R",args="--arg1 arg1 --arg2 arg2" runRscript.sh

To run a script by chromosome, use the -t flag to submit each chromosome as a separate task in an array job.

qsub -N ld_pruning -v R="/acct/sdmorris/code/assoc_pipeline/ld_pruning.R",args="myfile.gds --maf 0.01 --missing 0.01" -t 1-22 runRscript.sh

To run a script in parallel, use the -l flag to request multiple processors on a node, in addition to the --num_cores argument to the R script.

qsub -N sample_qc -v R="/acct/sdmorris/code/assoc_pipeline/sample_qc.R",args="myfile.gds --num_cores 12" -l nodes=1:ppn=12 runRscript.sh

Convert to GDS

Use the SeqArray package to convert files.

If you have plink files (BED/BIM/FAM), use seqBED2GDS. If you have VCF, use seqVCF2GDS.

Example for VCF: http://bioconductor.org/packages/release/bioc/vignettes/GENESIS/inst/doc/assoc_test_seq.html#convert-vcf-to-gds

Script for BED: plink_to_gds.R

QC

Outline of recommended QC steps for GWAS data: http://bioconductor.org/packages/release/bioc/vignettes/GWASTools/inst/doc/DataCleaning.pdf

The above document uses GWASTools, but SeqVarTools can be used instead.

Sample QC

Annotated vs genetic sex check

Ideally one would use X and Y intensity values (array data) or X and Y read depth (sequencing data) to infer genetic sex. If these are not available, can use X chromosome heterozygosity and Y chromosome missingness to check sex.

Missing call rate

sample_qc.R

Variant QC

Missing call rate and allele frequency

variant_qc.R

Hardy-Weinberg equilibrium

When a dataset contains samples from multiple populations with different allele frequencies, it is recommended to run HWE in each population separately.

hwe.R

Relatedness

http://bioconductor.org/packages/release/bioc/vignettes/GENESIS/inst/doc/assoc_test_seq.html#population-structure-and-relatedness

  1. ld_pruning.R
  2. pcs_and_grm.R

As an additional QC step, compare pairwise relatedness estimates from PC-Relate with expected values from the pedigree. Example code: https://github.com/UW-GAC/analysis_pipeline/blob/master/R/pedigree_check.R

Association testing

http://bioconductor.org/packages/release/bioc/vignettes/GENESIS/inst/doc/assoc_test_seq.html#association-tests

Dependencies

This pipeline primarily uses R internally. It was built and tested on CentOS 7 and CentOS8 machines with R version 4.0.5, with the following package versions:

Bioconductor

Biobase_2.50.0 BiocGenerics_0.36.1 GENESIS_2.20.1 gdsfmt_1.26.1 SeqArray_1.30.0 SeqVarTools_1.28.1 SNPRelate_1.24.0

(See here for Bioconductor installation instructions.)

CRAN

argparser_0.7.1 coxmeg_1.0.13 dplyr_1.0.8 fastman_0.1.0 GGally_2.1.1 ggplot2_3.3.5 magrittr_2.0.2 RColorBrewer_1.1-2 Rcpp_1.0.8.2 tidyr_1.2.0

Additionally, this pipeline also has suggested dependencies PLINK v1.90b6.14 64-bit and RUTH (November 2020 release).

Installation

After installing the dependencies, on *NIX systems, git clone this directory to your system. The scripts should be executable, so run them like:

/path/to/pipeline/script.R arg1 arg2

Or, add the directory to your $PATH environment variable and you can do instead:

script.R arg1 arg2

About

Scripts for preparing and running association tests

Resources

License

Stars

Watchers

Forks

Packages

No packages published