The pipeline is run: nextflow run utils/build_example_data/main.nf
The key options are:
-
output_dir
: output direction -
output
: [default "out"] -
pos_allgeno
: file contains chromosome and position that will be extracted from vcf file, each lines is chromosome and position [need, no default] -
list_chro
: list chro extracted fromm 1000 genome [1-22,X : from 1 to 22 and X chromosome] -
ftp1000genome : download 1000 genome for build dataset no : 0 yes :1 [default :1]
-
list_vcf
: file contains list of vcf if ftp1000genome is no, -
information relative of gwas catalog to extract:
-
gwas_cat
: file of gwas catalog, for the moment just uscs format is allowed (not inplemented) -
gwas_cat_ftp
: file to download gwas catalog -
list_pheno
: list pheno extracted from gwas catalog, split by one comma each pheno (default : "Type 2 diabetes") -
extraction and simulation :
-
simu_hsq
: variance explained of disease by genetics [default 0.3] -
simu_k
: Specify the disease prevalence for binary phenotype [default 0.1] -
simu_rep
: repetition number using same snps and heriability effect [default 10] -
used_effect
: two behavious for effect, used gwas catalog effect (option 1) or simulated value using a normal law (option 0) [default : 1] -
clump position of gwas catalog :
-
clump_p1
[default :0.0001] -
clump_p2
[default :0.01] -
clump_r2
[ default : 0.50] -
clump_kb
:[ default : 250 ] -
other :
-
plk_cpus
: cpus number used for plink and gcta[default 10] -
gcta_bin
: binary for gcta [default gcta64]
awk '{print $1"\t"$4}' data/array_plk/array.bim > utils/list_posarray
nextflow run ~/Travail/git/h3agwas//utils/build_example_data/main.nf -profile slurmSingularity --pos_allgeno utils/list_posarray -resume --nb_snp 3 --output_dir simul_gcta_main
need tabix, TODO tested for singularity image: yes
Pipeline used same argument than previous pipeline without download 1000 genomes
- Data and command line can be found h3agwas-examples
nextflow run ~/Travail/git/h3agwas/utils/build_example_data/simul-assoc_gcta.nf -profile slurmSingularity --input_dir data/imputed/ --input_pat imput_data --output_dir simul_gcta
This section describes a pipeline in devlopment, purpose of this pipeline is to estimate false positive and false negative with simulated phenotype, Our script, assoc/simul-assoc.nf takes as input PLINK files that have been through quality control and
- Simulate quantitative phenotypes with phenosim based on genetics data
- perform a GWAS on phenotype simulated using gemma, boltlmm.
- Perform summary statistics.
a version of phenosim adapted is already in nextflow binary, write in python2. plink, gemma and bolt must be installed
The key options are:
work_dir
: the directory in which you will run the workflow. This will typically be the h3agwas directory which you cloned;- input, output and script directories: the default is that these are subdirectories of the
work_dir
and there'll seldom be reason to change these; input_pat
: this typically will be the base name of the PLINK files you want to process (i.e., do not include the file suffix). But you could be put any Unix-style glob here. The workflow will match files in the relevantinput_dir
directory;num_cores
: cores number usedph_mem_req
: memory request for phenosim- Simulation option :
phs_nb_sim
: simulation number (default : 5)phs_quant_trait
: quantitative trait simulation : 1, qualitative not develop yet (default : 1, -q option in phenosim)- Quantitative trait option :
ph_nb_qtl
: number of simulated QTN (default: 2, option -n in phenosim)ph_list_qtl
: proportion of variance explained by each QTNs, separate the values by commas (default : 0.05 -q in phenosim)ph_maf_r
: MAF range for causal markers (upper and lower bound, separated by a comma, no space) (default: 0.05,1.0, -maf_r in phenosim)- option to do a linear transformation of phenotype with co factor of external data and normatisation:
- ph_normalise : perform a normalisation (1) or not 0 (Default)
- each phenotype i be normalise using newpheno = norm(pheno)+var0ia+var1ib+ ... + intercept
ph_cov_norm
: contains coefficients for relation separed by a comma (ex "sex=0.2,age=-0.1)data
: contains cofactor data for each individuals used to normalise withph_cov_range
: normalisation range for initial phenotypeph_intercept
: intercept
- Association option :
boltlmm
: 1 perform boltlmm (default 0), see boltlmm option in assoc/main.nfgemma
: 1 perform gemma (default 0) see gemma option in assoc/main.nfcovariates
: covariates to include in model (if ph_normalise is 1)
- Statistics option :
ph_alpha_lim
: list of alpha used to computed significance (separated by comma)ph_windows_size
: windows size around position used to simulate phenotype to define if was detected, in bp ex 1000bp in CM ex 0.1CM
different output is provided :
- simul folder : contains position used to defined phenotype
- in boltlmm/gemma folder, res_boltlmm/gemma.stat contains summary stat for each alpha:
- we defined
windows true
as the windows around snp used to build phenotype (size is defined previously) nsig_simall_alpha
: number significant snp in all windows truensig_sim_alpha
: number windows true where at least one snps is significantnsig_simaround_alpha
: number significant windows true where one snp is significant and has been excluded snps used to build phenonsig_nosim_alpha
: snp significant snp not in windows truensnp
: snp number total in datasetnsnpsima
: snp number used to build phenotype (see ph_nb_qtl)
- we defined
- in boltlmm/gemma/simul/ : contains p.value compute for each simulation
###Note
- for phenotype simulation all missing values is discarded and replaced by more frequent allele
- phenosim use a lot of memory and time, subsample of snp/samples improve times / memory used
- Data and command line can be found h3agwas-examples
nextflow run ~/Travail/git/h3agwas/utils/build_example_data/simul-assoc_phenosim.nf -profile slurmSingularity --ph_normalise 0 --input_dir data/imputed/ --input_pat imput_data --gemma 1