This repository contains code for analyzing UK Biobank data.
In this section, we select the samples and sort the phenotype data.
The sqc data and the fam data list the samples in the same order, and both contain 488377 samples.
Output of STEP 1: QC samples: 488377
Select the samples from the combined sqc data.
The output of STEP 2:
Genotyping success: 487409
White British ancestry subset: 408972
Excess relatives: 188
Sex chromosome aneuploidy: 652
Used in PCA calculation: 407219
Redacted: 14
Samples Remaining: 377198
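The selection in STEP 2 can be sketched as a set of boolean filters applied to each sample. This is a minimal illustration on toy data, not the repository's code: the field names (`genotyped`, `white_british`, and so on) are hypothetical stand-ins for the sqc columns, and which flags act as inclusion or exclusion criteria is an assumption.

```python
# Hypothetical per-sample QC flags; the real sqc column names differ.
DEFAULTS = {
    "genotyped": True, "white_british": True, "used_in_pca": True,
    "excess_relatives": False, "sex_aneuploidy": False, "redacted": False,
}

# Assumption: a sample is kept if every "keep" flag is set
# and no "drop" flag is set.
KEEP_IF_TRUE = ("genotyped", "white_british", "used_in_pca")
DROP_IF_TRUE = ("excess_relatives", "sex_aneuploidy", "redacted")


def make_sample(sample_id, **flags):
    """Build a toy sample record, overriding the default flags."""
    return {"id": sample_id, **DEFAULTS, **flags}


def passes_qc(sample):
    """True if the sample satisfies all inclusion and exclusion flags."""
    return (all(sample[f] for f in KEEP_IF_TRUE)
            and not any(sample[f] for f in DROP_IF_TRUE))


def select_samples(samples):
    """Return the 0-based row indices of the samples that pass QC."""
    return [i for i, s in enumerate(samples) if passes_qc(s)]


samples = [
    make_sample(101),
    make_sample(102, white_british=False),
    make_sample(103, excess_relatives=True),
    make_sample(104),
    make_sample(105, used_in_pca=False),
]

selected = select_samples(samples)
print(f"Samples remaining: {len(selected)} of {len(samples)}")
```

Returning row indices rather than sample IDs keeps the result usable for subsetting any other data set that shares the sqc row order.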
Sort the phenotype data into the same order as the sqc data, because the genotype data follow the order of the sqc data.
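One way to do this reordering is to look up each phenotype row by sample ID while walking through the IDs in sqc order. The sketch below assumes the phenotype rows carry an identifier field called `eid` and that the sqc order is available as a list of IDs; both are illustrative assumptions, and the function name `sort_like` is hypothetical.

```python
def sort_like(reference_ids, phenotype_rows, id_key="eid"):
    """Reorder phenotype rows to follow reference_ids.

    Samples that have no phenotype row are returned as None so that
    the output keeps exactly the same length and order as reference_ids.
    """
    by_id = {row[id_key]: row for row in phenotype_rows}
    return [by_id.get(sample_id) for sample_id in reference_ids]


# Toy data: the sqc order differs from the phenotype file order,
# and sample 1004 has no phenotype record.
sqc_ids = [1003, 1001, 1002, 1004]
phenotype = [
    {"eid": 1001, "height": 170.0},
    {"eid": 1002, "height": 165.5},
    {"eid": 1003, "height": 181.2},
]

aligned = sort_like(sqc_ids, phenotype)
for sample_id, row in zip(sqc_ids, aligned):
    print(sample_id, row)
```

Keeping a placeholder for missing samples means row i of the aligned phenotype data always refers to row i of the sqc data.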
A description of the UK Biobank genetic data files: http://www.ukbiobank.ac.uk/wp-content/uploads/2017/07/ukb_genetic_file_description.txt
In this section, we get the summary data.
Load the data from the first section, including the phenotype, sqc and sqcNA data. Get the index of each sample. The selection criteria are sqc, phenotype and cross validation.
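As a rough illustration of how such an index could be built, the sketch below keeps the samples that pass sqc and have a non-missing phenotype, then splits them into training and test indices using a proportion and a random seed. The function name `split_sample_index` and its parameters are hypothetical, and the actual splitting rule used in this repository may differ.

```python
import random


def split_sample_index(sqc_pass, phenotype, prop, seed):
    """Return (train, test) lists of 0-based sample indices.

    sqc_pass  : list of bool, True if the sample passed sample QC
    phenotype : list of values, None marks a missing phenotype
    prop      : proportion of eligible samples assigned to training
    seed      : seed that makes the random split reproducible
    """
    eligible = [i for i, (ok, y) in enumerate(zip(sqc_pass, phenotype))
                if ok and y is not None]
    shuffled = eligible[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(prop * len(shuffled))
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


# Toy data: sample 1 has no phenotype and sample 2 fails sample QC.
sqc_pass = [True, True, False, True, True, True]
phenotype = [1.2, None, 0.7, 3.4, 2.2, 0.1]

train, test = split_sample_index(sqc_pass, phenotype, prop=0.75, seed=1)
print(f"{len(train)} training samples, {len(test)} test samples")
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.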
Using the function summ, we can easily get the summary data of all selected SNPs from the bgen format.
summ -maf maf_num -info info_num -hwe hwe_num -call call_num \
-thread thread_num -prop prop_num -seed seed_num \
-pheno pheno_file -sqc sqc_file -bgen bgen_file \
-outpath out_path -outfile out_file -chr chr_num -cv 0
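- -maf, -info, -hwe and -call: minimum thresholds for minor allele frequency (1e-3), information score (0.8), Hardy-Weinberg equilibrium p-value (1e-7) and call rate (0.8).
- -thread: the number of threads to run in parallel.
- -prop and -seed: the proportion of samples used as training data and the random seed.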
- -pheno: phenotype data (csv format).
- -sqc: sqc index data (csv format).
- -bgen: bgen data.
- -outpath and -outfile: output directory and output file name (txt format).
- -chr: chromosome number.
- -cv: cross-validation number.
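To make the SNP-level thresholds concrete, the sketch below computes the minor allele frequency and the call rate for a single SNP and checks them against the minimums listed above. It is a simplified illustration, not the implementation of summ: it assumes hard-called genotypes coded as 0, 1 or 2 copies of the alternative allele with `None` for a missing call, whereas bgen files store genotype probabilities. The information score and Hardy-Weinberg filters are left out, and the function names are hypothetical.

```python
def maf_and_call_rate(genotypes):
    """Return (minor allele frequency, call rate) for one SNP.

    genotypes: list of 0/1/2 alternative-allele counts, None if missing.
    """
    if not genotypes:
        return 0.0, 0.0
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if not called:
        return 0.0, call_rate
    # Each called sample contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq), call_rate


def passes_snp_qc(genotypes, min_maf=1e-3, min_call=0.8):
    """True if the SNP meets both the MAF and the call-rate minimum."""
    maf, call_rate = maf_and_call_rate(genotypes)
    return maf >= min_maf and call_rate >= min_call


snps = {
    "common": [0, 1, 2, 1, None, 0, 0, 1, 0, 0],           # kept
    "low_call": [0, 1, None, None, None, 0, 0, 0, 0, 0],   # call rate 0.7
    "monomorphic": [0] * 10,                               # MAF 0
}

for name, genotypes in snps.items():
    print(name, passes_snp_qc(genotypes))
```

Taking the minimum of the alternative-allele frequency and its complement means the filter does not depend on which allele is labelled as the alternative.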