A simple and efficient tool for PCA and clustering analysis of population VCF files
The VCF2PCACluster article has been published in BMC Bioinformatics; please cite it if you use this tool.
PMID: 38693489 DOI:10.1186/s12859-024-05770-1
VCF2PCACluster is an easy-to-use tool for PCA, clustering analysis, and visualization based on VCF-formatted or genotype input.
Highlights:
- Results match those generated by TASSEL, GAPIT, and GCTA, differing only in numerical precision.
- Functions include: 1) five kinship estimation methods, 2) PCA, 3) clustering, 4) visualization
- Easy to use: users only need to provide a VCF input
- Memory-efficient: memory usage is independent of the number of SNPs
- Three clustering methods: K-Means, EM Gaussian, and DBSCAN
- Visualization in 2D or 3D plots
New versions of VCF2PCACluster are released and maintained at hewm2008/VCF2PCACluster. Please download the latest version from that repository.
VCF2PCACluster is for Linux/Unix/macOS only.
Before installing, please make sure the following prerequisites are available.
- OpenMP : C/C++ OpenMP support is recommended to be pre-installed
- g++ : g++ 4.8 or later with --std=c++11 support is recommended
- zlib : zlib > 1.2.3 is recommended
- R : R with ggplot2 and scatterplot3d is recommended
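A quick way to see which of these tools are already on the PATH is sketched below. It assumes a POSIX shell; `check_tool` is a hypothetical helper written for this illustration and is not part of VCF2PCACluster. It only checks that the commands exist, not their versions.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# Tools named in the prerequisite list (make is needed for Option 2).
for tool in g++ make R; do
    check_tool "$tool"
done
```

A "missing" line for R only affects plotting; the prerequisites are listed as recommended, not mandatory.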
Users can install it with the following options:
Option 1: use the provided static binary for Linux/Unix x64
git clone https://github.com/hewm2008/VCF2PCACluster.git
cd VCF2PCACluster; chmod 755 -R bin/*
./bin/VCF2PCACluster -h ### print help information
Option 2: compile from source code for Linux/Unix/macOS
git clone https://github.com/hewm2008/VCF2PCACluster.git
cd VCF2PCACluster ; chmod 755 configure ; ./configure;
make; # sh make.sh
mv VCF2PCACluster bin/; # [rm *.o]
Note: On macOS, run the following commands first. Please make sure g++-11 has been installed with Homebrew. This setup has been tested successfully on macOS Monterey with an Apple M1 chip.
ln -s /opt/homebrew/bin/g++-11 /opt/homebrew/bin/g++ ;
export PATH=/opt/homebrew/bin/:$PATH
Usage: VCF2PCACluster -InVCF in.vcf.gz -OutPut outPrefix [options]
-InVCF <str> Input SNP VCF Format
-InGenotype <str> InPut Genotype File
-InKinship <str> Input SNP K Kinship File Format
-OutPut <str> OutPut File Prefix(Kinship PCA etc)
-KinshipMethod <int> Method of Kinship [1-5],defaut [1]
1:Normalized_IBS(Yang/BaldingNicolsKinship)
2:Centered_IBS(VanRaden)
3:IBSKinshipImpute 4:IBSKinship 5:p_dis
-ClusterMethod <str> Method For Cluster[EM/Kmean/DBSCAN/None] [EM]
-Threads <int> threads to use [32]
-help Show more Parameters and help [hewm2008]
General usage:
### running without pop.info
# VCF2PCACluster -InVCF Khuman.vcf.gz -OutPut OUT
### running with pop.info
VCF2PCACluster -InVCF Khuman.vcf.gz -OutPut OUT -InSampleGroup pop.info
# For more help, please see the manual. [-i] is short for [-InVCF], and [-o] is short for [-OutPut].
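The `pop.info` file passed to `-InSampleGroup` holds one sample per line followed by its group. A minimal sketch of building such a file from a tab-separated panel with `cut` follows. The sample IDs and the `demo.panel` file are made up for illustration, and the tab separator is an assumption based on the `cut -f` command used in the benchmark example later in this document.

```shell
# Illustrative panel (made-up sample IDs): sample, population, super-population
printf 'S1\tCEU\tEUR\nS2\tCHB\tEAS\nS3\tYRI\tAFR\n' > demo.panel

# Keep column 1 (sample) and column 2 (population) as "sample group"
cut -f 1,2 demo.panel > pop.info
cat pop.info
```

Choosing column 3 instead of column 2 would group samples by super-population, which gives fewer and coarser groups in the plots.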
Usage: VCF2PCACluster -InVCF in.vcf.gz -OutPut outPrefix [options]
-InVCF <str> Input SNP VCF Format
-InKinship <str> Input SNP K Kinship File Format
-OutPut <str> OutPut File Prefix(Kinship PCA etc)
-KinshipMethod <int> Method of Kinship [1-5],defaut [1]
1:Normalized_IBS(Yang/BaldingNicolsKinship)
2:Centered_IBS(VanRaden)
3:IBSKinshipImpute 4:IBSKinship 5:p_dis
-ClusterMethod <str> Method For Cluster[EM/Kmean/DBSCAN/None] [EM]
-Threads <int> threads to use [32]
-help v1.41 Show more Parameters and help [hewm2008]
InFile:
-InGenotype <str> InPut Genotype File for no VCF file
-InSubSample <str> Only keep samples from subsample List for PCA[ALLsample]
-InSampleGroup <str> InFile of sample Group info,format(sample groupA)
SNP Filtering:
-MAF <float> Min minor allele frequency filter [0.001]
-Miss <float> Max ratio of miss allele filter [0.25]
-Het <float> Max ratio of het allele filter [1.00]
-HWE <float> Exact test of Hardy-Weinberg Equilibrium for SNP Pvalue[0]
-Fchr <str> Filter the chrX chr[chrX,chrY,X,Y]
-KeepRemainVCF keep the VCF after filter
Clustering:
-RandomCenter Random diff-center to Re-Run Cluster for Kmean
-BestKManually <int> manually set the Best K (Num of Cluster) (auto)
-BestKRatio <float> Get the best K Cluster by deta-SSE Ratio[0.15]
-MinPointNum <int> Minimum point number of D-cluster[4]
-Epsilon <float> Epsilon for DBSCAN_Distance/EM_convergence (auto)
-Iterations <int> iterations number for EM clustering[1000]
OutPut:
-PCnum <int> Num of PC eig [10]
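To make the `-MAF`, `-Miss`, and `-Het` filters above more concrete, here is a rough `awk` sketch that computes per-site minor allele frequency, missing rate, and heterozygosity from a VCF. It is one plausible reading of those statistics and not the tool's own implementation. It assumes diploid, biallelic sites with GT as the first FORMAT field, and `demo.vcf` is a made-up one-site file.

```shell
# Tiny made-up VCF: 1 site, 4 samples (one missing genotype)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4\n' > demo.vcf
printf 'chr22\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1\t./.\n' >> demo.vcf

awk -F'\t' '
!/^#/ {
    ref = 0; alt = 0; miss = 0; het = 0; n = NF - 9
    for (i = 10; i <= NF; i++) {
        split($i, f, ":")          # GT is assumed to be the first subfield
        split(f[1], a, "[/|]")     # alleles are separated by "/" or "|"
        if (a[1] == "." || a[2] == ".") { miss++; continue }
        if (a[1] != a[2]) het++
        for (j = 1; j <= 2; j++) { if (a[j] == "0") ref++; else alt++ }
    }
    called = ref + alt
    maf = 0
    if (called > 0) { maf = (alt < ref ? alt : ref) / called }
    hetrate = 0
    if (n > miss) { hetrate = het / (n - miss) }
    printf "%s\t%s\tMAF=%.3f\tMiss=%.3f\tHet=%.3f\n", $1, $2, maf, miss / n, hetrate
}' demo.vcf > site.stats
cat site.stats
```

Under the default thresholds listed above, a site would be dropped if its MAF were below 0.001 or its missing rate above 0.25. The exact denominators the tool uses may differ from this sketch.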
VCF2PCACluster also provides two custom scripts (Plot2Deig and Plot3Deig) for 2D and 3D plots. Their main parameters are as follows:
perl Plot2Deig/Plot3Deig -h
Version:1.41 hewm2008@gmail.com
Usage: Plot2Deig/Plot3Deig -InFile pca.eigenvec -OutPut Fig
Options
-InFile <s> : InPut PCA.eigenvec File
-OutPut <s> : OutPut svg file result
-help : Show more help with more parameter
-ColShap : colour <=> shape for cluster or group
-ShowEval : Show eval%(PC percentages) on the fig
-Columns <s> : the columns to plot a:b [4:5]
-ColorBrewer <s> : the color brewer for points [Dark2]
-Title <s> : title (legend) [PCA]
-BinDir <s> : The Bin Dir of gnuplot/R/convert [$PATH]
hewm2008@gmail.com / hewm2008@qq.com
join the QQ Group : 125293663
outFile | Description |
---|---|
out.kinship | Kinship matrix file |
out.eigenvec | the best clustering and PCA result |
out.eigenval | PCA eigen values |
out.PC1_PC2.pdf | PCA and clustering 2D plot |
out.PC1PC2PC3.pdf | PCA and clustering 3D plot |
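After a run, a small check like the following can confirm which of the files in the table were produced. `check_outputs` is a hypothetical helper, and the suffixes are taken directly from the table. Actual file names may differ by version or kinship method. For instance, the benchmark script later in this document reads `OUT1.Normalized_IBS.Kinship`.

```shell
# Hypothetical helper: list which expected outputs exist (and are non-empty)
# for a given output prefix.
check_outputs() {
    prefix=$1
    for suffix in kinship eigenvec eigenval PC1_PC2.pdf PC1PC2PC3.pdf; do
        if [ -s "${prefix}.${suffix}" ]; then
            echo "ok       ${prefix}.${suffix}"
        else
            echo "missing  ${prefix}.${suffix}"
        fi
    done
}

check_outputs out
```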
See more detailed usage in the Chinese Documentation or the English Documentation.
../../bin/VCF2PCACluster -InVCF in.vcf.gz -OutPut outPrefix
Two examples are provided in the Example/example* directories.
- Example 1) a small test dataset
We randomly selected 1,194 SNPs on chromosome 22 (chr22) from the 1000 Genomes Project, with 203 samples including CEU (49), CHB (46), JPT (56), and YRI (52), for analysis.
PCA and EM Gaussian clustering plot using PC1 and PC2
PCA and EM Gaussian clustering plot using PC1 and PC3
PCA and EM Gaussian clustering plot using PC1,PC2, and PC3
- Example 2) a large test dataset
To test the accuracy and efficiency of VCF2PCACluster, we downloaded data from the 1000 Genomes Project and benchmarked it against other software using chr22 (the smallest chromosome SNP dataset: 2,504 samples and 1,055,401 SNPs).
The results match those generated by TASSEL and gcta64; please see the manual for more details.
Running time: ~12.5 min with 8 threads.
Memory usage is about 0.1 GB. When we tested the VCF for all of chr1-22 (81,271,745 sites), VCF2PCACluster still used about 0.1 GB, whereas PLINK2 exceeded 200 GB and returned an error.
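The timing above was measured with 8 threads, while the help text lists 32 as the default for `-Threads`. A small sketch for matching the thread count to the machine is shown below. It assumes `getconf _NPROCESSORS_ONLN` is supported (it is on common Linux and macOS systems), and `in.vcf.gz` and `outPrefix` are the placeholders from the usage line.

```shell
# Detect online CPUs; fall back to 1 if getconf cannot report them.
THREADS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "using ${THREADS} threads"

# Run only if the binary is on PATH; otherwise just report and continue.
if command -v VCF2PCACluster >/dev/null 2>&1; then
    VCF2PCACluster -InVCF in.vcf.gz -OutPut outPrefix -Threads "$THREADS"
else
    echo "VCF2PCACluster not found on PATH; skipping run"
fi
```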
echo Start Time :
date
## download the real data ###
#wget -c https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
#wget -c https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
# cut -f 1,3 integrated_call_samples_v3.20130502.ALL.panel > sample.group ; gzip sample.group
time ../../bin/VCF2PCACluster -InVCF ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz -InSampleGroup sample.group.gz -OutPut OUT1
## to re-set the best K (4--->5)
time ../../bin/VCF2PCACluster -InKinship OUT1.Normalized_IBS.Kinship -InSampleGroup sample.group.gz -OutPut OUT2 -BestKManually 5
echo End Time :
date
PCA and clustering result: the correlation coefficient between the prior group labels and the cluster assignments is 0.995, calculated with the cor function in R.
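R's `cor` computes the Pearson correlation by default. For readers without R at hand, the same statistic can be sketched in `awk`. This assumes the group labels and cluster labels have already been coded as integers in a two-column, tab-separated file. `labels.tsv` and its contents are made up for illustration, and `pearson` is a hypothetical helper.

```shell
# Made-up example: column 1 = prior group code, column 2 = cluster code
printf '1\t1\n1\t1\n2\t2\n2\t2\n3\t3\n3\t2\n' > labels.tsv

# Hypothetical helper: Pearson correlation of columns 1 and 2 of a TSV file.
pearson() {
    awk -F'\t' '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
        cov = n * sxy - sx * sy
        vx  = n * sxx - sx * sx
        vy  = n * syy - sy * sy
        printf "r=%.3f\n", cov / sqrt(vx * vy)
    }' "$1"
}

pearson labels.tsv
```

Note that the value depends on how the labels are coded as numbers, so the coding of groups and clusters should be aligned before comparing.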
- Fast, with low memory usage
- Simple and easy to use (-i -o)
- Five kinship estimation methods
- Three clustering methods
- No installation required (static binary provided)
- Only one step from VCF to the final plot
- 2D or 3D plots of PCA and clustering results
If you have any questions, please
- email to hewm2008@gmail.com or hewm2008@qq.com
- join the QQ Group : 125293663
######################swimming in the sky and flying in the sea #############################