|
| 1 | + |
1 | 2 | # MingPCACluster
|
2 |
| -A new simple and efficient software to PCA and Cluster For popolation VCF File or STOmics gem File |
| 3 | +A simple and efficient tool for PCA and clustering of population VCF files or STOmics gem files |
| 4 | + |
| 5 | +### 1 Introduction |
| 6 | +<b>MingPCACluster</b> is PCA and clustering software built around VCF input. It also accepts Genotype and other formats, and adds beta support for spatiotemporal single-cell expression files (xx.gem.gz). |
| 7 | +Given a single input file, it runs PCA, grouping, and plotting in one step. |
| 8 | +</br> |
| 9 | +</br> keyword : VCF2PCA ; VCF2Kinship ; cluster; k-means ; cellbin ; STOmics |
| 10 | + |
| 11 | +</br>Highlights: |
| 12 | +</br>1 The results agree with [tassel](https://www.maizegenetics.net/tassel), [gapit](https://zzlab.net/GAPIT/) and [gcta](https://yanglab.westlake.edu.cn/software/gcta/#Overview); they differ only in numerical precision. |
| 13 | +</br>2 Functions: (1) several kinds of kinship matrix, (2) PCA results, (3) clustering results, and (4) plots colored by cluster. |
| 14 | +</br>3 One VCF in, everything done in one step, which is convenient for users. |
| 15 | +</br>4 Computation happens while the input is being read, so memory use does not depend on the number of sites (for spatiotemporal data, on the number of genes) and depends only on the number of samples; more than 100k samples should therefore also work. Spatiotemporal cell PCA and clustering are built on this basis, although in spatiotemporal omics it is mainly the sample count that is large (80K samples: 60G memory). |
| 16 | +</br>5 K-means clustering that also finds the best K, as Structure does with its K value. Plots are colored by the resulting clusters. |
| 17 | +</br>6 A small plotting script is provided and can be used to refine the figures. |
| 18 | + |
| 19 | +</br> |
| 20 | + |
| 21 | +</br>The program is aimed at users with some bioinformatics background; it may be hard to follow for complete beginners. |
| 22 | +</br> |
| 23 | + |
| 24 | + |
| 25 | + |
| 26 | +### 2 Download and Install |
| 27 | +------------ |
| 28 | +The <b>new version</b> will be updated and maintained in <b>[hewm2008/MingPCACluster](https://github.com/hewm2008/MingPCACluster)</b>; please use the link below to download the latest version |
| 29 | +</br><p align="center"><b>[hewm2008/MingPCACluster](https://github.com/hewm2008/MingPCACluster)</b></p> |
| 30 | + |
| 31 | +<b> 2.1. Linux/macOS [Download](https://github.com/hewm2008/MingPCACluster/archive/v1.00.tar.gz)</b> |
| 32 | + |
| 33 | + </br> <b>2.2 Pre-install</b> |
| 34 | + </br> MingPCACluster is for Linux/Unix/macOS only. |
| 35 | + </br>Before installing, please make sure the following prerequisites are available. |
| 36 | + </br> 1) [convert](https://linux.die.net/man/1/convert) command is recommended to be pre-installed, although it is not required |
| 37 | + </br> 2) g++ : g++ 4.8 or later with [--std=c++11](https://gcc.gnu.org/) support is recommended |
| 38 | + </br> 3) zlib : [zlib](https://zlib.net/) > 1.2.3 is recommended |
| 39 | + </br> 4) R : [R](https://www.r-project.org/) with [ggplot](http://ggplot.yhathq.com/) is recommended |
| 40 | + |
| 41 | +</br> <b>2.3 Install</b> |
| 42 | +</br> Users can install it with the following options: |
| 43 | +</br> Option 1: |
| 44 | +<pre> |
| 45 | + git clone https://github.com/hewm2008/MingPCACluster.git |
| 46 | + cd MingPCACluster; chmod -R 755 bin/* |
| 47 | + ./bin/MingPCACluster -h |
| 48 | +</pre> |
| 49 | + |
| 50 | + |
| 51 | +### 3 Parameter description |
| 52 | +------------ |
| 53 | +</br><b>3.1 MingPCACluster</b> |
| 54 | +</br><b>3.1.1 Main parameter</b> |
| 55 | + |
| 56 | +```php |
| 57 | + |
| 58 | + Usage: Ming2PCACluster -InVCF <in.vcf.gz> -OutPut <outPrefix> |
| 59 | + |
| 60 | + -InVCF <str> Input SNP VCF Format |
| 61 | + -InGenotype <str> InPut Genotype File |
| 62 | + -InSTOgem <str> InPut STOmics gem File of MIDCounts(beta) |
| 63 | + -InKinship <str> Input SNP K Kinship File Format |
| 64 | + -OutPut <str> OutPut File Prefix(Kinship PCA etc) |
| 65 | + |
| 66 | + |
| 67 | + -SubPop <str> SubGroup Sample File List[ALLsample] |
| 68 | + -Method <int> Method of Kinship [1-4],defaut [1] |
| 69 | + 1:BaldingNicolsKinship(VanRaden/Normalized_IBS) |
| 70 | + 2:IBSKinshipImpute 3:IBSKinship 4:p_dis |
| 71 | + |
| 72 | + -help Show more Parameters and help [hewm2008] |
| 73 | + |
| 74 | + |
| 75 | +``` |
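The `-Method 1` option above names a VanRaden-style kinship, and the highlights state that the program computes while reading so that memory depends only on the number of samples. A minimal sketch of that idea follows, using the commonly cited VanRaden formulation (centered dosages, divided by the sum of 2p(1-p) over sites). This is an illustration under stated assumptions, not the program's actual code: the function name, the 0/1/2 dosage encoding, and the mean-imputation of missing calls are all hypothetical choices.

```python
def streaming_vanraden(sites, n_samples):
    """Accumulate a VanRaden-style kinship matrix one site at a time.

    `sites` yields one list per site holding n_samples dosages
    (0, 1 or 2 copies of the alternate allele; None = missing call).
    Only the n x n accumulator is kept in memory, never the full
    site-by-sample matrix.
    """
    acc = [[0.0] * n_samples for _ in range(n_samples)]
    denom = 0.0
    for dosages in sites:
        called = [d for d in dosages if d is not None]
        if not called:
            continue
        p = sum(called) / (2.0 * len(called))  # alternate allele frequency
        if p == 0.0 or p == 1.0:
            continue  # monomorphic site carries no information
        # Center each dosage by 2p; a missing call contributes 0
        # (assumption: equivalent to imputing the site mean).
        z = [0.0 if d is None else d - 2.0 * p for d in dosages]
        for i in range(n_samples):
            zi = z[i]
            row = acc[i]
            for j in range(n_samples):
                row[j] += zi * z[j]
        denom += 2.0 * p * (1.0 - p)
    if denom == 0.0:
        raise ValueError("no polymorphic sites in input")
    return [[value / denom for value in row] for row in acc]


# Toy input: 3 sites x 4 samples (made-up genotypes).
toy_sites = [
    [0, 1, 2, 1],
    [2, 2, 0, None],
    [1, 0, 1, 2],
]
kin = streaming_vanraden(iter(toy_sites), 4)
```

Because `sites` is consumed as an iterator, the same function could be fed from a line-by-line VCF reader; the accumulator size stays at n x n regardless of how many sites are read.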
| 76 | +</br> Brief usage notes: |
| 77 | +<pre> |
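| 78 | + # Usage is straightforward: at minimum, one input and one output are needed |
| 79 | + # See the PDF for input file formats; the main ones are VCF and gem files |
| 80 | + # More notes will be posted on Zhihu later |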
| 81 | + |
| 82 | + Ming2PCACluster -InSTOgem Test.gem.gz -OutPut Test -CellBin 100 |
| 83 | + |
| 84 | + ### run without pop.info |
| 85 | + # Ming2PCACluster -InVCF Khuman.vcf.gz -OutPut OUT |
| 86 | + ### run with pop.info |
| 87 | + Ming2PCACluster -InVCF Khuman.vcf.gz -OutPut OUT -InSampleGroup pop.info |
| 88 | + |
| 89 | +</pre> |
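The `pop.info` file used above is described in the detailed parameters only as `format(sample groupA)`, i.e. one sample name and one group label per line. A small sketch for producing such a file is shown below. The tab separator and the sample names are assumptions made for illustration; the group labels are the populations used in Example 1.

```python
import os
import tempfile


def write_sample_groups(path, sample_to_group):
    """Write one 'sample<TAB>group' line per sample (separator is an assumption)."""
    with open(path, "w") as handle:
        for sample, group in sample_to_group.items():
            handle.write(f"{sample}\t{group}\n")


def read_sample_groups(path):
    """Read the two-column file back, splitting on any whitespace."""
    sample_to_group = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                sample_to_group[fields[0]] = fields[1]
    return sample_to_group


# Hypothetical sample names; replace with the sample IDs from the VCF header.
groups = {
    "sample01": "CEU",
    "sample02": "CHB",
    "sample03": "JPT",
    "sample04": "YRI",
}
path = os.path.join(tempfile.mkdtemp(), "pop.info")
write_sample_groups(path, groups)
loaded = read_sample_groups(path)
```

The sample names in the first column would need to match the sample IDs in the input VCF for the grouping to be applied.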
| 90 | + |
| 91 | +</br><b>3.1.2 Detail parameters</b> |
| 92 | +```php |
| 93 | + |
| 94 | + Usage: Ming2PCACluster -InVCF <in.vcf.gz> -OutPut <outPrefix> |
| 95 | + |
| 96 | + -InVCF <str> Input SNP VCF Format |
| 97 | + -InGenotype <str> InPut Genotype File |
| 98 | + -InSTOgem <str> InPut STOmics gem File of MIDCounts(beta) |
| 99 | + -InKinship <str> Input SNP K Kinship File Format |
| 100 | + -OutPut <str> OutPut File Prefix(Kinship PCA etc) |
| 101 | + |
| 102 | + |
| 103 | + -SubPop <str> SubGroup Sample File List[ALLsample] |
| 104 | + -Method <int> Method of Kinship [1-4],defaut [1] |
| 105 | + 1:BaldingNicolsKinship(VanRaden/Normalized_IBS) |
| 106 | + 2:IBSKinshipImpute 3:IBSKinship 4:p_dis |
| 107 | + |
| 108 | + -help Show more Parameters and help [hewm2008] |
| 109 | + |
| 110 | + |
| 111 | + -MAF <float> Min minor allele frequency filter [0.001] |
| 112 | + -Fchr <str> Filter the chrX chr[chrX,chrY,X,Y] |
| 113 | + -Miss <float> Max ratio of miss allele filter [0.25] |
| 114 | + -Het <float> Max ratio of het allele filter [1.00] |
| 115 | + -HWE <float> Exact test of Hardy-Weinberg Equilibrium for SNP Pvalue[0] |
| 116 | + -CellBin <int> STOmics cell bin[50] |
| 117 | + -KeepRemainVCF keep the VCF after filter |
| 118 | + |
| 119 | + -InSampleGroup <string> In File of sampleGroup info,format(sample groupA) |
| 120 | + |
| 121 | + -PCANum <int> Num of PCA eig [10] |
| 122 | + -MaxCluNum <int> Max Cluster Num to find Best K [12] |
| 123 | + -BestKRatio <float> Get the best K Cluster by deta-SSE Ratio[0.1] |
| 124 | + -STOName <string> STOmics Sample Name STOName |
| 125 | + |
| 126 | + |
| 127 | +``` |
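The `-MaxCluNum` and `-BestKRatio` options above say that the best K is chosen from a "deta-SSE Ratio", without giving the exact rule. One plausible reading, shown here purely as an assumption, is an elbow criterion: compute the k-means sum of squared errors (SSE) for K = 1..MaxCluNum and keep the smallest K after which one more cluster reduces SSE by less than the given ratio. The function name and the SSE values are made up for illustration.

```python
def best_k_by_sse_drop(sse, ratio=0.1):
    """Pick K with an elbow rule on relative SSE improvement.

    sse[i] is the k-means SSE obtained with K = i + 1 clusters.
    Returns the smallest K for which going to K + 1 lowers SSE by
    less than `ratio` (relative to the SSE at K). This is only one
    interpretation of a delta-SSE ratio, not the program's rule.
    """
    for i in range(len(sse) - 1):
        if sse[i] <= 0:
            return i + 1  # already a perfect fit
        relative_drop = (sse[i] - sse[i + 1]) / sse[i]
        if relative_drop < ratio:
            return i + 1
    return len(sse)


# Made-up SSE curve for K = 1..6 with an elbow at K = 4.
example_sse = [1000.0, 520.0, 260.0, 110.0, 104.0, 100.0]
chosen_k = best_k_by_sse_drop(example_sse, ratio=0.1)
```

With this reading, a larger ratio stops earlier and yields a smaller K, while a smaller ratio tolerates weaker improvements and yields a larger K.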
| 128 | + |
| 129 | +</br><b>3.2 Other parameters</b> |
| 130 | +</br>The package also ships a Perl plotting script (it will be reworked substantially later; the author has been busy recently). Its parameters in brief: |
| 131 | + |
| 132 | +```php |
| 133 | +ploteig -h |
| 134 | + |
| 135 | + Version:1.0 hewm2008@gmail.com |
| 136 | + |
| 137 | + Usage: ploteig -InPCA pca.eigenvec -OutPrefix Fig |
| 138 | + |
| 139 | + |
| 140 | + Options |
| 141 | + |
| 142 | + -InPCA <s> : InPut File of PCA |
| 143 | + -OutPrefix <s> : OutPut file prefix |
| 144 | + |
| 145 | + -BinDir <s> : The Bin Dir of gnuplot/R/ps2pdf/convert [$PATH] |
| 146 | + |
| 147 | + -help : Show more help [hewm2008] |
| 148 | + |
| 149 | + -columns <s> : the columns to plot a:b [3:4] |
| 150 | + -pops <s> : Populations to plot, eg -p GA:GB:GC [ALL] |
| 151 | + -border <i> : how to plot the border (1,2,4,8,3,31 ) [3] |
| 152 | + -title <s> : title (legend) [PCA] |
| 153 | + -keystyle <s> : put key at top right default(in) [outside]box [outside] |
| 154 | + -pointsize <i> : point size for plot [3] |
| 155 | + |
| 156 | + |
| 157 | +``` |
| 158 | + |
| 159 | +</br><b>3.3 Output files</b> |
| 160 | + |
| 161 | + |
| 162 | +|Module | outFile | Description | |
| 163 | +|:-----:|:-------------------|:------------------------------------------------------------| |
| 164 | +| List | | | |
| 165 | +| |out.kinship |Kinship matrix: pairwise relationships between samples | |
| 166 | +| |out.eigenvec |Best clustering and PCA results | |
| 167 | +| |out.eigenval |Best clustering and PCA results | |
| 168 | +| |out.PCA1_PCA2.pdf |Plot of PCA 1 vs PCA 2, colored by cluster | |
| 169 | +| |out.K.pdf |Cluster K plot | |
| 170 | +| |out.cluster |Clustering results for each K | |
| 171 | +| |Out.cellbin.gz |bin50 cell results (only with -InSTOgem) | |
| 172 | +| |Out.cluster pdf/png |Cluster plot in spatial coordinates (only with -InSTOgem) | |
| 174 | + |
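| 175 | +Example figures are shown with the application scenarios above. The figures and formats should be self-explanatory; related plots can be found in Example 1 and 2. |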
| 176 | + |
| 177 | + |
| 178 | +### 4 Example |
| 179 | +------------ |
| 180 | + |
| 181 | +</br>See more detailed usage in the <b>[Chinese Documentation](https://github.com/hewm2008/MingPCACluster/blob/main/Ming2PCACluster使用手册_manual_chinese.pdf)</b> |
| 182 | +</br>See more detailed usage in the <b>[English Documentation](https://github.com/hewm2008/MingPCACluster/blob/main/Ming2PCACluster使用手册_manual_chinese.pdf)</b> |
| 183 | +</br>See the example directory and Manual.pdf for more detail. |
| 184 | +</br>The example data and scripts are described in Manual.pdf; tutorials will be released on other websites later. |
| 185 | +</br></br> |
| 186 | +../../bin/MingPCACluster -InVCF in.vcf.gz -OutPut outPrefix |
| 187 | +</br> The Example/example*/ directories contain the inputs, outputs, and script usage. |
| 188 | + |
| 189 | + |
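| 190 | +* Example 1) 1000 Genomes resequencing SNP genotypes (VCF) |
| 191 | +</br> 3492 sites were randomly picked from chr22 dbSNP in the 1000 Genomes data, and 203 samples from CEU (49), CHB (46), JPT (56) and YRI (52) were selected for analysis. |
| 192 | +</br>Clustering trend, best K |
| 193 | +</br> |
| 194 | +</br>PCA result |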
| 195 | +</br> |
| 196 | + |
| 197 | + |
| 198 | +* Example 2) PCA and clustering of cellbin spatiotemporal cell expression |
| 199 | + |
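| 200 | +</br>As far as I know, spatiotemporal analysis is mainly done with Seurat, which I understand only superficially. That package uses an n*m sparse matrix (n samples, m sites), and people around me who work on spatiotemporal data often say it needs a lot of memory. I am not very familiar with spatiotemporal data; here the expression values are log10-transformed, and both a sparse matrix and an n*n matrix are used. Because n (the number of samples) is very large for spatiotemporal data, memory is hard to reduce further. |
| 201 | +</br>As a first test, I used a 177M input file (File.gem.gz) with range XXmin: 4975, XXmax: 23374, YYmin: 2525, YYmax: 20724. With bin 50, n reaches 88507, so memory is dominated by the 88507*88507 matrix of doubles; total usage was 60.742G (sparse matrix: 5G; matrix: 55G). |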
| 202 | + |
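The memory figure in Example 2 follows from the n*n matrix of doubles. A back-of-the-envelope check is sketched below, assuming a full dense matrix with 8 bytes per value; the function name is hypothetical. The estimate is of the same order as the 55G reported above but not identical, so the program's actual storage or the way memory was measured may differ from this simple model.

```python
def dense_matrix_bytes(n, bytes_per_value=8):
    """Bytes needed for a full dense n x n matrix (8 bytes = one double)."""
    return n * n * bytes_per_value


n_cells = 88507  # number of bin-50 cells reported in Example 2
total_bytes = dense_matrix_bytes(n_cells)
size_gib = total_bytes / 2**30  # binary gigabytes
size_gb = total_bytes / 1e9     # decimal gigabytes

print(f"{n_cells} x {n_cells} doubles: {size_gib:.1f} GiB ({size_gb:.1f} GB)")
```

Since the cost grows with the square of n, doubling the number of cells roughly quadruples the memory needed for this matrix, which is why a larger `-CellBin` value (fewer, bigger bins) is the main lever for reducing memory.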
| 203 | +</br> PCA K Fig |
| 204 | +</br> |
| 205 | +</br> PCA plot Fig |
| 206 | +</br> |
| 207 | +</br> STOmics Cluster plot Fig |
| 208 | +</br> |
| 209 | + |
| 210 | + |
| 211 | +### 5 Advantages |
| 212 | + |
| 213 | +</br>Fast, with low memory use |
| 214 | +</br>Simple and easy to use |
| 215 | +</br>No installation required |
| 216 | + |
| 217 | + |
| 218 | +### 6 Discussion |
| 219 | +------------ |
| 220 | +- [:email:](https://github.com/hewm2008/MingPCACluster) hewm2008@gmail.com / hewm2008@qq.com |
| 221 | +- join the <b><i>QQ Group : 125293663</i></b> |
| 222 | + |
| 223 | +######################swimming in the sky and flying in the sea ############################# |
| 224 | + |