Commit ecc426e

Update README.md
1 parent 705b183 commit ecc426e

1 file changed

+223
-1
lines changed

README.md

Lines changed: 223 additions & 1 deletion
@@ -1,2 +1,224 @@
1+
12
# MingPCACluster
2-
A new simple and efficient software to PCA and Cluster For popolation VCF File or STOmics gem File
3+
A simple and efficient tool for PCA and clustering of population VCF files or STOmics gem files
4+
5+
### 1 Introduction
6+
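<b>MingPCACluster</b> is a PCA and clustering tool built around VCF input. It also accepts Genotype and similar formats, and adds support for spatiotemporal single-cell expression files (xx.gem.gz) as a beta feature.
7+
Given a single supported input file, it runs PCA, clustering, and group-colored plotting in one step.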
8+
</br>
9+
</br> keyword : VCF2PCA ;&nbsp;&nbsp;&nbsp;&nbsp; VCF2Kinship ;&nbsp;&nbsp;&nbsp;&nbsp; cluster;&nbsp;&nbsp;&nbsp;&nbsp; k-means ;&nbsp;&nbsp;&nbsp;&nbsp; cellbin ;&nbsp;&nbsp;&nbsp;&nbsp;STOmics
10+
11+
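</br>Highlights:
12+
</br>1 Results match those of [tassel](https://www.maizegenetics.net/tassel), [gapit](https://zzlab.net/GAPIT/) and [gcta](https://yanglab.westlake.edu.cn/software/gcta/#Overview), differing only in numerical precision.
13+
</br>2 Features: (1) several kinship matrices, (2) PCA results, (3) clustering results, and (4) plots colored by cluster.
14+
</br>3 One VCF in, everything done in one step, for ease of use.
15+
</br>4 Sites are processed as they are read, so memory use does not depend on the number of sites (for spatiotemporal data, on the number of genes). Memory depends only on the number of samples, so 100k or more samples should be feasible. Spatiotemporal cell PCA and clustering are built on this basis, even though spatiotemporal data are dominated by a large sample count. (About 60G of memory for 80K samples.)
16+
</br>5 K-means clustering with automatic selection of the best K, comparable to the K value in Structure. Plots are colored by this clustering.
17+
</br>6 A small plotting script is provided and can be used to fine-tune the figures.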
18+
19+
</br>
20+
21+
</br>The program is intended for users with some bioinformatics background; complete beginners may find it hard to follow.
22+
</br>
23+
</br><b>MingPCACluster</b> is PCA analysis software built around VCF input. It also accepts Genotype and other formats, and includes a beta feature for spatiotemporal cell expression files. As long as a supported input is provided, the PCA and the cluster groups are produced in the same run.
24+
25+
26+
### 2 Download and Install
27+
------------
28+
<b>New versions</b> are released and maintained at <b>[hewm2008/MingPCACluster](https://github.com/hewm2008/MingPCACluster)</b>; follow the link below to download the latest version
29+
</br><p align="center"><b>[hewm2008/MingPCACluster](https://github.com/hewm2008/MingPCACluster)</b></p>
30+
31+
<b> 2.1. Linux/macOS&nbsp;&nbsp;&nbsp; [Download](https://github.com/hewm2008/MingPCACluster/archive/v1.00.tar.gz)</b>
32+
33+
</br> <b>2.2 Pre-install</b>
34+
</br> MingPCACluster is for Linux/Unix/macOS only.
35+
</br>Before installing, please make sure the following prerequisites are available.
36+
</br> 1) [convert](https://linux.die.net/man/1/convert) : recommended but not required
37+
</br> 2) g++ : [g++](https://gcc.gnu.org/) 4.8 or later with --std=c++11 support is recommended
38+
 </br> 3) zlib : [zlib](https://zlib.net/) > 1.2.3 is recommended
39+
 </br> 4) R : [R](https://www.r-project.org/) with [ggplot](http://ggplot.yhathq.com/) is recommended
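A quick way to see which of the command-line prerequisites above are already on `$PATH` is sketched below. This is only a convenience sketch and not part of MingPCACluster. It assumes `Rscript` as a stand-in for an R installation, and it cannot check zlib, which is a library rather than a command.

```python
import shutil

# Tool names taken from the prerequisite list above.
# "Rscript" is an assumption: it is used here as a proxy for an R installation.
TOOLS = ["g++", "convert", "Rscript"]


def find_tools(names):
    """Map each tool name to its full path on $PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(TOOLS).items():
        print(f"{name:8s} {path or 'NOT FOUND'}")
```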
40+
41+
</br> <b>2.3 Install</b>
42+
</br> Users can install it with the following options:
43+
</br> Option 1:
44+
<pre>
45+
git clone https://github.com/hewm2008/MingPCACluster.git
46+
cd MingPCACluster; chmod 755 -R bin/*
47+
./bin/MingPCACluster -h
48+
</pre>
49+
50+
51+
### 3 Parameter description
52+
------------
53+
</br><b>3.1 MingPCACluster</b>
54+
</br><b>3.1.1 Main parameter</b>
55+
56+
```php
57+
58+
Usage: Ming2PCACluster -InVCF <in.vcf.gz> -OutPut <outPrefix>
59+
60+
-InVCF <str> Input SNP VCF Format
61+
-InGenotype <str> InPut Genotype File
62+
-InSTOgem <str> InPut STOmics gem File of MIDCounts(beta)
63+
-InKinship <str> Input SNP K Kinship File Format
64+
-OutPut <str> OutPut File Prefix(Kinship PCA etc)
65+
66+
67+
-SubPop <str> SubGroup Sample File List[ALLsample]
68+
-Method <int> Method of Kinship [1-4],defaut [1]
69+
1:BaldingNicolsKinship(VanRaden/Normalized_IBS)
70+
2:IBSKinshipImpute 3:IBSKinship 4:p_dis
71+
72+
-help Show more Parameters and help [hewm2008]
73+
74+
75+
```
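To give a feel for what an IBS-style kinship (as in `-Method 3`) measures, here is a minimal sketch of the textbook identity-by-state similarity. It assumes genotypes coded as 0/1/2 copies of the alternate allele with `None` for missing calls. The exact formula and missing-data handling used by MingPCACluster may differ, so treat this as an illustration only.

```python
def ibs_similarity(a, b):
    """Textbook IBS: fraction of alleles shared over sites called in both samples."""
    shared = 0
    compared = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # skip sites where either genotype is missing
        shared += 2 - abs(ga - gb)  # 2, 1 or 0 alleles shared at this site
        compared += 2
    return shared / compared if compared else float("nan")


def ibs_matrix(genotypes):
    """Pairwise n*n IBS matrix for a list of per-sample genotype lists."""
    n = len(genotypes)
    return [[ibs_similarity(genotypes[i], genotypes[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Three hypothetical samples, four sites.
    samples = [[0, 1, 2, None], [0, 1, 2, 2], [2, 1, 0, 0]]
    for row in ibs_matrix(samples):
        print(" ".join(f"{v:.3f}" for v in row))
```

The result is an n*n matrix over samples, which is consistent with the note above that memory depends on the number of samples rather than the number of sites.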
76+
</br> Brief description of usage:
77+
<pre>
78+
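# Usage is straightforward: at minimum, one input file and one output prefix
79+
# See the PDF manual for input file formats; the main ones are VCF and gem files
80+
# More notes will be posted on Zhihu later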
81+
82+
Ming2PCACluster -InSTOgem Test.gem.gz -OutPut Test -CellBin 100
83+
84+
### run without pop.info
85+
# Ming2PCACluster -InVCF Khuman.vcf.gz -OutPut OUT
86+
### run with pop.info
87+
Ming2PCACluster -InVCF Khuman.vcf.gz -OutPut OUT -InSampleGroup pop.info
88+
89+
</pre>
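The `pop.info` file passed to `-InSampleGroup` is described in the help text only as `format(sample groupA)`, that is, one sample name and one group label per line. A minimal sketch for generating such a file follows. The sample names are placeholders, and the tab separator is an assumption (the help text does not say which whitespace is expected).

```python
def format_pop_info(groups):
    """Render {sample: group} as one 'sample<TAB>group' line per sample."""
    lines = [f"{sample}\t{group}" for sample, group in groups.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder sample names; replace with the sample IDs from the VCF header.
    groups = {"sample1": "CEU", "sample2": "CHB", "sample3": "YRI"}
    with open("pop.info", "w") as handle:
        handle.write(format_pop_info(groups))
```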
90+
91+
</br><b>3.1.2 Detail parameters</b>
92+
```php
93+
94+
Usage: Ming2PCACluster -InVCF <in.vcf.gz> -OutPut <outPrefix>
95+
96+
-InVCF <str> Input SNP VCF Format
97+
-InGenotype <str> InPut Genotype File
98+
-InSTOgem <str> InPut STOmics gem File of MIDCounts(beta)
99+
-InKinship <str> Input SNP K Kinship File Format
100+
-OutPut <str> OutPut File Prefix(Kinship PCA etc)
101+
102+
103+
-SubPop <str> SubGroup Sample File List[ALLsample]
104+
-Method <int> Method of Kinship [1-4],defaut [1]
105+
1:BaldingNicolsKinship(VanRaden/Normalized_IBS)
106+
2:IBSKinshipImpute 3:IBSKinship 4:p_dis
107+
108+
-help Show more Parameters and help [hewm2008]
109+
110+
111+
-MAF <float> Min minor allele frequency filter [0.001]
112+
-Fchr <str> Filter the chrX chr[chrX,chrY,X,Y]
113+
-Miss <float> Max ratio of miss allele filter [0.25]
114+
-Het <float> Max ratio of het allele filter [1.00]
115+
-HWE <float> Exact test of Hardy-Weinberg Equilibrium for SNP Pvalue[0]
116+
-CellBin <int> STOmics cell bin[50]
117+
-KeepRemainVCF keep the VCF after filter
118+
119+
-InSampleGroup <string> In File of sampleGroup info,format(sample groupA)
120+
121+
-PCANum <int> Num of PCA eig [10]
122+
-MaxCluNum <int> Max Cluster Num to find Best K [12]
123+
-BestKRatio <float> Get the best K Cluster by deta-SSE Ratio[0.1]
124+
-STOName <string> STOmics Sample Name STOName
125+
126+
127+
```
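The `-MAF`, `-Miss` and `-Het` filters act per site. The sketch below shows how such per-site statistics are commonly computed, using the default thresholds from the help text. It assumes genotypes coded as 0/1/2 alternate-allele copies with `None` for missing. Whether the program treats the thresholds as inclusive, and how it handles multi-allelic sites, is not stated here, so those details are assumptions.

```python
def site_stats(genotypes):
    """Return (missing rate, minor allele frequency, het rate) for one site."""
    called = [g for g in genotypes if g is not None]
    miss = 1 - len(called) / len(genotypes)
    if not called:
        return miss, 0.0, 0.0
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    het = sum(1 for g in called if g == 1) / len(called)
    return miss, maf, het


def keep_site(genotypes, min_maf=0.001, max_miss=0.25, max_het=1.0):
    """Apply the three filters; inclusive bounds are an assumption."""
    miss, maf, het = site_stats(genotypes)
    return miss <= max_miss and maf >= min_maf and het <= max_het


if __name__ == "__main__":
    print(keep_site([0, 1, 2, None]))   # variable site, 25% missing
    print(keep_site([0, 0, 0, 0]))      # monomorphic site
```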
128+
129+
</br><b>3.2 Other parameters</b>
130+
</br>A Perl plotting script is also provided (it will be reworked substantially later; the author has been busy recently). Its main parameters are summarized below:
131+
132+
```php
133+
ploteig -h
134+
135+
Version:1.0 hewm2008@gmail.com
136+
137+
Usage: ploteig -InPCA pca.eigenvec -OutPrefix Fig
138+
139+
140+
Options
141+
142+
-InPCA <s> : InPut File of PCA
143+
-OutPrefix <s> : OutPut file prefix
144+
145+
-BinDir <s> : The Bin Dir of gnuplot/R/ps2pdf/convert [$PATH]
146+
147+
-help : Show more help [hewm2008]
148+
149+
-columns <s> : the columns to plot a:b [3:4]
150+
-pops <s> : Populations to plot, eg -p GA:GB:GC [ALL]
151+
-border <i> : how to plot the border (1,2,4,8,3,31 ) [3]
152+
-title <s> : title (legend) [PCA]
153+
-keystyle <s> : put key at top right default(in) [outside]box [outside]
154+
-pointsize <i> : point size for plot [3]
155+
156+
157+
```
158+
159+
</br><b>3.3 Output files</b>
160+
161+
162+
|Module | outFile | Description |
163+
|:-----:|:-------------------|:------------------------------------------------------------|
164+
| List | | |
165+
| |out.kinship |Kinship matrix: pairwise relationships between samples |
166+
| |out.eigenvec |Best clustering and PCA results |
167+
| |out.eigenval |Best clustering and PCA results |
168+
| |out.PCA1_PCA2.pdf |Plot of PCA 1 vs PCA 2, colored by cluster |
169+
| |out.K.pdf |Cluster K plot |
170+
| |out.cluster |Clustering results for each K |
171+
| |Out.cellbin.gz |Binned cell (bin50) results, with -InSTOgem only |
172+
| |Out.cluster pdf/png |Spatial-coordinate cluster plot, with -InSTOgem only |
173+
174+
175+
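Example figures are shown in the application scenarios above. The figures and formats should be self-explanatory; see Example 1 and Example 2 for related plots.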
176+
177+
178+
### 4 Example
179+
------------
180+
181+
</br>See more detailed usage in the&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>[Chinese Documentation](https://github.com/hewm2008/MingPCACluster/blob/main/Ming2PCACluster使用手册_manual_chinese.pdf)</b>
182+
</br>See more detailed usage in the&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <b>[English Documentation](https://github.com/hewm2008/MingPCACluster/blob/main/Ming2PCACluster使用手册_manual_chinese.pdf)</b>
183+
</br>See the example directory and Manual.pdf for more detail.
184+
</br>See Manual.pdf for the sample data and scripts; tutorials will be published online later.
185+
</br></br>
186+
../../bin/MingPCACluster -InVCF in.vcf.gz -OutPut outPrefix
187+
</br> The Example/example*/ directories contain the inputs, outputs, and script usage.
188+
189+
190+
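* Example 1) SNP genotypes from 1000 Genomes resequencing VCF
191+
</br> 3492 sites were randomly selected from chr22 dbSNP of the 1000 Genomes data, and 203 samples were analyzed: CEU (49), CHB (46), JPT (56), and YRI (52).
192+
</br>Clustering trend and best K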
193+
</br>![K_SSE.png](https://github.com/hewm2008/MingPCACluster/blob/main/xxample/Example1/K_SSE.png)
194+
</br>PCA result
195+
</br>![PCA.png](https://github.com/hewm2008/MingPCACluster/blob/main/xxample/Example1/PCA.png)
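The K_SSE plot shows how the k-means sum of squared errors (SSE) falls as K grows. The exact criterion behind `-BestKRatio` is not documented in this README, so the following is only one plausible elbow-style reading: keep increasing K while the relative SSE drop stays at or above the ratio. The function name and the SSE values are made up for illustration and are not program output.

```python
def best_k(sse_by_k, ratio=0.1):
    """Elbow-style pick: last K before the relative SSE drop falls below `ratio`."""
    ks = sorted(sse_by_k)
    best = ks[0]
    for prev, k in zip(ks, ks[1:]):
        if sse_by_k[prev] == 0:
            break  # nothing left to explain
        drop = (sse_by_k[prev] - sse_by_k[k]) / sse_by_k[prev]
        if drop < ratio:
            break
        best = k
    return best


if __name__ == "__main__":
    # Made-up SSE values with a clear elbow at K = 4.
    sse = {1: 1000.0, 2: 400.0, 3: 150.0, 4: 60.0, 5: 57.0, 6: 55.0}
    print(best_k(sse))  # prints 4
```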
196+
197+
198+
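* Example 2) PCA and clustering of cellbin spatiotemporal cell expression
199+
200+
</br>My initial understanding of spatiotemporal analysis comes mainly from seurat, which I know only superficially. That package uses an n*m sparse matrix (n samples, m sites), and people working on spatiotemporal data often report very high memory use. I am not very familiar with spatiotemporal data; here the expression values are log10-transformed. A sparse matrix and an n*n matrix are used as well, and because n (the number of samples) is very large for spatiotemporal data, memory use may be hard to reduce.
201+
</br>As a first test, I used a file of about 177M (File.gem.gz), with range XXmin: 4975, XXmax: 23374, YYmin: 2525, YYmax: 20724. With bin 50, n reached 88507, so memory is dominated by the 88507*88507 double matrix; total usage was 60.742G (sparse matrix: 5G, dense matrix: 55G).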
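As a rough cross-check of the figures above, the size of a full n*n matrix of 8-byte doubles can be estimated directly. This is a back-of-the-envelope sketch; how the program actually stores the matrix (for example, full versus triangular) is not stated, so the estimate will not match the reported 55G exactly.

```python
def dense_matrix_gib(n, bytes_per_value=8):
    """Size in GiB of a full n*n matrix with `bytes_per_value` bytes per entry."""
    return n * n * bytes_per_value / 2**30


if __name__ == "__main__":
    # n = 88507 bins from the example above
    print(f"{dense_matrix_gib(88507):.1f} GiB")  # prints 58.4 GiB
```

Because the cost grows with n squared, doubling the bin size (which cuts the number of bins to roughly a quarter) reduces this term by about a factor of 16.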
202+
203+
</br> PCA K Fig
204+
</br>![out1.png](https://github.com/hewm2008/MingPCACluster/blob/main/example/Example2/OUT1.png)
205+
</br> PCA plot Fig
206+
</br>![out2.png](https://github.com/hewm2008/MingPCACluster/blob/main/example/Example2/OUT2.png)
207+
</br> STOmics Cluster plot Fig
208+
</br>![out3.png](https://github.com/hewm2008/MingPCACluster/blob/main/example/Example2/OUT3.png)
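The idea behind `-CellBin` can be sketched as follows: spots are grouped into square bins by integer division of their coordinates, and counts are summed per bin and gene. The record layout `(geneID, x, y, MIDCount)` and bins anchored at the coordinate origin are assumptions for illustration; the program's own parsing and bin anchoring may differ.

```python
from collections import defaultdict


def bin_counts(records, bin_size=50):
    """Sum counts per (bin_x, bin_y, gene) for (gene, x, y, count) records."""
    binned = defaultdict(int)
    for gene, x, y, count in records:
        key = (x // bin_size, y // bin_size, gene)
        binned[key] += count
    return dict(binned)


if __name__ == "__main__":
    # Hypothetical records; gene names and counts are placeholders.
    records = [
        ("GeneA", 4975, 2525, 2),
        ("GeneA", 4999, 2549, 1),
        ("GeneA", 5000, 2550, 4),
        ("GeneB", 4980, 2530, 3),
    ]
    for key, total in sorted(bin_counts(records).items()):
        print(key, total)
```

Each distinct `(bin_x, bin_y)` pair then plays the role of one "sample" in the PCA, which is why a smaller bin size increases n and therefore memory.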
209+
210+
211+
### 5 Advantages
212+
213+
</br>Fast, with low memory use
214+
</br>Simple and easy to use
215+
</br>No installation required
216+
217+
218+
### 6 Discussion
219+
------------
220+
- [:email:](https://github.com/hewm2008/MingPCACluster) hewm2008@gmail.com / hewm2008@qq.com
221+
- join the <b><i>QQ Group : 125293663</i></b>
222+
223+
######################swimming in the sky and flying in the sea #############################
224+
