Recommended | VCF2PCACluster: fast PCA analysis of very large SNP datasets

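Here is an efficient tool for PCA analysis: a one-command workflow that goes straight from a VCF file to finished plots, which makes it very beginner-friendly. I have been using it since Xiaoming (小明哥) first started developing it, back when it was still called MingPCACluster. It was finally published this May, so congratulations.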

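VCF2PCACluster is a PCA and clustering program built around population SNP data in VCF format. It also accepts other input formats such as Genotype files, so a single input file is enough to get PCA, grouping, and plots in one step. It is simple, easy to use, and efficient. Its main features are: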

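SNP site filtering, e.g. tri-allelic sites, MAF
Five algorithms for computing the kinship matrix
PCA based on the kinship matrix
Three clustering algorithms applied to the PCA results
Visualization based on the PCA and clustering results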

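Main highlight: PCA and cluster plots are produced efficiently in a single step. The core selling points are high speed and low memory use, plus a one-step workflow that takes one input file through to PCA results and plots, which keeps it user-friendly.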

Repository: https://github.com/hewm2008/VCF2PCACluster

In my own use it really has been much faster than most other software.

Installation

git clone https://github.com/hewm2008/VCF2PCACluster.git
cd VCF2PCACluster;	chmod 755 -R bin/*
./bin/VCF2PCACluster  -h  ### print help information

Main parameters



	# for more Help document please see the manual.	Para [-i] is show for [-InVCF], Para [-o] is show for [-OutPut]

	Usage: VCF2PCACluster  -InVCF in.vcf.gz  -OutPut outPrefix [options]

		-InVCF         <str>      Input SNP VCF Format
		-InKinship     <str>      Input SNP K Kinship File Format
		-OutPut        <str>      OutPut File Prefix(Kinship PCA etc)

		-KinshipMethod <int>      Method of Kinship [1-5],defaut [1]
		                          1:Normalized_IBS(Yang/BaldingNicolsKinship)
		                          2:Centered_IBS(VanRaden)
		                          3:IBSKinshipImpute 4:IBSKinship 5:p_dis
		-ClusterMethod <str>      Method For Cluster[EM/Kmean/DBSCAN/None] [EM]

		-help          v1.40      Show more Parameters and help [hewm2008]

	    InFile:
		-InGenotype    <str>      InPut Genotype File for no VCF file
		-InSubSample   <str>      Only keep samples from subsample List for PCA[ALLsample]
		-InSampleGroup <str>      InFile of sample Group info,format(sample groupA)

	    SNP Filtering:
		-MAF           <float>    Min minor allele frequency filter [0.001]
		-Miss          <float>    Max ratio of miss allele filter [0.25]
		-Het           <float>    Max ratio of het allele filter [1.00]
		-HWE           <float>    Exact test of Hardy-Weinberg Equilibrium for SNP Pvalue[0]
		-Fchr          <str>      Filter the chrX chr[chrX,chrY,X,Y]
		-KeepRemainVCF            keep the VCF after filter

	    Clustering:
		-RandomCenter             Random diff-center to Re-Run Cluster for Kmean
		-BestKManually <int>      manually set the Best K (Num of Cluster) (auto)
		-BestKRatio    <float>    Get the best K Cluster by deta-SSE Ratio[0.15]
		-MinPointNum   <int>      Minimum point number of D-cluster[4]
		-Epsilon       <float>    Epsilon for DBSCAN_Distance/EM_convergence (auto)
		-Iterations    <int>      iterations number for EM clustering[1000]

	    OutPut:
		-PCnum         <int>      Num of PC eig [10]
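
As a rough illustration of how the flags listed above might be combined, the sketch below builds a sample group file and then prints a candidate command line. The file names (`toy.vcf`, `sample.group`, `toy_pca`), the label `groupA`, and the values `0.05`, `0.2`, and `5` are placeholders chosen for illustration. The space-separated two-column layout is an assumption read off the `format(sample groupA)` hint in the help text. The command is only echoed, not executed, so it can be reviewed first.

```shell
# Toy VCF header line so the sketch runs on its own; use your real VCF instead.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n' > toy.vcf

# In a VCF, sample names are columns 10 and onward of the #CHROM header line.
# Every sample gets the placeholder label "groupA"; edit the file by hand afterwards.
awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i, "groupA"; exit }' toy.vcf > sample.group
cat sample.group

# Flags copied from the help text above; the values are illustrative only.
bin="./bin/VCF2PCACluster"
args="-InVCF toy.vcf -OutPut toy_pca -InSampleGroup sample.group -MAF 0.05 -Miss 0.2 -PCnum 5"

# Print the command for review; run it yourself once the inputs are real.
echo "would run: $bin $args"
```

With a real dataset, the group labels would normally come from population or breed metadata rather than a single placeholder.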

The software's documentation, available in both Chinese and English, is very detailed. See: https://github.com/hewm2008/VCF2PCACluster

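One drawback is that we often do not need the clustering results, so clustering would ideally be optional. Otherwise I still have to draw the plots myself, and in that case why not just use Plink?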
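
It may be worth noting that the help text above lists `None` among the `-ClusterMethod` values. Assuming that value skips the clustering step (this is inferred from the option name only and has not been verified here), a PCA-only run might look like the sketch below. The input `in.vcf.gz` and the prefix `pca_only` are placeholders, and the command is executed only when both the binary and the input file exist.

```shell
# Placeholders: adjust the binary path, input VCF, and output prefix.
bin="./bin/VCF2PCACluster"
cmd="$bin -InVCF in.vcf.gz -OutPut pca_only -ClusterMethod None"

if [ -x "$bin" ] && [ -f in.vcf.gz ]; then
    # Assumption: "None" disables clustering, leaving only kinship and PCA output.
    $cmd
else
    echo "dry run: $cmd"
fi
```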

Ramblings

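Xiaoming's bioinformatics tools are all about being low-key, simple, and practical: LDBlockShow, PopLDdecay, RectChr, Reseqtools, NGenomeSyn, and others. As his former colleague, I have long been a loyal fan of these tools, and I hope he keeps developing useful bioinformatics software. Everyone is welcome to use and cite them.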

Author: 郭小野

Posted 2024-06-15 22:09 by 生物信息与育种 (Bioinformatics and Breeding)

生物信息与育种 (Bioinformatics and Breeding)

Bioinformatics, AI, big data, and breeding. WeChat official account: 生物信息与育种
