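Genome Assembly Tools: How to Use SOAPdenovo

SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for genomes as large as the human genome.

The program is designed specifically for assembling Illumina GA short reads. The newer version reduces memory consumption during graph construction, resolves repeat regions in contig assembly, increases coverage and length in scaffold construction, and improves gap closing, which makes it better suited to large genome assemblies.

(SOAPdenovo was designed for large plant and animal genomes, but it also works on bacterial and fungal genomes. Assembling a genome of roughly human size may require about 150 GB of memory.)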

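1. Configuration file

Large genome assembly projects usually have several libraries. The configuration file tells the assembler where the read files of each library are located, along with other information about them.

The file consists of a global section and one section per library.

Global section: max_rd_len. Any read longer than this value is cut to this length.

Each library section starts with the tag [LIB] and may contain the following items: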

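1) avg_ins

The average insert size of the library, or the peak of the insert size distribution. (Background: insert sizes are not tightly controlled; in theory they follow a roughly normal distribution.)

2) reverse_seq

Takes the value 0 or 1 and tells the assembler whether the read sequences need to be reverse-complemented. Illumina GA produces two types of paired-end libraries: forward-reverse and reverse-forward. Set reverse_seq as follows: 0 for forward-reverse (generated from fragmented DNA ends, with a typical insert size below 500 bp); 1 for reverse-forward (generated from circularized libraries, with a typical insert size of 2 kb or more).

3) asm_flags

Decides in which part(s) of the assembly the reads are used: 1 (contig assembly only); 2 (scaffold assembly only); 3 (both contig and scaffold assembly); 4 (gap closure only).

4) rd_len_cutoff

The assembler cuts the reads from the current library to this length.

5) rank

An integer that decides the order in which reads are used during scaffold assembly. Libraries with the same rank are used at the same time.

6) pair_num_cutoff

The cutoff on the number of read pairs needed for a reliable connection between two contigs or pre-scaffolds. The minimum is 3 for paired-end reads and 5 for mate-pair reads.

7) map_len

Takes effect in the "map" step. It is the minimum alignment length between a read and a contig required for a reliable read location. The minimum is 32 for paired-end reads and 35 for mate-pair reads.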

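The assembler accepts three read formats: FASTA, FASTQ and BAM.

Pairing is expressed as follows: in FASTQ, the two mates sit at the same position in two files; in a single FASTA file, the two mates are adjacent records; BAM files are a special case.

In the configuration file, single-end read files are given as "f=/path/filename" (FASTA) or "q=/path/filename" (FASTQ).

Paired reads in two FASTA files are given as "f1=" and "f2="; paired reads in two FASTQ files are given as "q1=" and "q2=".

Paired reads stored together in a single FASTA file use "p="; reads in a BAM file use "b=".

Most of the items above are optional. If you are not sure how to set one, leave it out and the assembler will use its default value.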
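To make the layout concrete, here is a minimal sketch of a configuration with two libraries, written from a shell heredoc. The file name soap.config, the /path/to/... read files and all numeric values (read length 100, insert sizes 200 and 2000) are illustrative assumptions, not recommendations. Using the mate-pair library for scaffolding only (asm_flags=2) is likewise just one possible choice. The keys themselves are the ones described above.

```shell
#!/usr/bin/env bash
# Write a two-library SOAPdenovo config file.
# All paths and numbers below are placeholders chosen for illustration.
config=${1:-soap.config}

cat > "$config" <<'EOF'
# global setting: reads longer than this are cut to this length
max_rd_len=100
[LIB]
# short-insert paired-end library (forward-reverse)
avg_ins=200
reverse_seq=0
# use these reads for both contig and scaffold assembly
asm_flags=3
rd_len_cutoff=100
rank=1
pair_num_cutoff=3
map_len=32
q1=/path/to/pe200_read_1.fq
q2=/path/to/pe200_read_2.fq
[LIB]
# mate-pair library (reverse-forward), used for scaffolding only
avg_ins=2000
reverse_seq=1
asm_flags=2
rank=2
pair_num_cutoff=5
map_len=35
q1=/path/to/mp2k_read_1.fq
q2=/path/to/mp2k_read_2.fq
EOF

echo "wrote $config with $(grep -c '^\[LIB\]' "$config") libraries"
```

The second library gets a higher rank so that, during scaffolding, it is used after the short-insert library.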

2. Commands and options

The usual one-step way to run the whole pipeline:

${bin} all -s config_file -K 63 -R -o graph_prefix 1>ass.log 2>ass.err
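In the command above, ${bin} presumably stands for the SOAPdenovo executable. The sketch below picks one from the k-mer size, assuming the two executables are installed under the names SOAPdenovo-63mer and SOAPdenovo-127mer (adjust if your installation names them differently). The helper pick_bin is a hypothetical name introduced here. The script only prints the command so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# Pick a SOAPdenovo executable from the k-mer size.
# Assumption: the binaries are named SOAPdenovo-63mer and SOAPdenovo-127mer.
pick_bin() {
    local k=$1
    if [ "$k" -lt 13 ] || [ "$k" -gt 127 ]; then
        echo "K must be between 13 and 127, got $k" >&2
        return 1
    fi
    if [ "$k" -le 63 ]; then
        echo "SOAPdenovo-63mer"
    else
        echo "SOAPdenovo-127mer"
    fi
}

K=63
bin=$(pick_bin "$K") || exit 1
# Print the one-step command instead of executing it.
echo "$bin all -s config_file -K $K -R -o graph_prefix 1>ass.log 2>ass.err"
```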

Or run it in four steps:

${bin} pregraph -s config_file -K 63 -R -o graph_prefix 1>pregraph.log 2>pregraph.err
OR
${bin} sparse_pregraph -s config_file -K 63 -z 5000000000 -R -o graph_prefix 1>pregraph.log 2>pregraph.err

${bin} contig -g graph_prefix -R 1>contig.log 2>contig.err

${bin} map -s config_file -g graph_prefix 1>map.log 2>map.err

${bin} scaff -g graph_prefix -F 1>scaff.log 2>scaff.err
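The four steps depend on each other, so a wrapper that stops at the first failing step can save time. The following is a minimal sketch under stated assumptions: run_step and run_pipeline are hypothetical helper names, the defaults for bin, config, prefix and K are placeholders, and DRY_RUN defaults to 1 so that the script only prints the commands and can be tried without SOAPdenovo installed.

```shell
#!/usr/bin/env bash
# Run the four SOAPdenovo steps in order, stopping at the first failure.
# bin, config, prefix and K are placeholders; override them in the environment.
bin=${bin:-SOAPdenovo-63mer}
config=${config:-config_file}
prefix=${prefix:-graph_prefix}
K=${K:-63}
DRY_RUN=${DRY_RUN:-1}

# Run (or, in dry-run mode, print) one step with its own log files.
run_step() {
    local name=$1
    shift
    if [ "$DRY_RUN" = 1 ]; then
        echo "$bin $name $* 1>$name.log 2>$name.err"
    else
        "$bin" "$name" "$@" 1>"$name.log" 2>"$name.err"
    fi
}

# Chain the steps with && so a failing step prevents the later ones.
run_pipeline() {
    run_step pregraph -s "$config" -K "$K" -R -o "$prefix" &&
    run_step contig -g "$prefix" -R &&
    run_step map -s "$config" -g "$prefix" &&
    run_step scaff -g "$prefix" -F
}

run_pipeline || { echo "a step failed; check the *.err files" >&2; exit 1; }
```

Setting DRY_RUN=0 executes the commands for real.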

Options for all (pregraph-contig-map-scaff)

  -s <string>    configuration file
  -o <string>    output graph: prefix of the output graph file names
  -K <int>       kmer (min 13, max 63/127): k-mer size, [23]
  -p <int>       number of CPUs to use, [8]
  -a <int>       initial memory assumption, set to avoid further reallocation, in GB, [0]
  -d <int>       KmerFreqCutoff: kmers with frequency no larger than KmerFreqCutoff will be deleted, [0]
  -R (optional)  resolve repeats by reads, [NO]
  -D <int>       EdgeCovCutoff: edges with coverage no larger than EdgeCovCutoff will be deleted, [1]
  -M <int>       mergeLevel (min 0, max 3): the strength of merging similar sequences during contiging, [1]
  -m <int>       max k when using multi kmer
  -e <int>       weight to filter arc when linearizing two edges, [0]
  -r (optional)  keep available reads (*.read)
  -E (optional)  merge clean bubbles before iterating
  -f (optional)  output gap-related reads in the map step for using SRkgf to fill gaps, [NO]
  -k <int>       kmer_R2C (min 13, max 63): k-mer size used for mapping reads to contigs, [K]
  -F (optional)  fill gaps in scaffolds, [NO]
  -u (optional)  un-mask contigs with high/low coverage before scaffolding, [mask]
  -w (optional)  keep contigs weakly connected to other contigs in scaffolds, [NO]
  -G <int>       gapLenDiff: allowed length difference between estimated and filled gap, [50]
  -L <int>       minContigLen: shortest contig for scaffolding, [K+2]
  -c <float>     minContigCvg: minimum contig coverage (c*avgCvg); contigs shorter than 100 bp with coverage smaller than c*avgCvg will be masked before scaffolding unless -u is set, [0.1]
  -C <float>     maxContigCvg: maximum contig coverage (C*avgCvg); contigs with coverage larger than C*avgCvg, or contigs shorter than 100 bp with coverage larger than 0.8*C*avgCvg, will be masked before scaffolding unless -u is set, [2]
  -b <float>     insertSizeUpperBound: (b*avg_ins) will be used as the upper bound of insert size for large insert sizes (> 1000) when handling paired-end connections between contigs, if b is set larger than 1, [1.5]
  -B <float>     bubbleCoverage: remove the contig with lower coverage in a bubble structure if both contigs' coverage is smaller than bubbleCoverage*avgCvg, [0.6]
  -N <int>       genome size, [0]
  -V (optional)  output information for visualizing the assembly, [NO]
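The masking rule behind -c and -C is easier to follow with numbers. The sketch below restates the wording of the two help entries as a small shell function. The name is_masked is hypothetical, and this is an illustration of the rule as described in the option text, not SOAPdenovo's actual code. The example values (average coverage 30, default c=0.1 and C=2) are assumptions chosen so the thresholds come out as 3, 60 and 48.

```shell
#!/usr/bin/env bash
# Decide whether a contig would be masked before scaffolding, following the
# wording of the -c / -C help text. Illustrative only.
# usage: is_masked LENGTH COVERAGE AVG_COVERAGE [c] [C]
is_masked() {
    awk -v len="$1" -v cov="$2" -v avg="$3" -v c="${4:-0.1}" -v C="${5:-2}" 'BEGIN {
        low  = (len < 100 && cov < c * avg)        # short contig, coverage below c*avgCvg
        high = (cov > C * avg)                     # coverage above C*avgCvg
        near = (len < 100 && cov > 0.8 * C * avg)  # short contig, coverage above 0.8*C*avgCvg
        exit !(low || high || near)                # exit status 0 means masked
    }'
}

# avgCvg = 30 with c=0.1 and C=2: thresholds are 3, 60 and 48
is_masked 80 2 30   && echo "80 bp at 2x: masked"
is_masked 500 2 30  || echo "500 bp at 2x: kept"
is_masked 500 70 30 && echo "500 bp at 70x: masked"
is_masked 80 50 30  && echo "80 bp at 50x: masked"
```

With -u set, none of these contigs would be masked.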


posted @ 2016-06-22 11:01 Life·Intelligence
