Completion of draft bacterial genomes by long-read sequencing of synthetic genomic pools
BMC Genomics volume 21, Article number: 519 (2020)
Abstract
Background
Illumina technology currently dominates bacterial genomics due to its high read accuracy and low sequencing cost. However, the incompleteness of draft genomes generated by Illumina reads limits their application in comprehensive genomics analyses. Alternatively, hybrid assembly using both Illumina short reads and long reads generated by single molecule sequencing technologies can enable assembly of complete bacterial genomes, yet the high per-genome cost of long-read sequencing limits the widespread use of this approach in bacterial genomics. Here we developed a protocol for hybrid assembly of complete bacterial genomes using miniaturized multiplexed Illumina sequencing and non-barcoded PacBio sequencing of a synthetic genomic pool (SGP), thus significantly decreasing the overall per-genome cost of sequencing.
Results
We evaluated the performance of SGP hybrid assembly on the genomes of 20 bacterial isolates with different genome sizes, a wide range of GC contents, and varying levels of phylogenetic relatedness. By improving the contiguity of Illumina assemblies, SGP hybrid assembly generated 17 complete and 3 nearly complete bacterial genomes. Increased contiguity of SGP hybrid assemblies resulted in considerable improvement in gene prediction and annotation. In addition, SGP hybrid assembly was able to resolve repeat elements and identify intragenomic heterogeneities, e.g. different copies of 16S rRNA genes, that would otherwise go undetected by short-read-only assembly. Comprehensive comparison of SGP hybrid assemblies with those generated using multiplexed PacBio long reads (long-read-only assembly) also revealed the relative advantage of SGP hybrid assembly in terms of assembly quality. In particular, we observed that SGP hybrid assemblies were completely devoid of both small (i.e. single base substitutions) and large assembly errors. Finally, we show the ability of SGP hybrid assembly to differentiate genomes of closely related bacterial isolates, suggesting its potential application in comparative genomics and pangenome analysis.
Conclusion
Our results indicate the superiority of SGP hybrid assembly over both short-read and long-read assemblies with respect to completeness, contiguity, accuracy, and recovery of small replicons. By lowering the per-genome cost of sequencing, our parallel sequencing and hybrid assembly pipeline could serve as a cost-effective, high-throughput approach for completing high-quality bacterial genomes.
Background
De novo genome assembly is a valuable tool for studying the biology of bacteria. From understanding the evolutionary processes underlying host adaptation [1] and development of drug resistance [2, 3], to investigating the genetic diversity among closely related bacteria [4, 5], to identification of novel biosynthetic gene clusters for discovery of therapeutically relevant natural products [6, 7], bacterial genomics research relies on accurate reconstruction of genomes from DNA sequencing reads. Currently, Illumina sequencing dominates the genomics field due to its low error rate and ever decreasing per-base cost of sequencing [8]. However, reads generated by Illumina platforms are typically shorter than repeat elements in bacterial genomes [9]. Consequently, de novo assembly using short reads often fails to resolve the majority of repeats in bacterial genomes, resulting in unfinished final assemblies composed of fragmented contiguous sequences (contigs) [10]. These draft genomes usually contain assembly errors that are problematic for accurate prediction of protein coding sequences (CDSs) and gene annotation [11].
On the other hand, single molecule sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies generate sequencing reads of several kilobases, which can resolve the majority of repeat elements in bacterial genomes and improve the contiguity of assemblies. However, long reads generated by these platforms are error-prone [12], resulting in the introduction of single base substitutions and small insertions/deletions (indels) into the final assembly [13]. By taking advantage of both the accuracy of Illumina sequencing and the read length of single molecule sequencing, hybrid de novo assembly can resolve the majority of complex genomic structures (e.g. repetitive mobile elements) without compromising the accuracy of the final assembly [10, 13, 14]. The main limitation of this approach is its high per-genome cost of sequencing, particularly for preparing multiplexed (barcoded) long-read libraries, which can be limiting for large scale microbial genomics studies.
To address this limitation, we devised a methodological framework for hybrid sequencing and assembly of complete bacterial genomes without the need for multiplexing PacBio libraries. The driving idea behind our approach is that contigs generated by de novo assembly of barcoded short reads can be leveraged for sorting non-barcoded long reads of individual genomes within a moderately complex synthetic genomic pool. Subsequently, sorted long reads can be used to scaffold and resolve fragmented short-read assemblies via hybrid de novo assembly [10].
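The read-sorting idea described above can be illustrated with a toy Python sketch. It is not the authors' pipeline: this section does not say how long reads are matched to contigs, so the sketch substitutes exact k-mer sharing for a real long-read aligner. The function names, the `min_hits` threshold, and the example sequences are all hypothetical.

```python
from collections import Counter, defaultdict


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def kmers(seq, k):
    """Set of all k-mers in a sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(assemblies, k):
    """Map each k-mer to the isolates whose short-read contigs contain it.

    `assemblies` maps an isolate name to a list of contig sequences.
    Both strands are indexed because long reads may come from either strand.
    """
    index = defaultdict(set)
    for isolate, contigs in assemblies.items():
        for contig in contigs:
            for km in kmers(contig, k) | kmers(revcomp(contig), k):
                index[km].add(isolate)
    return index


def sort_read(read, index, k, min_hits=2):
    """Assign a non-barcoded long read to one isolate, or None if ambiguous.

    Only k-mers found in exactly one isolate vote, so sequence shared
    between closely related genomes does not decide the assignment.
    """
    votes = Counter()
    for km in kmers(read, k):
        owners = index.get(km, ())
        if len(owners) == 1:
            votes[next(iter(owners))] += 1
    if not votes:
        return None
    ranked = votes.most_common(2)
    best, best_hits = ranked[0]
    if best_hits < min_hits:
        return None
    if len(ranked) > 1 and ranked[1][1] == best_hits:
        return None  # tie between two isolates: leave the read unsorted
    return best


# Hypothetical toy contigs standing in for two barcoded Illumina assemblies.
assemblies = {
    "isolate_A": ["ACGTACGGTTCAGGCTA"],
    "isolate_B": ["TTTTGGGGCCCCAAAATTGC"],
}
index = build_index(assemblies, k=5)
```

Calling `sort_read("CGTACGGTTCAG", index, k=5)` assigns the read to `isolate_A`, and a read sharing no k-mers with any contig is left unsorted. Real long reads contain errors that break exact k-mer matches, so an alignment-based approach would be the more realistic choice in practice.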
Results
A schematic overview of the sequencing workflow and bioinformatics pipeline used for performing SGP hybrid assembly is provided in Fig. 1. We evaluated the precision of this protocol by sequencing the genomes of 20 isolates of the human gut microbiota, with different genome sizes (2.58–6.60 Mbp), GC contents (31.38–63.38%), and genomic similarity (Mash distances ranging from 0.00002 to 1.00; Additional file 1). By combining the genomic DNA of these isolates into a synthetic genomic pool (total size ~77 Mbp), we considerably reduced both the hands-on time and the cost of preparing long-read sequencing libraries compared to the standard PacBio multiplexing protocol (see Additional file 2 for a detailed comparison of the cost of SMRT library preparation between the SGP and standard multiplexing approaches). Of the 20 genomes included in the SGP, we were able to assemble 17 complete genomes, 2 nearly complete genomes (Alistipes onderdonkii GC304: genome size = 3.75 Mbp, N50 = 3.73 Mbp, chromosomal contigs = 3; and Coprobacillus cateniformis GC273: genome size = 3.69 Mbp, N50 = 3.68 Mbp, chromosomal contigs = 3), and one partially fragmented genome (Bacteroides dorei GC431: genome size = 5.93 Mbp, N50 = 4.01 Mbp, chromosomal contigs = 11, extrachromosomal circular assemblies = 7) (Fig. 2a and Additional file 3).
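The N50 values reported above summarize assembly contiguity: N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly size. A minimal Python sketch of that standard definition follows. The function name and the example lengths are illustrative and are not taken from the paper's data.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches half the total.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one contig")


# Hypothetical contig lengths (in bp) for a fragmented draft assembly.
draft = [40, 30, 20, 10]
print(n50(draft))
```

An N50 close to the total genome size, as for the two nearly complete genomes above, indicates that a single contig carries most of the chromosome.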
Posted on 2020-09-30 20:07 by 王闯wangchuang2017