wangchuang2017


Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm


  1. Aleksey V. Zimin (1,2)
  2. Daniela Puiu (1)
  3. Ming-Cheng Luo (3)
  4. Tingting Zhu (3)
  5. Sergey Koren (4)
  6. Guillaume Marçais (2,5)
  7. James A. Yorke (2,6)
  8. Jan Dvořák (3)
  9. Steven L. Salzberg (1,7)

Author Affiliations

  1. Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA
  2. Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA
  3. Department of Plant Sciences, University of California, Davis, California 95616, USA
  4. National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
  5. Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
  6. Departments of Mathematics and Physics, University of Maryland, College Park, Maryland 20742, USA
  7. Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, Maryland 21218, USA

Corresponding author: salzberg@jhu.edu

Abstract

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
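
The N50 contig size quoted in the abstract is a standard contiguity statistic: the length of the shortest contig in the smallest set of largest contigs that together cover at least half of the total assembled length. A minimal Python sketch of that calculation follows. The function name `n50` and the toy contig lengths are illustrative assumptions; they are not taken from the paper or from the MaSuRCA software.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    account for at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    total = sum(lengths)
    running = 0
    # Walk from the largest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (lengths in nucleotides), not data from the paper.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

Because the statistic weights contigs by the bases they contain, a handful of very long contigs raises N50 far more than many short ones, which is why it is commonly reported as a measure of assembly contiguity.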


Footnotes

  • Received July 26, 2016.
  • Accepted January 18, 2017.

This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

Posted on 2020-10-04 20:20 by 王闯 (wangchuang2017)
