Nanopore sequencing and assembly of a human genome with ultra-long reads
Nature Biotechnology volume 36, pages 338–345 (2018)
Abstract
We report the sequencing and assembly of a reference genome for the human GM12878 Utah/Ceph cell line using the MinION (Oxford Nanopore Technologies) nanopore sequencer. 91.2 Gb of sequence data, representing ∼30× theoretical coverage, were produced. Reference-based alignment enabled detection of large structural variants and epigenetic modifications. De novo assembly of nanopore reads alone yielded a contiguous assembly (NG50 ∼3 Mb). We developed a protocol to generate ultra-long reads (N50 > 100 kb, read lengths up to 882 kb). Incorporating an additional 5× coverage of these ultra-long reads more than doubled the assembly contiguity (NG50 ∼6.4 Mb). The final assembled genome was 2,867 million bases in size, covering 85.8% of the reference. Assembly accuracy, after incorporating complementary short-read sequencing data, exceeded 99.8%. Ultra-long reads enabled assembly and phasing of the 4-Mb major histocompatibility complex (MHC) locus in its entirety, measurement of telomere repeat length, and closure of gaps in the reference human genome assembly GRCh38.
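The coverage and contiguity figures quoted in the abstract can be illustrated with a short calculation. The following is a minimal Python sketch, not the authors' pipeline: the function names are hypothetical, the ∼3.1 Gb genome size is the figure given in the Main text, and the contig lengths are made-up toy values used only to show how N50 and NG50 differ (N50 is measured against the total assembled length, NG50 against the expected genome size).

```python
def theoretical_coverage(total_bases, genome_size):
    """Average depth if the sequenced bases were spread evenly over the genome."""
    return total_bases / genome_size


def nx0(lengths, target_total, fraction=0.5):
    """Return the length L of the sequence at which the running sum of
    lengths, taken from longest to shortest, first reaches
    fraction * target_total. Returns 0 if that threshold is never reached."""
    threshold = fraction * target_total
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0


def n50(lengths):
    """N50: threshold is half of the total length of the sequences themselves."""
    return nx0(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: threshold is half of the expected genome size."""
    return nx0(lengths, genome_size)


# Figures from the text: 91.2 Gb of sequence data, ~3.1 Gb genome.
print(round(theoretical_coverage(91.2e9, 3.1e9), 1))  # 29.4, i.e. roughly 30x

# Hypothetical toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))        # 70: half of 290 is 145, reached at 80 + 70
print(ng50(toy_contigs, 400))  # 50: half of 400 is 200, reached at 80 + 70 + 50
```

Because NG50 is normalized by the genome size rather than by the assembly size, it allows assemblies of different total length to be compared on the same footing, which is presumably why the abstract reports assembly contiguity as NG50 and read lengths as N50.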
Main
The human genome is used as a yardstick to assess performance of DNA sequencing instruments1,2,3,4,5. Despite improvements in sequencing technology, assembling human genomes with high accuracy and completeness remains challenging. This is due to size (∼3.1 Gb), heterozygosity, regions of GC% bias, diverse repeat families, and segmental duplications (up to 1.7 Mbp in size) that make up at least 50% of the genome6. Even more challenging are the pericentromeric, centromeric, and acrocentric short arms of chromosomes, which contain satellite DNA and tandem repeats of 3–10 Mb in length7,8. Repetitive structures pose challenges for de novo assembly using “short read” sequencing technologies, such as Illumina's. Such data, while enabling highly accurate genotyping in non-repetitive regions, do not provide contiguous de novo assemblies. This limits the ability to reconstruct repetitive sequences, detect complex structural variation, and fully characterize the human genome.
Single-molecule sequencers, such as Pacific Biosciences' (PacBio), can produce read lengths of 10 kb or more, which makes de novo human genome assembly more tractable9. However, single-molecule sequencing reads have significantly higher error rates compared with Illumina sequencing. This has necessitated development of de novo assembly algorithms and the use of long noisy data in conjunction with accurate short reads to produce high-quality reference genomes10. In May 2014, the MinION nanopore sequencer was made available to early-access users11. Initially, the MinION nanopore sequencer was used to sequence and assemble microbial genomes or PCR products12,13,14 because the output was limited to 500 Mb to 2 Gb of sequenced bases. More recently, assemblies of eukaryotic genomes including yeasts, fungi, and Caenorhabditis elegans have been reported15,16,17.
Recent improvements to the protein pore (a laboratory-evolved Escherichia coli CsgG mutant named R9.4), library preparation techniques (1D ligation and 1D rapid), sequencing speed (450 bases/s), and control software have increased throughput, so we hypothesized that whole-genome sequencing (WGS) of a human genome might be feasible using only a MinION nanopore sequencer17,18,19.
We report sequencing and assembly of a reference human genome for GM12878 from the Utah/CEPH pedigree, using MinION R9.4 1D chemistry, including ultra-long reads up to 882 kb in length. GM12878 has been sequenced on a wide variety of platforms, and has well-validated variation call sets, which enabled us to benchmark our results20.
Miten Jain, Sergey Koren, […] Matthew Loose
Posted on 2020-09-30 21:08 by 王闯wangchuang2017