The advantages and disadvantages of short- and long-read metagenomics to infer bacterial and eukaryotic community composition
Abstract
Background The first step in understanding ecological community diversity and dynamics is quantifying community membership. Metagenomics is an increasingly common method for doing so. Because this approach is rapidly growing in popularity, a large number of computational tools and pipelines are available for analysing metagenomic data. However, most of these tools have been designed and benchmarked using highly accurate short-read data (i.e. Illumina), and few studies have benchmarked classification accuracy for long, error-prone reads (PacBio or Oxford Nanopore). In addition, few tools have been benchmarked for non-microbial communities.
Results Here we use simulated error-prone Oxford Nanopore and high-accuracy Illumina read sets to systematically investigate the effects of sequence length and taxon type on classification accuracy for metagenomic data from both microbial and non-microbial communities. We show that, in general, classification accuracy is far lower for non-microbial communities, even at low taxonomic resolution (e.g. family rather than genus).
Conclusions We then show that, for two popular taxonomic classifiers, long error-prone reads can significantly increase classification accuracy, and that this effect is most pronounced for non-microbial communities. This work provides insight into the expected accuracy of metagenomic analyses for different taxonomic groups, and establishes the point at which read length becomes more important than error rate for assigning the correct taxon.
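To make "classification accuracy at a given taxonomic resolution" concrete, here is a minimal sketch in Python. It is not the authors' benchmarking code: the function name `accuracy_by_rank`, the dictionary-based lineage format, and the toy reads are all assumptions made for illustration. It counts a read as correct at a rank only when the classifier assigned a taxon at that rank and it matches the known source taxon of the simulated read.

```python
from typing import Dict, List, Optional

# A lineage maps a rank name to a taxon name, or None if unassigned.
# This representation is an assumption for illustration only.
Lineage = Dict[str, Optional[str]]

RANKS = ["family", "genus"]


def accuracy_by_rank(truth: List[Lineage],
                     predicted: List[Lineage],
                     ranks: List[str] = RANKS) -> Dict[str, float]:
    """Fraction of reads whose predicted taxon matches the truth at each rank.

    Unclassified reads (None at a rank) are counted as incorrect.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    result: Dict[str, float] = {}
    for rank in ranks:
        correct = sum(
            1
            for t, p in zip(truth, predicted)
            if p.get(rank) is not None and p.get(rank) == t.get(rank)
        )
        result[rank] = correct / len(truth) if truth else 0.0
    return result


# Toy data: four simulated reads with known origin (hypothetical values).
truth = [
    {"family": "Enterobacteriaceae", "genus": "Escherichia"},
    {"family": "Enterobacteriaceae", "genus": "Salmonella"},
    {"family": "Felidae", "genus": "Panthera"},
    {"family": "Felidae", "genus": "Felis"},
]
predicted = [
    {"family": "Enterobacteriaceae", "genus": "Escherichia"},  # correct at both ranks
    {"family": "Enterobacteriaceae", "genus": "Escherichia"},  # family right, genus wrong
    {"family": "Felidae", "genus": None},                      # resolved to family only
    {"family": None, "genus": None},                           # unclassified
]

print(accuracy_by_rank(truth, predicted))
```

In this toy case three of four reads are correct at family level but only one at genus level, which illustrates why accuracy is usually reported separately for each rank.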
posted on 2020-09-30 20:57 王闯wangchuang2017