Abstract

Motivation

Nanopore long-read sequencing technology offers promising alternatives to high-throughput short-read sequencing, especially in the context of RNA-sequencing. However, this technology is currently hindered by high error rates in the output data, which affect analyses such as the identification of isoforms, exon boundaries and open reading frames, and the creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed, and options for the error correction of Nanopore RNA-sequencing long reads remain limited.

Results

In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection. We find that long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type.

Introduction

The most commonly used technique to study transcriptomes is RNA-sequencing. As such, many tools have been developed to process short (e.g. Illumina) RNA-seq reads. Assembling a transcriptome from short reads is a central task for which many methods are available. When a reference genome or reference transcriptome is available, reference-based assemblers can be used (such as Cufflinks [1], Scallop [2], Scripture [3] and StringTie [4]). When no reference is available, de novo transcriptome assembly can be performed (using tools such as Oases [5], SOAPdenovo-Trans [6], Trans-ABySS [7] and Trinity [8]). Potential disadvantages of reference-based strategies include the following:

(i) the resulting assemblies might be biased toward the used reference, and true variations might be discarded in favour of known isoforms;

(ii) they are unsuitable for samples with a partial or missing reference genome [8];

(iii) such methods depend on correct read-to-reference alignment, a task that is complicated by splicing, sequencing errors, polyploidy, multiple read mapping, mismatches caused by genome variation and the lack or incompleteness of many reference genomes [7, 8];

and (iv) sometimes, the model being studied is sufficiently different from the reference because it comes from a different strain or line such that the mappings are not altogether reliable [5].

On the other hand, some of the shortcomings of de novo transcriptome assemblers are as follows:

(i) low-abundance transcripts are likely to not be fully assembled [9];

(ii) reconstruction heuristics are usually employed, which may lead to missing alternative transcripts, and highly similar transcripts are likely to be assembled into a single transcript [10];

(iii) homologous or repetitive regions may result in incomplete assemblies [11];

and (iv) accuracy of transcript assembly is called into question when a gene exhibits complex isoform expression [11].

Recent advances in long-read sequencing technology have enabled longer, up to full-length, sequencing of RNA molecules. This new approach has the potential to eliminate the need for transcriptome assembly, and thus to remove from transcriptome analysis pipelines all the biases introduced by the assembly step. Long-read sequencing can be done using either cDNA-based or direct RNA protocols from Oxford Nanopore (referred to as ‘ONT’ or ‘Nanopore’) and Pacific Biosciences (PacBio). The Iso-Seq protocol from PacBio consists of a size selection step, sequencing of cDNAs and finally a set of computational steps that produce sequences of full-length transcripts. ONT has three different experimental protocols for sequencing RNA molecules: cDNA transformation with amplification, direct cDNA (with or without amplification) and direct RNA.

Long-read sequencing is increasingly used in transcriptome studies, not just to prevent problems caused by short-read transcriptome assembly, but also for several of the following reasons. Mainly, long reads can better describe exon/intron combinations [12]. The Iso-Seq protocol has been used for isoform identification, including transcripts identification [13], de novo isoform discovery [14] and fusion transcript detection [15]. Nanopore has recently been used for isoform identification [16] and quantification [17].

The sequencing throughput of long-read technologies has increased significantly over the years. It is now conceivable to sequence a full eukaryote transcriptome using either long reads only, or a combination of high-coverage long and short (Illumina) reads. Unlike the Iso-Seq protocol, which requires extensive in silico processing prior to primary analysis [18], raw Nanopore reads can in principle be readily analysed. Direct RNA reads also permit the analysis of base modifications [19], unlike all other cDNA-based sequencing technologies. There also exist circular sequencing techniques for Nanopore, such as INC-Seq [20], that aim at reducing error rates at the expense of a special library preparation. With raw long reads, it is up to the primary analysis software (typically a mapping algorithm) to deal with sequences that have a significant per-base error rate, currently around 13% [21].

In principle, a high error rate in the data complicates the analysis of transcriptomes, especially for the accurate detection of exon boundaries or the quantification of similar isoforms and paralogous genes. Reads need to be aligned unambiguously and with high base pair accuracy to either a reference genome or transcriptome. Indels (i.e. insertions/deletions) are the main type of errors produced by long-read technologies, and they confuse aligners more than substitution errors [22]. Many methods have been developed to correct errors in RNA-seq reads, mainly in the short-read era [23, 24]. They no longer apply to long reads because they were developed to deal with low error rates, and principally substitutions. However, a new set of methods has been proposed to correct genomic long reads. There exist two types of long-read error correction algorithms: those using information from long reads only (‘self’ or ‘non-hybrid’ correction), and those using short reads to correct long reads (‘hybrid’ correction). In this article, we report the extent to which state-of-the-art tools can correct long, noisy RNA-seq reads produced by Nanopore sequencers.

Several tools exist for error-correcting long reads, including ONT reads. Even though the error profiles of Nanopore and PacBio reads differ, their error rates are quite similar, and it is reasonable to expect tools originally designed for PacBio data to also perform well on recent Nanopore data. To the best of our knowledge, there is very little prior work that specifically addresses error correction of RNA-seq long reads. Notable exceptions include the following: (i) LSC [25], which is designed to error correct PacBio RNA-seq long reads using Illumina RNA-seq short reads; and (ii) PBcR [26] and (iii) HALC [27], which are mainly designed for genomes but are also evaluated on transcriptomic data. Here we take the standpoint of evaluating long-read error correction tools on RNA-seq data, most of which were designed to process DNA sequencing data only.

We evaluate the following DNA hybrid correction tools: HALC [27], LoRDEC [28], NaS [29], PBcR [26] and proovread [30]; and the following DNA self-correction tools: Canu [31], daccord [32], LoRMA [33], MECAT [34] and pbdagcon [35]. We also evaluate an additional hybrid tool, LSC [25], the only one specifically designed to error correct (PacBio) RNA-seq long reads. A majority of hybrid correction methods employ mapping strategies to place short fragments on long reads and correct long read regions using the related short-read sequences. But some of them rely on graphs to create a consensus that is used for correction. These graphs are either k-mer graphs (de Bruijn graphs) or nucleotide graphs resulting from multiple alignments of sequences (partial order alignment). For self-correction methods, strategies using the aforementioned graphs are the most common. We have also considered evaluating nanocorrect [36], nanopolish [36], Falcon_sense [37] and LSCPlus [38], but some tools were deprecated, not suitable for read correction or unavailable. Our detailed justifications can be found in Section S1.12 of the Supplementary Data. We have selected what we believe is a representative set of tools but there also exist other tools that were not considered in this study, e.g. HG-Color [39], HECIL [40], MIRCA [41], Jabba [42], nanocorr [43] and Racon [44]. 
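
Several of the graph-based correctors above (e.g. LoRDEC, and LoRMA for self-correction) share one core idea: build a de Bruijn graph from accurate sequences, then bridge an erroneous long-read region by a path between its two flanking solid k-mers. Below is a minimal sketch of that idea under toy inputs; `build_dbg` and `correct_region` are illustrative names, not any tool's actual API, and real tools use far more elaborate search and scoring.

```python
from collections import defaultdict

def build_dbg(short_reads, k):
    """de Bruijn graph over the k-mers of the short reads: an edge links
    two k-mers that overlap by k-1 bases and occur consecutively."""
    edges = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k):
            edges[read[i:i + k]].add(read[i + 1:i + 1 + k])
    return edges

def correct_region(edges, left_kmer, right_kmer, max_len):
    """Depth-first search for a path from the solid k-mer flanking an
    erroneous long-read region on the left to the solid k-mer flanking
    it on the right; the spelled path replaces the region."""
    stack = [(left_kmer, left_kmer)]
    while stack:
        kmer, spelled = stack.pop()
        if kmer == right_kmer:
            return spelled
        if len(spelled) >= max_len:
            continue  # bound the search to avoid cycling forever
        for nxt in edges[kmer]:
            stack.append((nxt, spelled + nxt[-1]))
    return None  # no bridging path: leave the region uncorrected
```

When no path of plausible length exists, the region is left uncorrected, which is one reason split outputs (keeping only well-corrected fragments) reach lower error rates than full-length outputs.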

Table 1

Main characteristics of the error correction tools considered in this study

(A) Hybrid tools

HALC [27]
Context: RNA or DNA. Technology: PacBio.
Main algorithmic idea: Aligns short-read contigs to long reads. HALC implements a strategy to disambiguate multiple alignments instead of avoiding them, and uses a short-read contig graph in order to select the best set of contigs to correct a long-read region.

LoRDEC [28]
Context: DNA. Technology: PacBio or ONT.
Main algorithmic idea: Builds a de Bruijn graph (dBG) from the short reads, then searches for paths between solid k-mers in the long reads; the regions between k-mers are corrected with the optimal path.

LSC [25]
Context: RNA. Technology: PacBio.
Main algorithmic idea: Applies homopolymer compression to the short and long reads to increase alignment recall, recruits short reads onto the long-read sequences, corrects errors from the short-read sequences and uses homopolymers from the short reads to replace those in the long reads.

NaS [29]
Context: DNA. Technology: ONT.
Main algorithmic idea: Long reads are used as templates to recruit short (seed) reads through alignment. The set of seeds is extended by searching, using shared k-mers, for similar reads in the initial read set. A micro-assembly of these reads is performed; the resulting contigs are aligned back to the input long reads, and a path in the contig graph is used as the corrected read.

PBcR [26]
Context: RNA or DNA. Technology: PacBio or ONT.
Main algorithmic idea: Aligns short reads to long reads, builds a multiple alignment of the short reads recruited to each region, and derives a consensus from the multiple alignment.

Proovread [30]
Context: DNA. Technology: PacBio.
Main algorithmic idea: Aligns short reads to long reads and calls a consensus, using a specific scoring system to adapt the mapping to the high error rates.

(B) Non-hybrid tools

Canu [31]
Context: DNA. Technology: PacBio or ONT.
Main algorithmic idea: First, all-versus-all read overlapping and alignment filtering. A directed acyclic graph is built from the alignments, producing quasi-multiple alignments; highest-weight path searches in these graphs yield consensus sequences that are used for correction.

daccord [32]
Context: DNA. Technology: PacBio.
Main algorithmic idea: Reads are compared pairwise to obtain alignment piles. Several overlapping windows of hundreds of nucleotides are derived from these piles, on which micro-assembly (a dBG with very small k-mers) is performed. The heaviest path in each dBG is heuristically selected as the consensus to correct the given window.

LoRMA [33]
Context: DNA. Technology: PacBio or ONT.
Main algorithmic idea: Path search in a dBG built from the long reads, iterating over increasing k-mer sizes for graph construction. The same framework as LoRDEC is used to correct read regions.

MECAT [34]
Context: DNA. Technology: PacBio or ONT.
Main algorithmic idea: k-mer-based read matching, pairwise alignment between matched reads, alignment-based consensus calling on ‘easy’ regions and local consensus calling (partial order graph) otherwise.

pbdagcon [35]
Context: DNA. Technology: PacBio.
Main algorithmic idea: Aligns long reads to ‘backbone’ sequences, then corrects by iterative directed acyclic graph consensus calling from the multiple sequence alignments.
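
The homopolymer compression step that LSC applies before alignment (Table 1) can be sketched in a few lines: each run of identical bases is collapsed to one base, and the run lengths are kept so the compression can be inverted after alignment. `hp_compress` and `hp_expand` are illustrative names; a real implementation also tracks coordinate maps so alignments can be lifted back to the original reads.

```python
from itertools import groupby

def hp_compress(seq):
    """Collapse each run of identical bases to a single base and record
    the run lengths, so that the compression is reversible."""
    bases, runs = [], []
    for base, group in groupby(seq):
        bases.append(base)
        runs.append(len(list(group)))
    return "".join(bases), runs

def hp_expand(bases, runs):
    """Invert hp_compress: re-expand each base to its recorded run."""
    return "".join(b * n for b, n in zip(bases, runs))
```

Compressing both read sets removes the homopolymer-length disagreements that dominate Nanopore/PacBio indel errors, which is why it increases alignment recall.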

Other works have evaluated error correction tools in the context of DNA sequencing.

LRCstats [45] and, more recently, ELECTOR [46] provide automated evaluations of genomic long-read correction using a simulated framework. A technical report [47] provides an extensive evaluation of PacBio/Nanopore error correction tools in the context of de novo assembly. This analysis is complemented by more recent results on hybrid correction methods in [48]. Perhaps the closest work to ours is the AlignQC software [21], which provides a set of metrics for evaluating the quality of RNA-sequencing long-read datasets. In [21], a comparison is provided between Nanopore and PacBio RNA-sequencing datasets in terms of error patterns, isoform identification and quantification. While [21] did not compare error correction tools, we will use and extend AlignQC metrics for that purpose.

In this article, we will focus on the qualitative and quantitative measurements of Nanopore error-corrected long reads, with transcriptomic features in mind. First we examine basic metrics of error correction, e.g. mean length, base accuracy, homopolymers errors and performance (running time, memory) of the tools. Then we ask several questions that are specific to transcriptome applications: (i) How is the number of detected genes, and more precisely the number of genes within a gene family, impacted by read error correction? (ii) Can error correction significantly change the number of reads mapping to genes or transcripts, possibly affecting downstream analysis based on these metrics? (iii) Do error correction tools perturb isoform diversity, e.g. by having a correction bias toward the major isoform? (iv) What is the impact of error correction on identifying splice sites? To answer these questions, we provide an automatic framework (LC_EC_analyser, Methods) for the evaluation of transcriptomic error correction methods, that we apply to 11 different error correction tools.

Results

Error correction tools

Table 1 presents the main characteristics of the hybrid and non-hybrid error correction tools that were considered in this study. For the sake of reproducibility, all versions, dependencies and parameters are described in Supplementary Data Section S1. Note that all of these error correction tools were tailored for DNA-seq data, except for LSC.

The output of each error correction method can be classified into one of the four following types: full-length, trimmed, split and micro-assembly. Usually, due to methodological reasons, the extremities of long reads are harder to correct. For example, hybrid correctors based on mapping short reads to long reads, and calling a consensus from the mapping, have difficulty aligning short reads to the extremities of long reads. As such, some methods output ‘trimmed error-corrected reads’, i.e. error-corrected reads with their uncorrected ends removed. Methods considered in this study that produce this type of output are HALC, LoRDEC, LSC, proovread, daccord and pbdagcon. Sometimes, internal parts of long reads can also be hard to correct, due to a lack of short-read coverage, a drop in sequencing quality or mapping issues. Some algorithms thus output ‘split error-corrected reads’, splitting one long read into several well-corrected fragments; these include HALC, LoRDEC, PBcR and LoRMA. Finally, some tools neither trim nor split the original reads, outputting ‘full-length error-corrected reads’; examples include HALC, LoRDEC, LSC, proovread, Canu, daccord, MECAT and pbdagcon. NaS does not fit the previous three categories, as it uses a micro-assembly strategy instead of a classical polishing of the consensus: the long read is used as a template to recruit Illumina reads and, through a local assembly, build a high-quality synthetic read. As can be noted, some tools produce more than one type of output, sometimes three.

In the following sections, we add the suffixes ‘(t)’ and ‘(s)’ to a tool name to denote its trimmed and split outputs, respectively. We further add the suffix ‘(μ)’ to NaS, as a reminder that it is based on a micro-assembly strategy. Outputs without suffixes are full-length corrections. For example, HALC denotes the HALC full-length error-corrected reads; HALC(t), the HALC trimmed output; and HALC(s), the HALC split output. As we will see, no type of output outperforms all the others in all metrics; choosing the appropriate type of output is heavily dependent on the application.
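
The difference between trimmed and split outputs can be illustrated with a small sketch, assuming the convention (used, e.g., by LoRDEC) that corrected bases are written in uppercase and uncorrected bases in lowercase; `trim_read` and `split_read` are illustrative names, not functions of any of the evaluated tools.

```python
import re

def trim_read(read):
    """Trimmed output: drop only the uncorrected (lowercase) ends,
    keeping any internal uncorrected stretches in place."""
    return read.strip("acgtn")

def split_read(read):
    """Split output: keep each maximal well-corrected (uppercase) run
    as a separate fragment, discarding uncorrected stretches."""
    return re.findall(r"[ACGTN]+", read)
```

For the read `acACGTacgGTCAca`, trimming yields `ACGTacgGTCA` (one read, internal lowercase kept), while splitting yields two fragments, `ACGT` and `GTCA` — which is why split outputs have lower error rates but more, shorter reads.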

Evaluation datasets

Our main evaluation dataset consists of a single 1D Nanopore run using the cDNA preparation kit on RNA material taken from a mouse brain, containing 740 776 long reads. An additional Illumina dataset containing 58 million paired-end 151 bp reads was generated on the same sample but using a different cDNA protocol. For more details on the sequencing protocol, see the Methods section. The Nanopore and Illumina reads from the mouse RNA sample are available in the ENA repository under the following study: PRJEB25574. In this paper, we provide a detailed analysis of this dataset, from the ‘Error correction improves base accuracy and splits, trims or entirely removes reads’ section to the ‘Using a different read aligner mildly but not significantly affects the evaluation’ section.

Table 2

Statistics of error correction tools on the 1D run RNA-seq dataset. To facilitate the readability of this table and the next ones, we highlighted values that we deemed satisfactory in green colour, borderline in brown and unsatisfactory in red, noting that such classification is somewhat arbitrary.

(A) Hybrid tools
  Raw  HALC  HALC(t)  HALC(s)  LoRDEC  LoRDEC(t)  LoRDEC(s)  LSC  LSC(t)  NaS(μ)  PBcR(s)  Proovread  Proovread(t)
nb reads  741 k  741 k  709 k  914 k  741 k  677 k  1388 k  619 k  619 k  619 k  1321 k  738 k  626 k 
mapped reads  83.5%  88.1%  95.6%  98.8%  85.5%  95.5%  97.5%  97.1%  97.6%  98.7%  99.2%  85.5%  98.9% 
mean length  2011  2174  1926  1378  2097  1953  816  2212  1901  1931  776  2117  1796 
nb bases  1313 M  1469 M  1334 M  1245 M  1394 M  1289 M  1106 M  1332 M  1151 M  1179 M  1015 M  1400 M  1112 M 
mapped bases (a)  89.0%  90.3%  96.6%  99.2%  90.6%  95.9%  99.1%  90.9%  97.7%  97.5%  99.2%  92.4%  99.5% 
error rate (b)  13.72%  1.85%  1.32%  0.44%  4.5%  3.73%  1.59%  5.45%  4.36%  0.38%  0.68%  2.65%  0.33% 
nb detected genes  16.8 k  17.0 k  16.8 k  16.6 k  16.8 k  16.6 k  16.5 k  16.5 k  16.2 k  15.0 k  15.6 k  16.6 k  14.6 k 
(B) Non-hybrid tools 
  Raw  Canu  daccord  daccord(t)  LoRMA(s)  MECAT  pbdagcon  pbdagcon(t) 
nb reads  741 k  519 k  675 k  840 k  1540 k  495 k  778 k  775 k 
mapped reads  83.5%  99.1%  92.5%  94.0%  99.4%  99.4%  98.2%  97.9% 
mean length  2011  2192  2002  1476  497  1992  1473  1484 
nb of bases  1313 M  1125 M  1350 M  1212 M  760 M  980 M  1136 M  1141 M 
mapped bases (a)  89.0%  92.0%  92.5%  94.7%  99.2%  96.9%  97.0%  96.7% 
error rate (b)  13.72%  6.43%  5.2%  4.12%  2.91%  4.57%  5.65%  5.71% 
nb detected genes  16.8 k  12.4 k  15.5 k  13.9 k  6.8 k  10.4 k  13.2 k  13.2 k 

(a) As reported by AlignQC: percentage of bases aligned among mapped reads, computed by counting the M parts of the CIGAR strings in the BAM file; bases in unmapped reads are not counted. (b) As reported by AlignQC, using a sample of 1 million bases from aligned read segments.
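
The ‘mapped bases’ definition in the footnote can be read, in simplified stand-alone form, as a ratio of query-consuming CIGAR operations. The sketch below parses plain CIGAR strings with the standard library only; `mapped_base_fraction` is an illustrative name, not AlignQC's code, and AlignQC itself works directly on BAM records.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def mapped_base_fraction(cigars):
    """Fraction of aligned bases (CIGAR M, =, X) among all read bases of
    the mapped reads. Read bases are the query-consuming operations:
    M, I, S, = and X; D, N, H and P consume no query bases."""
    aligned = total = 0
    for cigar in cigars:
        for length, op in CIGAR_OP.findall(cigar):
            length = int(length)
            if op in "M=X":
                aligned += length
            if op in "MIS=X":
                total += length
    return aligned / total if total else 0.0
```

For example, the three CIGARs `10M`, `5M2I3M` and `4S6M` give 24 aligned bases out of 30 read bases, i.e. 80% mapped bases.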

In order to obtain a more comprehensive understanding of the evaluated tools, we further analysed the correction of the methods on a human Nanopore direct RNA-sequencing dataset from the Nanopore-WGS-Consortium (dataset from centre Bham, run#1, sample type RNA, kit SQK-RNA001, pore R9.4, available at https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-transcriptome/fastq_fast5_bulk.md). We concatenated the fail and pass RNA-direct reads from this dataset, obtaining 894 289 reads. Further, to correctly run all tools, we converted U bases into T.

Error correction improves base accuracy and splits, trims or entirely removes reads

Table 2 shows an evaluation of error correction based on AlignQC results, for the hybrid and non-hybrid tools. The error rate is 13.72% in raw reads, 0.33–5.45% for reads corrected using hybrid methods and 2.91–6.43% with self-correctors. Notably, the hybrid tools NaS(μ), Proovread(t) and HALC(s) output micro-assembled, trimmed and split error-corrected reads, respectively, with the lowest error rates (<0.5%). We observe that HALC produced the full-length error-corrected reads with the lowest error rate (1.85%), which is still significantly higher than the error rate of the three aforementioned methods. This is expected, as micro-assembling, trimming or splitting reads usually discards badly corrected regions of the reads, lowering the error rate. LoRMA(s), the only split self-correction tool, decreased the error rate the most among non-hybrid tools, but still only reached 2.91%, one order of magnitude higher than the best hybrid correctors. If we look at non-split outputs among the self-correctors, MECAT and daccord(t) obtained the lowest error rates for full-length and trimmed error-corrected reads, respectively, but still with error rates above 4%. It is no surprise that the best error correctors are hybrid when looking at error rates, given their use of additional high-quality Illumina reads. As expected, trimming and splitting error-corrected reads greatly reduces the error rate, e.g. LoRDEC reduced the error rate from 4.5% to 3.73% by trimming, and to 1.59% by splitting. As such, the split outputs consistently outperformed trimmed and full-length outputs regarding the error rate. A detailed error rate analysis is carried out in the ‘Detailed error-rate analysis’ section.

In terms of throughput after the correction step, tools that neither trim nor split reads tend to return a number of reads similar to that of the uncorrected (raw) reads. Notably, HALC and LoRDEC returned exactly the same number of reads, and Proovread returned just 3 k fewer reads. On the other hand, Canu and MECAT greatly decreased the number of output reads, probably due to post-filtering procedures. Moreover, many of the full-length outputs (HALC, LoRDEC, LSC, Proovread and daccord) increased the mean length of the raw reads while also increasing the number of output bases, showing that they tend to extend the information contained in the long reads.

Trimming almost always decreased the number of output reads, as in HALC(t), LoRDEC(t), Proovread(t) and pbdagcon(t), probably due to post-filtering procedures. However, in LSC(t), trimming had no effect on the number of reads, and in daccord(t), trimming actually increased the number of reads. In half of the trimmed outputs (HALC(t), LoRDEC(t) and LSC(t)), the mean length of the reads was largely preserved, decreasing only by around 100 bp. In the other half (Proovread(t), daccord(t) and pbdagcon(t)), read lengths were, on average, reduced by 200–500 bp.

Splitting reads significantly increased the number of output reads, as expected. LoRDEC(s), PBcR(s) and LoRMA(s) tend to split reads into two or more shorter reads during the correction step, as they return ∼2× more reads after correction that are also shorter (mean lengths of, respectively, 816, 776 and 497 bp, versus 2011 bp in raw reads) and overall have significantly fewer bases in total (losses of, respectively, 207, 298 and 553 Mbp). HALC(s), on the other hand, increased the number of reads by only 23%, with no significant loss of bases, but still with a significant reduction in mean length (1378 bp). We observe that NaS(μ), based on micro-assembly, obtained a mean length similar to the trimmed outputs (HALC(t), LoRDEC(t) and LSC(t)). This suggests that NaS(μ) has trouble either getting seed short reads, recruiting short reads mapping to the ends of long reads, or assembling the reads mapped to those ends.

These observations indicate that care should be taken when considering which type of output should be used. For example, all split and half of the trimmed outputs should not be used in applications trying to describe the full transcript structure, or distant exons coupling, as the long read connectivity is lost in many cases in these types of outputs.

Overall, no correction tool outperforms all the others across the metrics analysed in this section. However, hybrid correctors are systematically better than self-correctors at decreasing the error rate (and preserving the transcriptome diversity, as we will discuss in the next section). Trimming and splitting usually increase the read accuracy (and also mapping rate, as we see next), but decrease the total amount of bases in the read set and the mean read length, which can lead to loss of long-range information that was present in the raw reads.

Error correction facilitates mapping yet generally lowers the number of detected genes

Apart from HALC, LoRDEC, Proovread and daccord, for which only 85–92% of reads were mapped, corrected reads from all the other tools were mapped at a rate of 94–99%, a significant improvement over raw reads (mapping rate of 83.5%). We observe that these four tools with the lowest percentages of mapped reads had high mean read lengths, indicating that trimming, splitting or discarding reads seems necessary in order to obtain shorter but overall less error-prone reads. In general, for all tools (except pbdagcon), trimming and splitting increased the proportion of mapped reads and bases, sometimes significantly (e.g. Proovread). However, some tools that do not offer the option to trim or split reads, such as Canu and MECAT, showed very high mapping rates along with high mean read lengths and error rates. This is related to their aggressive post-filtering measures, which removed a significant portion of the reads (29–33%).

When verifying whether error correctors are able to preserve transcriptome diversity, we see a striking difference between hybrid and self-correctors: in general, hybrid correctors detect far more genes than the self-correcting ones. Interestingly, HALC was even able to increase the number of detected genes by 221 relative to the raw reads, indicating that some genes may previously have gone undetected due to imperfect mapping caused by the high error rate. We also found that 72 genes were detected in the raw reads but not in any of the error-corrected outputs. Furthermore, 354 genes are absent from the results of nearly all correction methods (≥16 out of 19).

Overall, all hybrid tools presented a satisfactory number of detected genes, except for NaS(μ), PBcR(s) and Proovread(t), while self-correctors did not present any satisfactory results, with Canu, LoRMA(s) and MECAT reducing the number of detected genes reported in raw reads by 35–59%. We can also note that trimming and splitting systematically resulted in a loss of the sensitivity to detect new genes. Moreover, except for HALC(s), the tools with a very high percentage of mapped reads (NaS(μ), PBcR(s), Proovread(t), Canu, LoRMA(s), MECAT, pbdagcon and pbdagcon(t)) had the largest losses in number of detected genes, hinting that error correction can reduce gene diversity in favour of a lower error rate, and/or that clusters of similar genes (e.g. paralogs) are corrected toward a single gene. Therefore, if preserving transcriptome diversity is required for the downstream application, self-correctors should be avoided altogether, along with some hybrid correctors (NaS(μ), PBcR(s) and Proovread(t)).


Detailed error rate analysis

The high error rate of transcriptomic long reads significantly complicates their primary analysis [49]. While the 'Error correction improves base accuracy and splits, trims, or entirely removes reads' section presented a general per-base error rate, this section breaks down sequencing errors into several types and examines how each error correction tool deals with them. A general, and expected, trend found across all tools and all types of errors is that trimming and splitting the reads result in fewer substitution, deletion and insertion errors. We will therefore focus on other aspects in this analysis. The data presented here is a compilation of AlignQC results. Note that AlignQC computed the following metrics only on reads that could be aligned; unaligned reads are thus not counted, even though they may well be the most erroneous ones. AlignQC also subsampled the aligned reads to around 1 million bases to calculate the presented values.


Deletions are the most problematic sequencing errors

Table 3 shows the error rate in the raw reads and in the corrected reads for each tool. In raw reads, deletions are the most prevalent type of error (7.41% of bases), closely followed by substitutions (5.11%), then insertions (1.2%). LoRDEC, LSC and LSC(t) are the least capable of correcting mismatches (>2% of them remaining), even though they are all hybrid tools. For LoRDEC, we were able to verify that this is related to the large amount of uncorrected reads in its output (90 k totally uncorrected reads out of 741 k, i.e. 12%), as computed by exactly matching raw reads to its corrected output. For LSC and LSC(t), we were unable to pinpoint a reason. The majority of the other hybrid tools (HALC, HALC(t), HALC(s), NaS(μ), PBcR(s), Proovread and Proovread(t)) result in less than 1% of substitution errors. Surprisingly, the non-hybrid tools also presented very low mismatch rates; all of them showed rates lower than 1%, except for Canu (1.33%) and daccord (1.1%). This suggests that the rate of systematic substitution errors in ONT data is low, as self-correctors were able to achieve results comparable to the hybrid ones, even without access to Illumina reads. Still, the three best performing tools were all hybrid (NaS(μ), PBcR(s) and Proovread(t)), which should therefore be preferred for applications that require very low mismatch rates.
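
The "exactly matching raw reads to its corrected output" check used for LoRDEC above can be sketched as a simple set membership test (the sequences here are hypothetical; a real run would stream the FASTA/FASTQ files):

```python
def count_uncorrected(raw_reads, corrected_reads):
    """Number of raw read sequences that appear verbatim in the corrected
    output, i.e. reads the tool left completely untouched."""
    corrected = set(corrected_reads)
    return sum(1 for seq in raw_reads if seq in corrected)

raw = ["ACGTACGT", "TTTTCCCC", "GGGAAATT"]
out = ["ACGTACGT", "TTTTCACC", "GGGAAAT"]  # only the first read is unchanged
print(count_uncorrected(raw, out))  # 1
```

On the real dataset this test reported 90 k of 741 k reads returned unchanged by LoRDEC, explaining its residual mismatch rate.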


  
Table 3

Error rate in the raw reads and in the corrected reads for each tool, on the 1D run RNA-seq dataset, computed from 1M random aligned bases.

(A) Hybrid tools
  Raw  HALC  HALC(t)  HALC(s)  LoRDEC  LoRDEC(t)  LoRDEC(s)  LSC  LSC(t)  NaS(μ)  PBcR(s)  Proovread  Proovread(t) 
Error rate  13.72%  1.85%  1.32%  0.44%  4.5%  3.73%  1.59%  5.45%  4.36%  0.38%  0.68%  2.65%  0.33% 
Mismatch  5.11%  0.79%  0.54%  0.22%  2.04%  1.76%  1.13%  2.35%  2.01%  0.2%  0.18%  0.93%  0.13% 
Deletion  7.41%  0.85%  0.64%  0.17%  2.15%  1.73%  0.39%  2.64%  1.94%  0.09%  0.3%  1.51%  0.18% 
Insertion  1.2%  0.21%  0.14%  0.05%  0.32%  0.24%  0.07%  0.47%  0.4%  0.08%  0.19%  0.22%  0.03% 
(B) Non-hybrid tools 
  Raw  Canu  daccord  daccord(t)  LoRMA(s)  MECAT  pbdagcon  pbdagcon(t) 
Error rate  13.72%  6.43%  5.2%  4.12%  2.91%  4.57%  5.65%  5.71% 
Mismatch  5.11%  1.33%  1.1%  0.67%  0.37%  0.33%  0.49%  0.49% 
Deletion  7.40%  4.82%  3.82%  3.27%  2.51%  4.18%  5.06%  5.17% 
Insertion  1.20%  0.28%  0.28%  0.19%  0.04%  0.06%  0.09%  0.05% 

The contrast between self and hybrid tools is more visible on deletion errors. In general, all hybrid tools outperformed the non-hybrid ones (the only exception is LSC (2.64%), with a higher deletion error rate than LoRMA(s) (2.51%)). Although among the hybrid ones LoRDEC (2.15%), LSC (2.64%), LSC(t) (1.94%) and Proovread (1.51%) still showed moderate rates of deletions, all the other seven corrected outputs were able to lower the deletion error rate from 7.4% to less than 1%; notably, HALC(s) and Proovread(t) lowered it to less than 0.2%, and NaS(μ) to less than 0.1%. All non-hybrid tools presented a high rate (3% or more) of deletion errors, except LoRMA(s) (2.51%). This comparison suggests that ONT reads exhibit systematic deletions, which cannot be well corrected without the help of Illumina data. The contribution of homopolymer errors will be specifically analysed in the 'Homopolymer insertions are overall better corrected than deletions' section. Considering insertion errors, all tools performed equally well. It is worth noting that several hybrid (HALC(s), LoRDEC(s), NaS(μ) and Proovread(t)) and non-hybrid tools (LoRMA(s), MECAT, pbdagcon and pbdagcon(t)) achieved sub-0.1% insertion error rates.

Overall, hybrid tools outperformed non-hybrid ones in terms of error rate reduction. However, the similar results obtained by both types of tools when correcting mismatches and insertions, and the contrast in correcting deletions, seem to indicate that the main advantage of hybrid correctors over self-correctors is the removal of systematic errors using Illumina data.

Homopolymer insertions are overall better corrected than deletions

In this section we further analyse homopolymer indels, i.e. insertion or deletion errors consisting of a stretch of the same nucleotide. Table 4 shows that homopolymer deletions are an order of magnitude more abundant in raw reads than homopolymer insertions. It is worth noting that, comparing the values for the raw reads in Tables 3 and 4, homopolymers are involved in 40% of all deletions and 31% of all insertions.
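
A minimal sketch of one way to flag an indel as homopolymeric (a simplified heuristic, not the exact criterion used by AlignQC): the inserted or deleted bases must be a run of a single nucleotide that extends an identical run adjacent to the event.

```python
def is_homopolymer_indel(indel_seq, left_flank, right_flank):
    """True if the indel is a stretch of one nucleotide that extends an
    identical run next to it (e.g. deleting 'AA' right after '...GA')."""
    bases = set(indel_seq)
    if len(bases) != 1:            # mixed bases: not a homopolymer event
        return False
    b = bases.pop()
    return left_flank.endswith(b) or right_flank.startswith(b)

print(is_homopolymer_indel("AA", "CCGA", "TTG"))  # True: extends the 'A' run
print(is_homopolymer_indel("AG", "CCA", "TTG"))   # False: two distinct bases
print(is_homopolymer_indel("T", "CCA", "GGG"))    # False: no adjacent 'T'
```

Under this kind of criterion, the deletion of one base inside `AAAA` counts as a homopolymer deletion, whereas deleting an isolated `T` between `C` and `G` does not.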

 
Table 4

Homopolymer error rate in the raw reads and in the corrected reads for each tool, on the 1D run RNA-seq dataset, computed from 1M random aligned bases.

(A) Hybrid tools
  Raw  HALC  HALC(t)  HALC(s)  LoRDEC  LoRDEC(t)  LoRDEC(s)  LSC  LSC(t)  NaS(μ)  PBcR(s)  Proovread  Proovread(t) 
Homop. deletion  2.96%  0.28%  0.19%  0.03%  0.77%  0.63%  0.19%  0.62%  0.42%  0.02%  0.1%  0.46%  0.04% 
Homop. insertion  0.38%  0.05%  0.03%  0.01%  0.09%  0.07%  0.02%  0.11%  0.09%  0.01%  0.02%  0.06%  0.01% 
(B) Non-hybrid tools 
  Raw  Canu  daccord  daccord(t)  LoRMA(s)  MECAT  pbdagcon  pbdagcon(t) 
Homop. deletion  2.96%  2.46%  2.14%  2.05%  1.82%  2.09%  2.26%  2.27% 
Homop. insertion  0.38%  0.08%  0.06%  0.03%  0.01%  0.01%  0.02%  0.01% 

A closer look at Table 4 reveals that hybrid error correctors outperform non-hybrid ones, as expected, mainly because homopolymer indels are likely systematic errors in ONT sequencing; hybrid correctors fix them using Illumina reads, which do not contain such biases. Moreover, all tools performed well on correcting homopolymer insertions, reducing the rate from 0.38% to less than 0.11%. In particular, the hybrid tools HALC(s), NaS(μ) and Proovread(t), as well as the non-hybrid ones LoRMA(s), MECAT and pbdagcon(t), reached a 0.01% homopolymer insertion error rate. Regarding homopolymer deletions, the majority of hybrid tools returned less than 0.5% of them, except LoRDEC (0.77%), LoRDEC(t) (0.63%) and LSC (0.62%). Notably, HALC(s), NaS(μ) and Proovread(t) presented less than 0.05% homopolymer deletion error rates. Non-hybrid tools performed more poorly, returning 1.8–2.4% of homopolymer deletion errors, only a small improvement over the raw reads.

HALC(s), NaS(μ) and Proovread(t) showed the best reduction of homopolymer indels. It is also worth noting that hybrid correctors are able to correct homopolymer deletions even better than non-homopolymer deletions. For instance, the ratio of homopolymer deletions over all deletions is 39.9% in raw reads; it decreases for all hybrid correctors except LoRDEC(s), dropping to 17.6% for HALC(s) and 22.2% for NaS(μ) and Proovread(t), but increases to at least 43.9% (pbdagcon(t)) and up to 72.5% (LoRMA(s)) in non-hybrid tools (see Supplementary Data Section S2).

Error correction perturbs the number of reads mapping to the genes and transcripts

Downstream RNA-sequencing analyses typically rely on the number of reads mapping to each gene and transcript for quantification, differential expression analysis, etc. In the rest of the paper, we define the coverage of a gene or a transcript as the number of reads mapping to it. For short, we will refer to those coverages as C_G and C_T, respectively. In this section we investigate whether the process of error correction can perturb C_G and C_T, which in turn would affect downstream analyses. Note that error correction could potentially slightly increase coverage, as uncorrected reads that were unmapped can become mappable after correction. Figure 1 shows C_G before and after correction for each tool.
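
Given primary read-to-gene assignments from an aligner, C_G is a plain tally; a minimal sketch with hypothetical read and gene names:

```python
from collections import Counter

def gene_coverage(read_to_gene):
    """C_G: number of reads mapping to each gene, from primary assignments."""
    return Counter(read_to_gene.values())

before = {"r1": "geneA", "r2": "geneA", "r3": "geneB"}
# after correction, r4 (previously unmappable) now maps to geneB
after = {"r1": "geneA", "r2": "geneA", "r3": "geneB", "r4": "geneB"}
print(gene_coverage(before))  # geneA: 2, geneB: 1
print(gene_coverage(after))   # geneA: 2, geneB: 2
```

Note that if a corrector splits one read into several fragments, each fragment gets its own entry and the tally is inflated, which is exactly the distortion discussed below for the split outputs.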

Figure 1

Number of reads mapping to genes (C_G) before and after correction for each tool. The genes taken into account here were expressed in either the raw dataset or after the correction by the given tool.

 

We can note, as expected, that splitting a long read into several well-corrected fragments generates multiple counts, skewing up the observed coverages (see HALC(s), LoRDEC(s), PBcR(s) and LoRMA(s) in Figure 1). Therefore, users are advised not to use this type of output for gene coverage estimation. Furthermore, apart from the split outputs, all the other tools presented good correlation and the expected slight increase in C_G due to better mapping, except for MECAT, which presented the lowest correlation and a significant drop in C_G.

All tools systematically presented a similar trend and lower correlation values on C_T (see Supplementary Data Section S3) in comparison to C_G. This is expected, as it is harder for a tool to correct a read into its true isoform than into its true gene. The behaviour of the tools at the isoform level is consistent with their behaviour at the gene level (C_G): split outputs inflate C_T, MECAT deflates it, and all the others present a slight increase.

Error correction perturbs gene family sizes

Table 2 indicates that error correction generally results in a lower number of detected genes. In this section we explore the impact of error correction on paralogous genes. By paralogous gene family, we denote a set of paralogs computed from Ensembl (see the 'Reference-based evaluation of long read error correction' section). Figure 2 represents the changes in size of paralogous gene families before and after correction for each tool, in terms of the number of genes expressed within a given family. Overall, error correctors do not strictly preserve the sizes of gene families. Correction more often shrinks families of paralogous genes than it expands them, likely due to erroneous correction at locations that differ between paralogs. In summary, 36–87% of gene families are kept at the same size by correctors, 1–17% are expanded and 6–61% are shrunk. Supplementary Data Figure S2 shows the magnitude of expansion/shrinkage for each gene family.
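
The expanded/same/shrunk classification above can be sketched as a set comparison per family (family and gene names below are hypothetical):

```python
def classify_families(families, expressed_before, expressed_after):
    """For each paralogous family, compare how many of its genes are
    expressed before vs after correction."""
    status = {}
    for name, genes in families.items():
        n_before = len(genes & expressed_before)
        n_after = len(genes & expressed_after)
        if n_after > n_before:
            status[name] = "expanded"
        elif n_after < n_before:
            status[name] = "shrunk"
        else:
            status[name] = "same"
    return status

families = {"famX": {"g1", "g2", "g3"}, "famY": {"g4", "g5"}}
before = {"g1", "g2", "g4", "g5"}   # genes with mapped reads in raw data
after = {"g1", "g4", "g5"}          # g2 no longer detected after correction
print(classify_families(families, before, after))
# {'famX': 'shrunk', 'famY': 'same'}
```

Here famX shrinks because reads from g2 were presumably corrected toward its paralog g1, the scenario hypothesised in the text.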

Figure 2

Summary of gene family size changes across error correction tools.

 

Error correction perturbs isoform diversity

We further investigated whether error correction introduces a bias toward the major isoform of each gene. Note that AlignQC does not directly address this question. To answer it, we computed the following metrics: number of isoforms detected in each gene before and after correction by alignment of reads to genes, coverage of lost isoforms in genes having at least two expressed isoforms and coverage of the major isoform before and after correction.

The number of isoforms varies before and after correction

Figure 3 shows the number of genes that have the same number of isoforms after correction, or a different number of isoforms (−3, −2, −1, +1, +2, +3). In this figure, only the genes that are expressed in both the raw and the corrected reads (for each tool) are taken into consideration. Negative (resp. positive) values indicate that isoforms were lost (resp. gained). We observe that a considerable number of genes (∼1.9 k for LoRDEC(s), LSC and PBcR(s), and ∼5.4 k for MECAT) lose at least one isoform with all tools, which suggests that current methods reduce isoform diversity during correction. NaS(μ), Proovread(t), Canu and MECAT tend to lose isoforms the most, while HALC(s), LoRDEC(s) and PBcR(s) identify the highest numbers of new isoforms after correction. It is however unclear whether these lost and new isoforms are real (present in the sample) or due to mapping ambiguity, as these three latter tools split corrected reads into shorter sequences that may map better to other isoforms. The effect of trimming, on the other hand, is generally slight. Overall, the number of isoforms is mostly unchanged with LoRDEC, LoRDEC(t) and LSC.

Figure 3

Histogram of genes having more or less isoforms after error correction.

 
Figure 4

Histogram of isoforms that are lost after correction, in relation to their relative transcript coverage, in genes that have two or more isoforms. The y-axis reflects the percentage of isoforms lost in each bin. Absolute values can be found in Supplementary Data Figure S3.

 
Figure 5

Coverage of the major isoform of each gene before and after error correction. The x-axis reflects the number of reads mapping to the major isoform of a gene before correction, and the y-axis after correction. Blue line, regression; black line, x = y.

 

Multi-isoform genes tend to lose lowly expressed isoforms after correction

Figure 4 explores the relative coverage of isoforms that were possibly lost after correction, in genes having two or more expressed isoforms. The relative coverage of a transcript is the number of raw reads mapping to it divided by the total number of raw reads mapping to its gene. Only the genes that are expressed in both the raw and the error-corrected reads (for each tool) are taken into consideration here. We anticipated that raw reads mapping to a minor isoform are typically either discarded by the corrector, or modified in such a way that they now map to a different isoform, possibly the major one. The effect is indeed relatively similar across all correctors, except for MECAT, which tends to remove a higher fraction of minor isoforms, and LoRDEC and LSC, which tend to be the most conservative. We can also note that trimming and splitting reads increase isoform losses even further in all tools, except for pbdagcon. This can be explained by the fact that lowly expressed isoforms possibly share regions (e.g. common exons) with highly expressed isoforms, and these shared regions are usually better corrected than regions that are unique to the lowly expressed isoforms. If read splitting then takes place, such unique regions will be removed from the output. Even if there are variations between a highly expressed isoform I and a lowly expressed isoform i, if these variations are relatively small (e.g. a small exon skipping) and are flanked by long shared regions, it is probable that the methods will truncate the variation, correcting the unique fragment of i into I and potentially losing the signal that i is expressed (this is explored in detail in the 'Correction towards the major isoform is more prevalent when the alternative exon is small' section). This result suggests that current error correction tools overall do not conservatively handle reads that belong to low-expression isoforms.
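
The relative coverage defined above can be sketched as follows, with hypothetical per-(gene, transcript) read counts:

```python
def relative_coverage(transcript_counts):
    """Relative coverage of each transcript: reads mapping to the transcript
    divided by the total reads mapping to its gene.
    Input: {(gene, transcript): read_count}."""
    gene_totals = {}
    for (gene, _), n in transcript_counts.items():
        gene_totals[gene] = gene_totals.get(gene, 0) + n
    return {t: n / gene_totals[g] for (g, t), n in transcript_counts.items()}

raw = {("geneA", "iso1"): 90, ("geneA", "iso2"): 10}
rel = relative_coverage(raw)
print(rel)  # iso1: 0.9, iso2: 0.1 -> iso2 is a minor isoform, at risk of loss
```

Binning lost isoforms by this relative coverage is what produces the histogram in Figure 4.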

Minor isoforms are corrected toward major isoforms

We define the major isoform of a gene as the isoform with the highest coverage for that gene in the raw dataset; all other isoforms are considered minor. To follow up on the previous subsection, we investigate whether correctors tend to correct minor isoforms toward major isoforms. We do so by comparing the difference in coverage of the major and the minor isoforms before and after correction. In Figure 5, we observe that the coverage of the major isoform generally slightly increases after correction. The exceptions are the tools that split reads (HALC(s), LoRDEC(s), PBcR(s) and LoRMA(s)), where the coverage increases significantly, and MECAT, where the coverage decreases significantly, likely due to a feature of MECAT’s own correction algorithm. Since these five tools seem to heavily distort the coverage of isoforms due to aggressive splitting or filtering steps, we focus now on the 14 other results. The slight increase of transcript coverage after correction is expected, as already discussed in the 'Error correction perturbs the number of reads mapping to the genes and transcripts' section: uncorrected reads that were unmapped can become mappable after correction. Therefore, the effect presented in Figure 5 could simply be due to reads being corrected to their original respective isoforms, rather than correction inducing a switch from a minor isoform to the major isoform. To verify this hypothesis, Supplementary Data Figure S4 shows that the coverage of the minor isoforms usually ‘decreases’ after correction (R² ∈ [0.5; 0.8]), except for (i) tools that split reads (HALC(s), LoRDEC(s), PBcR(s) and LoRMA(s)), which skew the coverage of minor isoforms even more, and (ii) both HALC and HALC(t). This indicates that error correction tools tend to correct reads toward the major isoforms. It is worth noting that the increase in coverage of the major isoform is not pronounced. This is expected, as the sum of the expression of the minor isoforms is, by nature, a small fraction of the total gene expression. On the other hand, the correlations of the coverage of the minor isoforms before and after correction are far weaker, suggesting a stronger effect. It is noteworthy that correction biases with respect to the major isoform do not appear to be specific to self-correctors or to hybrid correctors, but rather occur in both types of correctors.

Figure 6

Mapping of simulated raw and error-corrected reads to two simulated isoforms, and measurements of the percentage of reads mapping to the major isoform. The two isoforms represent an alternatively skipped exon of variable size 10, 50 and 100 bp. Left: isoform structure conservation using 100× short-read coverage and 10× long reads, using three error correction programs, one per row: LoRDEC, PBcR and Proovread. Right: same with three self-correctors and 100× long reads: daccord, LoRMA and pbdagcon. Columns are alternative exon sizes. Bars are plotted for each isoform ratio (50%, 75% and 90%) on the x-axis. On the y-axis, the closer a bar is to its corresponding ratio value on the x-axis, the better. For instance, the bottom left light blue bar corresponds to a 50% isoform ratio with an exon of size 10, and we do not retrieve a 50% ratio after correction with Proovread (the bar does not go up to 50% on the vertical axis, but to around 75% instead). The same layout applies to the right plot, where self-correctors are presented.

 

Correction towards the major isoform is more prevalent when the alternative exon is small

In order to observe whether particular features of alternative splicing have an impact on error correction methods, we designed a simulation over two controlled parameters: skipped exon length and isoform relative expression ratio. Using a single gene, we created a mixture of two simulated alternative transcripts: one constitutive, one exon-skipping. Several simulated read datasets were created with various relative abundances between the major and minor isoform (in order to model a local differential in splicing isoform expression) and various sizes of the skipped exon. Due to the artificial nature and small size of the datasets, many of the error correction methods could not be run. We thus tested these scenarios on a subset of the correction methods.
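
The setup above can be sketched as follows (a simplified illustration, not the paper's actual simulator: flank lengths are arbitrary, reads are full-length and error-free, whereas the real simulation layers sequencing errors on top):

```python
import random

def make_isoforms(exon_size, rng):
    """Two transcripts from one gene: a constitutive form containing the
    middle exon and an exon-skipping form without it (300 bp flanking exons)."""
    flank1 = "".join(rng.choice("ACGT") for _ in range(300))
    exon = "".join(rng.choice("ACGT") for _ in range(exon_size))
    flank2 = "".join(rng.choice("ACGT") for _ in range(300))
    return flank1 + exon + flank2, flank1 + flank2

def sample_reads(major, minor, major_ratio, n, rng):
    """Draw n reads at the requested isoform mixture ratio."""
    return [major if rng.random() < major_ratio else minor for _ in range(n)]

rng = random.Random(42)
inclusion, skipping = make_isoforms(50, rng)
reads = sample_reads(inclusion, skipping, 0.9, 1000, rng)
print(len(inclusion) - len(skipping))                    # 50: the skipped exon
print(sum(r == inclusion for r in reads) / len(reads))   # close to 0.9
```

Varying `exon_size` over {10, 50, 100} and `major_ratio` over {0.5, 0.75, 0.9} reproduces the grid of conditions shown in Figure 6; the evaluation then asks whether the post-correction ratio still matches the simulated one.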

In Figure 6, we distinguish results from hybrid and self-correctors, presented with, respectively, 100× coverage of short reads and 100× coverage of long reads, and with only 100× coverage of long reads. Results on shallower coverage (10×) and the impact of simulation parameters on corrected read sizes are presented in Supplementary Data Sections S7 and S8. Overall, hybrid correctors are less impacted by isoform collapsing than self-correctors. LoRDEC shows the best capacity to preserve isoforms in the presence of alternatively skipped exons. Thus, regardless of the abundance of inclusion reads in the dataset to be corrected, 99% of reads from the inclusion form are corrected to the inclusion form for an exon size of 10, and 100% for exon sizes of 50 and 100. However, with less coverage, e.g. due to lowly expressed genes and rare transcripts, all tools tend to mis-estimate the expression of isoforms (see Supplementary Data Sections S7 and S8). Self-correctors generally have a minimum coverage threshold (only daccord could be run on the 10× coverage dataset of long reads, with rather erratic results; see Supplementary Data Section S8). Even with higher coverage, not all correctors manage to correct this simple instance. Among all correctors, only LoRDEC seems to report the expected number of each isoform consistently in all scenarios. We could not derive any clear trend concerning the relative isoform ratios, even if the 90% ratio seems to favour overcorrection toward the major isoform. Skipped exon length seems to impact both hybrid and self-correctors, with small exons being the harder challenge for correctors.

Error correction affects splice site detection

The identification of splice sites from RNA-seq data is an important but challenging task [50]. When mapping reads to a (possibly annotated) reference genome, mapping algorithms typically guide spliced alignments using either a custom scoring function that takes into account common splice site patterns (e.g. GT-AG), and/or a database of known junctions. With long reads, the high error rate makes precise splice site detection even more challenging, as indels (see the 'Detailed error rate analysis' section) confuse aligners, shifting predicted spliced alignments away from the true splice sites.

Figure 7

Statistics on the correctly and incorrectly mapped splice sites (abbreviated SSs) for the uncorrected (raw) and corrected reads.

Table 5

Running time and memory usage of error correction tools on the 1D run RNA-seq dataset

  HALC  LoRDEC  LSC  NaS¹  PBcR²  Proovread  Canu  daccord³  daccord trimmed³  LoRMA⁴  MECAT  pbdagcon³  pbdagcon trimmed³ 
Running time  3.6 h  2.4 h  99.5 h  63.2 h  116 h  107.1 h  0.7 h  6.9 h (7.4 h)  6.6 h (7.1 h)  3.4 h  0.3 h  5.7 h (6.2 h)  5.6h (6.1h) 
Memory usage  60 GB  5.6 GB  2.7 GB  3 GB  166.5 GB  53.6 GB  2.2 GB  6.9 GB (27.2 GB)  6.8 GB (27.2 GB)  79 GB  9.9 GB  6.4 GB (27.2 GB)  6.4 GB (27.2 GB) 

¹NaS was run in batches on a different system (the TGCC cluster) than the other tools; its total running time was estimated based on a subset of batches. ²PBcR was run on a different machine from the others. ³daccord and pbdagcon need DAZZ_DB and DALIGNER to be run before performing their correction. DAZZ_DB execution time and memory usage were disregarded as negligible. DALIGNER, however, took 0.5 h and 27.2 GB of RAM. The runtime in parentheses denotes the runtime of the tool plus DALIGNER. The memory usage in parentheses denotes the maximum memory usage between the tool and DALIGNER. ⁴LoRMA used more than its allocated 32 cores during some (short) periods of the run.

In this section, we evaluate how well splice sites are detected before and after error correction. Figure 7 shows the number of correctly and incorrectly mapped splice sites for the raw and corrected reads, as computed by AlignQC. One would expect a splice site to be correctly detected when little to no errors are present in the reads mapping around it. Thus, as expected, the hybrid error correction tools present a clear advantage over the non-hybrid ones, as they better decrease the per-base error rate. In the uncorrected reads, 27% of the splice sites were incorrectly mapped, which is brought down to less than 1.2% in eight hybrid corrected outputs: HALC, HALC(t), HALC(s), LoRDEC(s), NaS(μ), PBcR(s), Proovread and Proovread(t). Notably, Proovread(t) presented only 0.28% incorrectly mapped splice sites. LoRDEC (2.43%) and LoRDEC(t) (2.12%) presented higher, but still manageable, rates, while LSC (7.27%) and LSC(t) (5.68%) underperformed among the hybrid correctors. Among self-correction tools, LoRMA presented the lowest proportion of incorrectly detected splice sites (3.04%); however, it detects ∼6.7 times fewer splice sites (∼280 k) than the raw reads (∼1.9 M), due to read splitting. The other non-hybrid tools incorrectly detected splice sites at rates between 5.61% (daccord(t)) and 11.95% (Canu). It is worth noting that trimming usually decreased the proportion of incorrectly mapped splice sites, with only a very slight impact on the total number of identified splice sites. On the other hand, the three tools with the lowest numbers of identified splice sites are those that output split reads (LoRDEC(s), PBcR(s) and LoRMA(s)), identifying fewer than ∼1.1 M splice sites, compared to ∼1.9 M in the raw reads, and are thus not adequate for splice site analyses. Additional detailed plots on incorrectly mapped splice sites can be found in Supplementary Data Section S9.
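
The correct/incorrect classification of splice sites can be sketched as a comparison of predicted junction coordinates against annotated ones (a simplified stand-in for AlignQC's logic; the optional `tolerance` parameter is an assumption added for illustration):

```python
def classify_splice_sites(predicted, annotated, tolerance=0):
    """Count predicted splice sites matching an annotated site exactly
    (or within `tolerance` bp); the rest count as incorrectly mapped.
    Sites are (chromosome, position) pairs."""
    correct = 0
    for chrom, pos in predicted:
        if any(c == chrom and abs(pos - p) <= tolerance for c, p in annotated):
            correct += 1
    return correct, len(predicted) - correct

annotated = [("chr1", 1000), ("chr1", 2500)]
predicted = [("chr1", 1000), ("chr1", 2497), ("chr2", 300)]
print(classify_splice_sites(predicted, annotated))      # (1, 2)
print(classify_splice_sites(predicted, annotated, 5))   # (2, 1): 2497 is within 5 bp
```

The 3 bp offset of the second predicted site illustrates how residual indels near a junction shift the spliced alignment slightly away from the true splice site.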

Running time and memory usage of error correction tools

Table 5 shows the running time and memory usage of all evaluated tools, measured using GNU time. The running time shown is the elapsed wall-clock time (in hours) and the memory usage is the maximum resident set size (in gigabytes). All tools were run with 32 threads. Overall, all tools were able to correct the dataset within 7 hours, except for LSC, NaS, PBcR and Proovread, which took 63–116 hours but also achieved some of the lowest post-correction error rates in Table 2 (except for LSC). In terms of memory usage, all tools required less than 10 GB of memory except for HALC, PBcR, Proovread and LoRMA, which required 53–166 GB. It is worth noting, however, that hybrid error correctors have to process massive Illumina datasets, which contributes to their higher CPU and memory usage during correction.
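
The two quantities reported in Table 5 (wall-clock time and maximum resident set size, as reported by GNU time) can also be captured programmatically; a minimal sketch using Python's standard library (note that on Linux `ru_maxrss` is reported in kilobytes, and `RUSAGE_CHILDREN` covers all waited-for child processes, not just the last one):

```python
import resource
import subprocess
import sys
import time

def run_and_measure(cmd):
    """Run a child command; return its wall-clock time (seconds) and the
    peak resident set size of waited-for children (kB on Linux)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall, peak_kb

# measure a small child process that allocates ~a million integers
wall, peak_kb = run_and_measure([sys.executable, "-c", "x = list(range(10**6))"])
print(f"wall: {wall:.2f} s, peak RSS: {peak_kb} kB")
```

This mirrors what `/usr/bin/time -v tool args...` reports as "Elapsed (wall clock) time" and "Maximum resident set size".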


Using a different read aligner mildly but not significantly affects the evaluation

We chose GMAP (version 2017-05-08 with parameters -n 10) [51] to map the long reads to the Ensembl r87 Mus musculus unmasked reference genome in our analysis, since [49] shows that it produces the best alignment results among five alignment tools. The GMAP parameters are those from the original AlignQC publication [15]. However, Minimap2 [52] is not evaluated in [49], and it is also widely used, being the default long-read mapper in several studies. In this subsection, we verify to what extent the differences between GMAP and Minimap2 can influence our evaluation. To highlight such differences, we chose correctors with the worst and best performance on some of the measures presented in the previous analysis (made with GMAP). We thus further mapped with Minimap2 (version 2.14-r883 with parameters -ax splice) the following read datasets: (i) raw reads; (ii) LoRDEC, the corrector with the lowest proportion of mapped reads, but which preserves the transcriptome diversity well; (iii) LoRMA(s), the tool with the highest ratio of mapped reads and highest number of reads, but with the lowest mean read length, number of detected genes and number of bases; (iv) Proovread(t), the method with the lowest error rate and highest rate of correctly mapped splice sites; and (v) Canu, the corrector with the highest error rate and lowest rate of correctly mapped splice sites.


Table 6

GMAP and Minimap2 results across a selection of metrics for the raw, LoRDEC, LoRMA(s), Proovread(t), and Canu reads. (G) indicates GMAP results, and (M2) Minimap2 results.

Metric  raw(G)  raw(M2)  LoRDEC(G)  LoRDEC(M2)  LoRMA(s)(G)  LoRMA(s)(M2)  Proovread(t)(G)  Proovread(t)(M2)  Canu(G)  Canu(M2)
mapped reads  83.5%  86.8%  85.5%  87.4%  98.9%  99.8%  99.4%  99.1%  99.1%  100.0%d
mapped basesa 89.0%  91.0%  90.6%  92.6%  99.5%  99.8%  99.2%  99.3%  92.0%  93.2% 
error rateb 14.11%  13.79%  4.67%  4.65%  0.33%  0.31%  2.9%  2.73%  6.2%  5.97% 
mismatches  5.26%  3.98%  2.15%  1.89%  0.15%  0.11%  0.37%  0.22%  1.22%  0.85% 
deletions  7.58%  8.09%  2.17%  2.29%  0.14%  0.17%  2.51%  2.46%  4.72%  4.79% 
insertions  1.26%  1.73%  0.34%  0.47%  0.04%  0.03%  0.03%  0.05%  0.26%  0.32% 
nb detected genes  16.8 k  16.2 k  16.8 k  16.0 k  14.6 k  14.2 k  6.8 k  6.3 k  12.4 k  11.7 k 
correctly mapped ssc 72.97%  91.46%  97.56%  98.69%  99.71%  99.58%  96.92%  98.07%  88.03%  94.67% 

a As reported by AlignQC: percentage of bases aligned among mapped reads, obtained by counting the M parts of the CIGAR strings in the BAM file; bases in unmapped reads are not counted. b As reported by AlignQC, using a sample of 1 million bases from aligned read segments. c ss stands for splice sites. d 100.0% is obtained due to rounding; a more precise mapped reads rate is 99.988%.

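
Footnote a above defines the mapped-bases rate via the M parts of CIGAR strings. A minimal sketch of that counting step (illustrative; not AlignQC's actual implementation):

```python
import re

# Each CIGAR element is a length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def mapped_bases(cigar):
    """Sum the lengths of the M (alignment match) operations of a
    CIGAR string; clipped bases, introns (N) and indels contribute
    nothing to the count."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "M")

# A spliced alignment: 50 + 30 aligned bases; the 200 nt intron (N)
# and the 5 soft-clipped bases are not counted.
assert mapped_bases("50M200N30M5S") == 80
```

Summing this quantity over all mapped reads and dividing by their total length gives the mapped-bases rate shown in Table 6.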

We present in Table 6 the main differences between Minimap2 and GMAP on the aforementioned correctors. We can note that Minimap2 is able to map more reads than GMAP across all tools but Proovread(t). However, when both mappers are able to map almost all reads (≥98.9%), the mapping rate difference between them is low (<1%). The largest differences occur when both mappers do not perform well, mapping less than 90% of the reads. In this case, Minimap2 does a better job, mapping 3.3% and 1.9% more reads than GMAP in the raw and LoRDEC reads, respectively, which is noticeable but not large. A similar conclusion can be derived by looking at the mapped bases rate. The reported error rate is very similar between both mappers, with the largest difference being 0.32% in the raw reads. There is a noticeable difference, however, when we break down the errors by type: GMAP reported more mismatches, while Minimap2 reported more deletions and insertions. Nevertheless, these differences are only noticeable in the raw reads. The measure in which GMAP clearly outperformed Minimap2 is the number of detected genes, identifying between 453 (Proovread(t)) and 723 (LoRDEC) more genes, which can be considered significant. On the other hand, Minimap2 was considerably better than GMAP at mapping splice sites correctly. The difference between both mappers when they performed well, correctly mapping more than 96.9% of the splice sites, was not significant (≤1.15%). The largest discrepancies can be seen when mapping the splice sites of the raw and Canu reads, with Minimap2 correctly aligning 18.49% and 6.64% more splice sites than GMAP, respectively.

We can thus conclude that Minimap2 was slightly better at mapping reads and bases, and significantly better at correctly mapping splice sites. On the other hand, GMAP was considerably better at detecting genes. The error rates of the reads mapped by both tools are very similar, with GMAP reporting slightly more mismatches, and Minimap2 reporting slightly more deletions and insertions. However, these differences are mainly concentrated in the raw reads dataset, which is the least accurate.

The data shown and discussed in this section suggest that the main takeaway points from the results presented in the previous sections should not change when GMAP is replaced by Minimap2. The full report of the analysis of both mappers, containing further details, can be found in the support page of our method: https://leoisl.gitlab.io/LR_EC_analyser_support/.


Analysing human Nanopore direct RNA-sequencing data

We further analysed a human Nanopore ‘direct RNA’-sequencing dataset from the Nanopore-WGS-Consortium (see Section Evaluation datasets for details). Since there was no corresponding Illumina sequencing for this dataset, we were able to evaluate only the non-hybrid error correction tools. Moreover, although LoRMA could be successfully executed, AlignQC could not process its output, so we removed LoRMA from the evaluation. Table 7 presents the main statistics of the non-hybrid error correction tools on this dataset. In the rest of this section, we highlight the major differences and similarities between the cDNA and direct RNA datasets, keeping in mind that they come from two different species (mouse and human, respectively).


Table 7

Statistics of non-hybrid error correction tools on one human Nanopore direct RNA-sequencing data from the Nanopore-WGS-Consortium

Metric  Raw  Canu  daccord  daccord(t)  MECAT  pbdagcon  pbdagcon(t)
nb reads  894 k  533 k  744 k  792 k  224 k  772 k  771 k 
mapped reads  76.0%  99.2%  97.5%  98.2%  99.2%  98.6%  98.5% 
mean length  739  917  849  755  1095  666  675 
nb of bases  587 M  484 M  620 M  591 M  243 M  510 M  516 M 
error ratea 14.61%  7.61%  6.03%  5.47%  6.36%  8.1%  8.26% 
mismatches  5.79%  2.12%  1.56%  1.3%  0.7%  1.19%  1.18% 
deletions  5.95%  4.77%  4.13%  3.9%  5.5%  6.79%  6.94% 
insertions  2.87%  0.72%  0.34%  0.27%  0.17%  0.12%  0.14% 
homop. deletions  2.18%  2.01%  1.84%  1.81%  2.24%  2.18%  2.22% 
homop. insertions  1.13%  0.3%  0.14%  0.11%  0.09%  0.06%  0.07% 
nb detected genes  14.1 k  8.8 k  12.3 k  10.8 k  6.3 k  10.0 k  10.1 k 
correctly mapped ssb 76.95%  87.97%  92.52%  93.94%  92.86%  91.16%  91.14% 

a As reported by AlignQC, using a sample of 1 million bases from aligned read segments. b ss stands for splice sites.

In general, self-correctors discarded more reads on the direct RNA dataset than on the 1D cDNA dataset. Daccord(t) discarded the fewest reads (102 k), while Canu and MECAT discarded a considerable number of reads (361 k and 670 k, respectively), due to post-correction filtering. Due to our choice of parameters, the shortest reads in the Canu and MECAT outputs were of lengths 101 and 100 bases, respectively. However, the removal of shorter reads explains only a fraction of the read losses, as the raw dataset contains only ∼59 k reads shorter than 101 bases. GMAP mapped a smaller percentage of raw reads in the direct RNA dataset (76% versus 83.5% for the 1D cDNA dataset), but on the other hand, the mapping rate of error-corrected reads was generally higher. Notably, 97.5% of the daccord reads (resp. 98.2% for daccord(t)) were mapped, as opposed to 92.5% (resp. 94%) in the 1D cDNA dataset. The mean length of the corrected reads, relative to the mean length of the raw reads, was also generally higher in the direct RNA dataset; as a result, all tools except MECAT output a number of bases closer to the number of bases in the raw reads.

The error rate of the raw direct RNA reads was 14.61%, 0.89% higher than in the raw 1D cDNA reads. As expected, the error rates of all tools were also higher on the direct RNA dataset, leading to worse results. The largest difference is with pbdagcon(t), whose error rate after correction of the direct RNA reads is 2.55% higher than on the 1D cDNA dataset. The distribution of errors in the direct RNA dataset was more balanced, with mismatches and deletions having almost the same representation (around 5.85%), but insertions still being less represented (2.87%). The correction behaviour of the tools is similar across both datasets: insertions are the best-corrected type of error, followed by mismatches, with satisfactory results, and then deletions, for which the methods overall did a poor job. In particular, pbdagcon and pbdagcon(t) even ‘increased’ the deletion error rate, by 0.84% and 0.99%, respectively. The behaviour was similar when correcting homopolymer errors: homopolymer deletions were poorly corrected, with MECAT, pbdagcon and pbdagcon(t) not reducing the homopolymer deletion error rate at all, while homopolymer insertions were well corrected. The number of detected genes in the raw direct RNA dataset (14.1 k) is lower than in the raw 1D cDNA dataset (16.8 k), although this is consistent with the difference in human/mouse gene counts. Moreover, the tools also lose more genes in the direct RNA dataset. In particular, MECAT loses 6.4 k genes in the 1D cDNA dataset, and 7.8 k in the direct RNA dataset. The rate of correctly mapped splice sites was slightly higher in the raw direct RNA reads (76.95% versus 72.94%), but in the error-corrected reads, this rate was highly similar (the largest difference being 0.44%, in the daccord(t) correction).

As a result of our evaluation, and in accordance with the cDNA analysis, care should be taken when applying self-correctors to remove errors from Nanopore direct RNA-seq data. For example, Canu and MECAT present the undesirable side effect of discarding many input reads, thus reducing the amount of information in the long reads and considerably decreasing the number of detected genes. Although the tools perform well at correcting mismatches and insertions, they have trouble correcting deletions. In particular, MECAT, pbdagcon and pbdagcon(t) perform rather poorly, with the last two even increasing the deletion error rate. Nonetheless, all tools manage to decrease the error rate so that the large majority of reads are mappable, increasing the proportion of mapped reads from 76% in the raw reads to at least 97.5%. If a very low error rate (<1%) must be achieved due to the requirements of a downstream application, then using hybrid correction tools seems to be a must.

The full report of the analysis output by our method on this dataset, containing further details, can be found at https://leoisl.gitlab.io/LR_EC_analyser_support/.


Discussion

This work sheds light on the versatility of long-read DNA error correction methods, which can also be successfully applied to the error correction of Nanopore RNA-sequencing data. In our tests, error rates can be reduced from 13.7% in the original reads down to as low as 0.3% in the corrected reads. This is perhaps an unsurprising realization, as the error correction of RNA-sequencing data presents similarities with that of DNA-sequencing data; however, a collection of caveats is described in the Results section. Most importantly, the number of genes detected by alignment of corrected reads to the genome was reduced significantly by several error correction methods, mainly the non-hybrid ones. Furthermore, depending on the method, error correction results have a more or less pronounced bias toward correction to the major isoform of each gene, jointly with a loss of the most lowly expressed isoforms. We provide software that enables automatic benchmarking of long-read RNA-sequencing error correction tools, in the hope that future error correction methods will take advantage of it to avoid biases.

Detailed error rate analysis showed that while hybrid correctors have lower error rates than self-correctors, the latter achieved performance comparable to the former in correcting substitutions and insertions. Deletions appear to be caused by systematic sequencing errors of the Nanopore technology, making them fundamentally hard (or even impossible) to address in a self-correction setting. Moreover, PBcR, NaS and Proovread are among the most resource-intensive error correction tools, but they are also among the few correctors able to reduce the base error rate below 0.7%. The only exception to this is HALC, which presents a low running time and a <0.5% error rate in its split output.

We observe that hybrid correctors were able to better preserve the number of detected genes than self-correctors. The large majority of the hybrid corrections (9/12) were able to identify a number of genes similar to the raw reads, with only NaS(μ), PBcR(s) and Proovread(t) being less sensitive, but still obtaining reasonable results. On the other hand, daccord was the only self-correction tool that reached the gene identification level of the three aforementioned hybrid tools, while the others heavily truncated the transcriptome diversity. HALC, LoRDEC, LSC, Proovread (only in full-length mode) and daccord (only in full-length mode) also appear to preserve the number of detected isoforms better than other correctors (Section The number of isoforms varies before and after correction). All tools tend to lose lowly expressed isoforms after correction (Section Multi-isoform genes tend to lose lowly expressed isoforms after correction). Several tools also tend to correct minor isoforms toward major isoforms (Section Minor isoforms are corrected toward major isoforms), mainly when the variation between them is small (Section Correction towards the major isoform is more prevalent when the alternative exon is small). These points are expected, as most tools were mainly tailored to process DNA data, where heterogeneous coverage is not expected. Furthermore, hybrid correctors outperformed self-correctors in the correction of errors near splice site junctions (Section Error correction affects splice site detection).

As a result, we conclude that no evaluated corrector outperforms all the others across all metrics and is the most suited in all situations; the choice should be guided by the downstream analysis, yet hybrid correction tools generally outperformed the self-correctors. For quantification, we have shown that error correction introduces undesirable coverage biases, as per Section Error correction perturbs the number of reads mapping to the genes and transcripts; therefore we would recommend avoiding this step altogether. For isoform detection, HALC, LoRDEC, LSC and Proovread (only in full-length mode) appear to be the methods of choice, as they result in the highest numbers of detected genes in Table 2 and also preserve the number of detected isoforms as per Section Error correction perturbs isoform diversity. If Illumina reads are not available, then daccord (only in full-length mode) seems to be the best choice. For splice site detection, we recommend using hybrid correctors, preferably HALC, NaS, PBcR or Proovread, as per Section Error correction affects splice site detection. The same four tools (with HALC in split mode and Proovread in trimmed mode) are also recommended if downstream analyses require a very low general error rate. Finally, for all other applications, we make some general recommendations. A reasonable balance appears to be achieved by the tools with no unsatisfactory values in Table 2: HALC(t), NaS(μ) and Proovread(t). If Illumina reads are unavailable, then the best overall self-correctors seem to be daccord(t), pbdagcon and pbdagcon(t). Moreover, trimming and splitting usually increase the mapping rate and the read accuracy, but can lead to loss of information that was present in the raw reads, complicating the correct identification of genes.

Our analyses relied on a single mapping software (GMAP [51]) to align raw and error-corrected reads, as in previous benchmarks [21, 49]. However, we were also able to verify that Minimap2 [52], another widely used mapper, produces similar results to GMAP (see Section Using a different read aligner mildly but not significantly affects the evaluation), and thus the main messages of the analyses presented in this paper should not change when replacing GMAP with Minimap2.

As a side note, AlignQC reports that the raw reads contained 1% of chimeric reads, i.e. reads with portions that align to different loci, or to overlapping loci. The proportion of chimeric reads after error correction remains in the 0.7–1.3% range, except for LoRDEC(s) (0.2%), PBcR(s) (0.1%), Proovread(t) (0.1%), LoRMA(s) (0.04%) and MECAT (0.2%), which either correctly split reads or discarded chimeric ones. We observe that HALC (4.2%), HALC(t) (3.9%) and daccord (1.7%) considerably increased the proportion of chimeric reads in the output.

Furthermore, we focused our evaluation on a single technology: Nanopore. We did an extensive analysis of 1D cDNA Nanopore data, using Illumina data for hybrid correction. We also performed a brief analysis of Nanopore direct RNA-seq data. While it would be natural to also evaluate PacBio data, we note that data from the PacBio Iso-Seq protocol is of a different nature, as the reads are pre-corrected by circular consensus.

In the evaluation of tools, we did not record the disk space used by each method, yet we note that it may be a critical factor for some tools (e.g. Canu) on larger datasets. We note also that genes with low Illumina coverage are unlikely to be well corrected by hybrid correctors. Therefore, our comparison does not take into account differences in coverage biases between Illumina and Nanopore data, which may benefit self-correctors. Finally, transcript and gene coverages are derived from the number of long reads aligning to a given gene/transcript. This method makes it possible to directly relate the results of error correction to transcript/gene counts, but we note that in current RNA-seq analysis protocols, transcript/gene expression is still generally evaluated using short reads.


Methods

Nanopore library preparation and sequencing

For RNA MinION sequencing, cDNA was prepared from four aliquots (250 ng each) of commercial mouse total RNA (brain; Clontech, Cat# 636601), according to the Oxford Nanopore Technologies (Oxford Nanopore Technologies Ltd, Oxford, UK) protocol ‘1D cDNA by ligation (SQK-LSK108)’. The data generated by the MinION software (MinKNOW 1.1.21, Metrichor 2.43.1) were stored and organized using a Hierarchical Data Format. FASTA reads were extracted from the MinION HDF files using poretools [53]. We obtained 1 256 967 Nanopore 1D reads representing around 2 Gbp of data, with an average size of 1650 bp and an N50 of 1885 bp. These reads were then filtered in silico to remove mtRNA and rRNA using BLAT [54] and est2genome [55], yielding 740 776 long reads.


Illumina library preparation and sequencing

RNA-Seq library preparations were carried out from 500 ng total RNA using the TruSeq Stranded mRNA kit (Illumina, San Diego, CA, USA), which allows mRNA strand orientation (sequence reads occur in the same orientation as anti-sense RNA). After quantification by qPCR, each library was sequenced using 151 bp paired end reads chemistry on a HiSeq4000 Illumina sequencer.


Reference-based evaluation of long read error correction

A tool named LR_EC_analyser, available at https://gitlab.com/leoisl/LR_EC_analyser, was developed in Python to analyse the output of long-read error correctors. The required arguments are the BAM files of the raw and corrected reads aligned to a reference annotated genome, as well as the reference genome in FASTA format and the reference annotation in GTF format. A file specifying the paralogous gene families can also be provided if plots on gene families are to be created. In our main analysis, gene families were computed by selecting all paralogs from Ensembl r87 mouse genes with 80%+ identity. Note that paralogs from the same family may have significantly different lengths, and no threshold was applied with respect to coverage. The complete selection procedure is reported here: https://gitlab.com/leoisl/LR_EC_analyser/blob/master/GettingParalogs.txt. The main processing of our method involves running the AlignQC software [21] (https://github.com/jason-weirather/AlignQC) on the input BAMs and parsing its output to create custom plots. It then aggregates the information into an HTML report. For example, Tables 2–4 are compilations from AlignQC results, as are Tables 6 and 7 and Figure 7; Figures 1–5 were created by processing text files built by AlignQC, called ‘Raw data’ in its output. In addition, an in-depth gene and transcript analysis can be performed using the IGV.js library (https://github.com/igvteam/igv.js) [56, 57]. In this paper, we did not include all plots and tables created by the tool. To visualize the full latest reports, visit https://leoisl.gitlab.io/LR_EC_analyser_support/.


Simulation framework for biases evaluation

In the simulation framework of Section Correction towards the major isoform is more prevalent when the alternative exon is small, exon length and number were chosen to resemble what is reported in eukaryotes [58] (8 exons of 200 nucleotides). A skipped exon, whose size can vary, was introduced in the middle of the inclusion isoform. The skipped exon can have a size of 10, 50 or 100 nt. We also allowed the ratio of major/minor isoforms (M/m) to vary. For a coverage of C and a ratio M/m, the number of reads coming from the major isoform is MC and the number of minor isoform reads is mC. We chose the following relative abundance ratios for the inclusion isoform: 90/10, 75/25 and 50/50. All reads are supposed to represent the full-length isoform. Finally, for hybrid correction input, short reads of length 150 were simulated along each isoform, with 10× and 100× coverage.
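
The setup above can be sketched as follows; this is an illustrative reconstruction (the sequence content and function names are our own), not the actual simulation code:

```python
import random

random.seed(1)

EXON_LEN, N_EXONS = 200, 8  # constitutive exon size and count, as in [58]

def make_isoforms(skipped_len):
    """Build an (inclusion, exclusion) isoform pair: 8 constitutive
    exons of 200 nt, with a skipped exon of variable size (10, 50 or
    100 nt in our setup) inserted in the middle of the inclusion."""
    exons = ["".join(random.choice("ACGT") for _ in range(EXON_LEN))
             for _ in range(N_EXONS)]
    skipped = "".join(random.choice("ACGT") for _ in range(skipped_len))
    exclusion = "".join(exons)
    inclusion = ("".join(exons[:N_EXONS // 2]) + skipped
                 + "".join(exons[N_EXONS // 2:]))
    return inclusion, exclusion

def read_counts(coverage, major, minor):
    """For coverage C and an abundance ratio M/m, the major isoform
    yields M*C reads and the minor isoform m*C reads."""
    return round(coverage * major), round(coverage * minor)

inclusion, exclusion = make_isoforms(50)
```

For instance, at coverage 100 and a 90/10 ratio, `read_counts(100, 0.9, 0.1)` gives 90 major-isoform reads and 10 minor-isoform reads.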

During the simulation, we produced two versions of each read. The reference read is the read that represents exactly its isoform, without errors. The uncorrected read is the one in which we introduced errors. We used an error rate and profile that mimic the errors observed in R9.4 ONT reads (total error rate of ∼13%, broken down as ∼5% substitutions, ∼1% insertions and ∼7% deletions). After each corrector was applied to the read set, we obtained a (reference, uncorrected, corrected) triplet for each read, which we used to assess the quality of the correction under several criteria.
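
The error-injection step can be sketched with the stated R9.4-like profile (∼5% substitutions, ∼1% insertions, ∼7% deletions). Drawing errors independently per base is an assumption of ours; the actual simulator's error model may differ:

```python
import random

random.seed(0)

def add_errors(seq, p_sub=0.05, p_ins=0.01, p_del=0.07):
    """Corrupt a sequence with substitutions, insertions and deletions
    drawn independently at each base, mimicking an R9.4-like ONT
    profile (~13% total error rate)."""
    out = []
    for base in seq:
        r = random.random()
        if r < p_del:
            continue  # deletion: drop the base
        if r < p_del + p_sub:
            out.append(random.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
        if random.random() < p_ins:
            out.append(random.choice("ACGT"))
    return "".join(out)

reference = "".join(random.choice("ACGT") for _ in range(10000))
uncorrected = add_errors(reference)
```

Since deletions outnumber insertions, the uncorrected read ends up slightly shorter than its reference, as observed in real ONT reads.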

We mapped the corrected reads to both the exclusion and inclusion reference sequences using a fast Smith–Waterman implementation [59], from which we obtained a SAM file. It is expected that corrected exclusion reads will map to the exclusion reference with no gaps, and that a deletion of the size of the skipped exon will be reported when mapping them to the inclusion reference. For each read, if it could be aligned to one of the two reference sequences in one block (according to the CIGAR string), then we assigned it to this reference. If more blocks were needed, we assigned the read to the reference sequence for which the cumulative length of gaps is the lowest. We also reported the ratio between the corrected read size for each isoform kind and the expected size of the corresponding reference isoform.
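
The assignment rule can be sketched from the CIGAR strings of the two alignments. This is an illustrative simplification: summing the I, D and N operation lengths as the ‘cumulative gap length’ and breaking ties toward the inclusion isoform are both assumptions of this sketch:

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def gap_length(cigar):
    """Cumulative length of gaps (insertion, deletion and skip
    operations) in an alignment's CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "IDN")

def assign(cigar_vs_inclusion, cigar_vs_exclusion):
    """Assign a corrected read to the reference isoform whose alignment
    requires the smallest cumulative gap length; a gap-free one-block
    alignment therefore always wins."""
    if gap_length(cigar_vs_inclusion) <= gap_length(cigar_vs_exclusion):
        return "inclusion"
    return "exclusion"

# An exclusion read aligned to the inclusion isoform shows a deletion
# the size of the skipped exon (here 50 nt), so it is assigned to the
# exclusion reference:
assert assign("800M50D800M", "1600M") == "exclusion"
```

Conversely, an inclusion read aligned to the exclusion reference shows an insertion of the skipped exon's size and is assigned to the inclusion reference.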


Key Points
  • Long-read transcriptome sequencing is hindered by high error rates that affect analyses such as the identification of isoforms, exon boundaries, open reading frames and the creation of gene catalogues.

  • This review evaluates the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads.

  • Existing tools significantly lower the error rate, but they also significantly perturb gene family sizes and isoform diversity.


Competing interests

JMA is one of the authors of the NaS error correction tool [29].

However, this study was designed and performed with no bias toward this particular tool.

JMA is part of the MinION Access Programme and received travel and accommodation expenses to speak at Oxford Nanopore Technologies conferences.


Acknowledgments

The authors thank the ASTER consortium, in particular Camille Sessegolo and Vincent Lacroix, for fruitful discussions. L.L. acknowledges CNPq/Brazil for the support. This work was performed using the computing facilities of the CC LBBE/PRABI and the France Génomique e-infrastructure (ANR-10-INBS-09-08).

Funding

This work was supported by the French National Research Agency [ANR ASTER, ANR-16-CE23-0001]; the INCEPTION project [PIA/ANR-16-CONV-0005]; and the Brazilian Ministry of Science, Technology and Innovation (in Portuguese, Ministério da Ciência, Tecnologia e Inovação—MCTI) through the National Counsel of Technological and Scientific Development (in Portuguese, Conselho Nacional de Desenvolvimento Científico e Tecnológico—CNPq), under the Science Without Borders (in Portuguese, Ciências Sem Fronteiras) scholarship grant [process number 203362/2014-4 to L.L.].

All authors are part of the ASTER project (ANR ASTER) with the purpose of developing algorithms and software for analysing third-generation sequencing data.

References