Abstract

Motivation

Nanopore long-read sequencing technology offers promising alternatives to high-throughput short-read sequencing, especially in the context of RNA-sequencing. However, this technology is currently hindered by high error rates in the output data, which affect analyses such as the identification of isoforms, exon boundaries and open reading frames, and the creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed, and options for the error correction of Nanopore RNA-sequencing long reads remain limited.

Results

In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection. We find that long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type.

Introduction

The most commonly used technique to study transcriptomes is RNA-sequencing. As such, many tools have been developed to process short (e.g. Illumina) RNA-seq reads. Assembling a transcriptome from short reads is a central task for which many methods are available. When a reference genome or reference transcriptome is available, reference-based assemblers can be used (such as Cufflinks [1], Scallop [2], Scripture [3] and StringTie [4]). When no reference is available, de novo transcriptome assembly can be performed (using tools such as Oases [5], SOAPdenovo-Trans [6], Trans-ABySS [7] and Trinity [8]). Potential disadvantages of reference-based strategies include the following:

(i) the resulting assemblies might be biased toward the used reference, and true variations might be discarded in favour of known isoforms;

(ii) they are unsuitable for samples with a partial or missing reference genome [8];

(iii) such methods depend on correct read-to-reference alignment, a task that is complicated by splicing, sequencing errors, polyploidy, multiple read mapping, mismatches caused by genome variation and the lack or incompleteness of many reference genomes [7, 8];

and (iv) sometimes, the model being studied is sufficiently different from the reference because it comes from a different strain or line such that the mappings are not altogether reliable [5].

On the other hand, some of the shortcomings of de novo transcriptome assemblers are as follows:

(i) low-abundance transcripts are likely to not be fully assembled [9];

(ii) reconstruction heuristics are usually employed, which may lead to missing alternative transcripts, and highly similar transcripts are likely to be assembled into a single transcript [10];

(iii) homologous or repetitive regions may result in incomplete assemblies [11];

and (iv) accuracy of transcript assembly is called into question when a gene exhibits complex isoform expression [11].

Recent advances in long-read sequencing technology have enabled longer, up to full-length, sequencing of RNA molecules. This new approach has the potential to eliminate the need for transcriptome assembly, and thus to remove from transcriptome analysis pipelines all the biases introduced by the assembly step. Long-read sequencing can be done using either cDNA-based or direct RNA protocols from Oxford Nanopore (referred to as ‘ONT’ or ‘Nanopore’) and Pacific Biosciences (PacBio). The Iso-Seq protocol from PacBio consists of a size selection step, sequencing of cDNAs and finally a set of computational steps that produce sequences of full-length transcripts. ONT has three different experimental protocols for sequencing RNA molecules: cDNA transformation with amplification, direct cDNA (with or without amplification) and direct RNA.

Long-read sequencing is increasingly used in transcriptome studies, not just to prevent problems caused by short-read transcriptome assembly, but also for several of the following reasons. Mainly, long reads can better describe exon/intron combinations [12]. The Iso-Seq protocol has been used for isoform identification, including transcripts identification [13], de novo isoform discovery [14] and fusion transcript detection [15]. Nanopore has recently been used for isoform identification [16] and quantification [17].

The sequencing throughput of long-read technologies has increased significantly over the years. It is now conceivable to sequence a full eukaryote transcriptome using either long reads only, or a combination of high-coverage long and short (Illumina) reads. Unlike the Iso-Seq protocol, which requires extensive in silico processing prior to primary analysis [18], raw Nanopore reads can in principle be readily analysed. Direct RNA reads also permit the analysis of base modifications [19], unlike all other cDNA-based sequencing technologies. There also exist circular sequencing techniques for Nanopore, such as INC-Seq [20], that aim at reducing error rates at the expense of a special library preparation. With raw long reads, it is up to the primary analysis software (typically a mapping algorithm) to deal with sequences that have a significant per-base error rate, currently around 13% [21].

In principle, a high error rate in the data complicates the analysis of transcriptomes, especially for the accurate detection of exon boundaries or the quantification of similar isoforms and paralogous genes. Reads need to be aligned unambiguously and with high base pair accuracy to either a reference genome or transcriptome. Indels (i.e. insertions/deletions) are the main type of errors produced by long-read technologies, and they confuse aligners more than substitution errors [22]. Many methods have been developed to correct errors in RNA-seq reads, mainly in the short-read era [23, 24]. They no longer apply to long reads because they were developed to deal with low error rates, and principally substitutions. However, a new set of methods has been proposed to correct genomic long reads. There exist two types of long-read error correction algorithms: those using information from long reads only (‘self’ or ‘non-hybrid’ correction), and those using short reads to correct long reads (‘hybrid’ correction). In this article, we report the extent to which state-of-the-art tools can correct long, noisy RNA-seq reads produced by Nanopore sequencers.

Several tools exist for error-correcting long reads, including ONT reads. Even though the error profiles of Nanopore and PacBio reads differ, their error rates are quite similar, and it is reasonable to expect tools originally designed for PacBio data to also perform well on recent Nanopore data. To the best of our knowledge, there is very little prior work that specifically addresses error correction of RNA-seq long reads. Notable exceptions include the following: (i) LSC [25], which is designed to error correct PacBio RNA-seq long reads using Illumina RNA-seq short reads; and (ii) PBcR [26] and (iii) HALC [27], which are mainly designed for genomes but are also evaluated on transcriptomic data. Here we take the standpoint of evaluating long-read error correction tools on RNA-seq data, most of which were designed to process DNA sequencing data only.

We evaluate the following DNA hybrid correction tools: HALC [27], LoRDEC [28], NaS [29], PBcR [26] and proovread [30]; and the following DNA self-correction tools: Canu [31], daccord [32], LoRMA [33], MECAT [34] and pbdagcon [35]. We also evaluate an additional hybrid tool, LSC [25], the only one specifically designed to error correct (PacBio) RNA-seq long reads. A majority of hybrid correction methods employ mapping strategies to place short fragments on long reads and correct long read regions using the related short-read sequences. But some of them rely on graphs to create a consensus that is used for correction. These graphs are either k-mer graphs (de Bruijn graphs) or nucleotide graphs resulting from multiple alignments of sequences (partial order alignment). For self-correction methods, strategies using the aforementioned graphs are the most common. We have also considered evaluating nanocorrect [36], nanopolish [36], Falcon_sense [37] and LSCPlus [38], but some tools were deprecated, not suitable for read correction or unavailable. Our detailed justifications can be found in Section S1.12 of the Supplementary Data. We have selected what we believe is a representative set of tools but there also exist other tools that were not considered in this study, e.g. HG-Color [39], HECIL [40], MIRCA [41], Jabba [42], nanocorr [43] and Racon [44]. 
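
Several of the graph-based correctors above (e.g. LoRDEC, and LoRMA for self-correction) share one core idea: build a de Bruijn graph from accurate sequences, then bridge an erroneous long-read region by a path between its two flanking solid k-mers. Below is a minimal sketch of that idea under toy inputs; `build_dbg` and `correct_region` are illustrative names, not any tool's actual API, and real tools use far more elaborate search and scoring.

```python
from collections import defaultdict

def build_dbg(short_reads, k):
    """de Bruijn graph over the k-mers of the short reads: an edge links
    two k-mers that overlap by k-1 bases and occur consecutively."""
    edges = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k):
            edges[read[i:i + k]].add(read[i + 1:i + 1 + k])
    return edges

def correct_region(edges, left_kmer, right_kmer, max_len):
    """Depth-first search for a path from the solid k-mer flanking an
    erroneous long-read region on the left to the solid k-mer flanking
    it on the right; the spelled path replaces the region."""
    stack = [(left_kmer, left_kmer)]
    while stack:
        kmer, spelled = stack.pop()
        if kmer == right_kmer:
            return spelled
        if len(spelled) >= max_len:
            continue  # bound the search to avoid cycling forever
        for nxt in edges[kmer]:
            stack.append((nxt, spelled + nxt[-1]))
    return None  # no bridging path: leave the region uncorrected
```

When no path of plausible length exists, the region is left uncorrected, which is one reason split outputs (keeping only well-corrected fragments) reach lower error rates than full-length outputs.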

Table 1

Main characteristics of the error correction tools considered in this study

(A) Hybrid tools

HALC [27]
Context: RNA or DNA. Technology: PacBio.
Main algorithmic idea: Aligns short-read contigs to long reads. HALC implements a strategy to disambiguate multiple alignments instead of avoiding them, and uses a short-read contig graph in order to select the best set of contigs to correct a long-read region.

LoRDEC [28]
Context: DNA. Technology: PacBio or ONT.
Main algorithmic idea: Builds a de Bruijn graph (dBG) from the short reads, then searches for paths between solid k-mers in the long reads; the regions between k-mers are corrected with the optimal path.

LSC [25]
Context: RNA. Technology: PacBio.
Main algorithmic idea: Applies homopolymer compression to the short and long reads to increase alignment recall, recruits short reads onto the long-read sequences, corrects errors from the short-read sequences and uses homopolymers from the short reads to replace those in the long reads.

NaS [29]
Context: DNA. Technology: ONT.
Main algorithmic idea: Long reads are used as templates to recruit short (seed) reads through alignment. The set of seeds is extended by searching, using shared k-mers, for similar reads in the initial read set. A micro-assembly of these reads is performed; the resulting contigs are aligned back to the input long reads, and a path in the contig graph is used as the corrected read.

PBcR [26]
Context: RNA or DNA. Technology: PacBio or ONT.
Main algorithmic idea: Aligns short reads to long reads, builds a multiple alignment of the short reads recruited to each region, and derives a consensus from the multiple alignment.

Proovread [30]
Context: DNA. Technology: PacBio.
Main algorithmic idea: Aligns short reads to long reads and calls a consensus, using a specific scoring system to adapt the mapping to the high error rates.

(B) Non-hybrid tools

Canu [31]
Context: DNA. Technology: PacBio or ONT.
Main algorithmic idea: First, all-versus-all read overlapping and alignment filtering. A directed acyclic graph is built from the alignments, producing quasi-multiple alignments; highest-weight path searches in these graphs yield consensus sequences that are used for correction.

daccord [32]
Context: DNA. Technology: PacBio.
Main algorithmic idea: Reads are compared pairwise to obtain alignment piles. Several overlapping windows of hundreds of nucleotides are derived from these piles, on which micro-assembly (a dBG with very small k-mers) is performed. The heaviest path in each dBG is heuristically selected as the consensus to correct the given window.

LoRMA [33]
Context: DNA. Technology: PacBio or ONT.
Main algorithmic idea: Path search in a dBG built from the long reads, iterating over increasing k-mer sizes for graph construction. The same framework as LoRDEC is used to correct read regions.

MECAT [34]
Context: DNA. Technology: PacBio or ONT.
Main algorithmic idea: k-mer-based read matching, pairwise alignment between matched reads, alignment-based consensus calling on ‘easy’ regions and local consensus calling (partial order graph) otherwise.

pbdagcon [35]
Context: DNA. Technology: PacBio.
Main algorithmic idea: Aligns long reads to ‘backbone’ sequences, then corrects by iterative directed acyclic graph consensus calling from the multiple sequence alignments.
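
The homopolymer compression step that LSC applies before alignment (Table 1) can be sketched in a few lines: each run of identical bases is collapsed to one base, and the run lengths are kept so the compression can be inverted after alignment. `hp_compress` and `hp_expand` are illustrative names; a real implementation also tracks coordinate maps so alignments can be lifted back to the original reads.

```python
from itertools import groupby

def hp_compress(seq):
    """Collapse each run of identical bases to a single base and record
    the run lengths, so that the compression is reversible."""
    bases, runs = [], []
    for base, group in groupby(seq):
        bases.append(base)
        runs.append(len(list(group)))
    return "".join(bases), runs

def hp_expand(bases, runs):
    """Invert hp_compress: re-expand each base to its recorded run."""
    return "".join(b * n for b, n in zip(bases, runs))
```

Compressing both read sets removes the homopolymer-length disagreements that dominate Nanopore/PacBio indel errors, which is why it increases alignment recall.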

Other works have evaluated error correction tools in the context of DNA sequencing.

LRCstats [45] and, more recently, ELECTOR [46] provide automated evaluations of genomic long-read correction using a simulated framework. A technical report [47] provides an extensive evaluation of PacBio/Nanopore error correction tools in the context of de novo assembly. This analysis is complemented by more recent results on hybrid correction methods in [48]. Perhaps the closest work to ours is the AlignQC software [21], which provides a set of metrics for evaluating the quality of RNA-sequencing long-read datasets. In [21], a comparison is provided between Nanopore and PacBio RNA-sequencing datasets in terms of error patterns, isoform identification and quantification. While [21] did not compare error correction tools, we will use and extend AlignQC metrics for that purpose.

In this article, we will focus on the qualitative and quantitative measurements of Nanopore error-corrected long reads, with transcriptomic features in mind. First we examine basic metrics of error correction, e.g. mean length, base accuracy, homopolymers errors and performance (running time, memory) of the tools. Then we ask several questions that are specific to transcriptome applications: (i) How is the number of detected genes, and more precisely the number of genes within a gene family, impacted by read error correction? (ii) Can error correction significantly change the number of reads mapping to genes or transcripts, possibly affecting downstream analysis based on these metrics? (iii) Do error correction tools perturb isoform diversity, e.g. by having a correction bias toward the major isoform? (iv) What is the impact of error correction on identifying splice sites? To answer these questions, we provide an automatic framework (LC_EC_analyser, Methods) for the evaluation of transcriptomic error correction methods, that we apply to 11 different error correction tools.

Results

Error correction tools

Table 1 presents the main characteristics of the hybrid and non-hybrid error correction tools that were considered in this study. For the sake of reproducibility, all versions, dependencies and parameters are described in Supplementary Data Section S1. Note that all of these error correction tools were tailored for DNA-seq data, except for LSC.

The output of each error correction method can be classified into one of the four following types: full-length, trimmed, split and micro-assembly. Usually, due to methodological reasons, the extremities of long reads are harder to correct. For example, hybrid correctors based on mapping short reads to long reads, and calling a consensus from the mapping, have difficulty aligning short reads to the extremities of long reads. As such, some methods output ‘trimmed error-corrected reads’, i.e. error-corrected reads with their uncorrected ends removed. Methods considered in this study that produce this type of output are HALC, LoRDEC, LSC, proovread, daccord and pbdagcon. Sometimes, internal parts of long reads can also be hard to correct, due to a lack of short-read coverage, a drop in sequencing quality or mapping issues. Some algorithms thus output ‘split error-corrected reads’, splitting one long read into several well-corrected fragments; these include HALC, LoRDEC, PBcR and LoRMA. Finally, some tools neither trim nor split the original reads, outputting ‘full-length error-corrected reads’; examples include HALC, LoRDEC, LSC, proovread, Canu, daccord, MECAT and pbdagcon. NaS does not fit the previous three categories, as it uses a micro-assembly strategy instead of a classical polishing of the consensus: the long read is used as a template to recruit Illumina reads and, through a local assembly, build a high-quality synthetic read. As can be noted, some tools produce more than one type of output, sometimes three.

In the following sections, we add the suffixes ‘(t)’ and ‘(s)’ to a tool name to denote its trimmed and split outputs, respectively. We further add the suffix ‘(μ)’ to NaS, as a reminder that it is based on a micro-assembly strategy. Outputs without suffixes are full-length corrections. For example, HALC denotes the HALC full-length error-corrected reads; HALC(t), the HALC trimmed output; and HALC(s), the HALC split output. As we will see, no type of output outperforms all the others in all metrics; choosing the appropriate type of output is heavily dependent on the application.
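
The difference between trimmed and split outputs can be illustrated with a small sketch, assuming the convention (used, e.g., by LoRDEC) that corrected bases are written in uppercase and uncorrected bases in lowercase; `trim_read` and `split_read` are illustrative names, not functions of any of the evaluated tools.

```python
import re

def trim_read(read):
    """Trimmed output: drop only the uncorrected (lowercase) ends,
    keeping any internal uncorrected stretches in place."""
    return read.strip("acgtn")

def split_read(read):
    """Split output: keep each maximal well-corrected (uppercase) run
    as a separate fragment, discarding uncorrected stretches."""
    return re.findall(r"[ACGTN]+", read)
```

For the read `acACGTacgGTCAca`, trimming yields `ACGTacgGTCA` (one read, internal lowercase kept), while splitting yields two fragments, `ACGT` and `GTCA` — which is why split outputs have lower error rates but more, shorter reads.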

Evaluation datasets

Our main evaluation dataset consists of a single 1D Nanopore run using the cDNA preparation kit on RNA material taken from a mouse brain, containing 740 776 long reads. An additional Illumina dataset containing 58 million paired-end 151 bp reads was generated on the same sample but using a different cDNA protocol. For more details on the sequencing protocol, see the Methods section. The Nanopore and Illumina reads from the mouse RNA sample are available in the ENA repository under the following study: PRJEB25574. In this paper, we provide a detailed analysis of this dataset, from the ‘Error correction improves base accuracy and splits, trims or entirely removes reads’ section to the ‘Using a different read aligner mildly but not significantly affects the evaluation’ section.

Table 2

Statistics of error correction tools on the 1D run RNA-seq dataset. To facilitate the readability of this table and the next ones, we highlighted values that we deemed satisfactory in green colour, borderline in brown and unsatisfactory in red, noting that such classification is somewhat arbitrary.

(A) Hybrid tools
  Raw  HALC  HALC(t)  HALC(s)  LoRDEC  LoRDEC(t)  LoRDEC(s)  LSC  LSC(t)  NaS(μ)  PBcR(s)  Proovread  Proovread(t)
nb reads  741 k  741 k  709 k  914 k  741 k  677 k  1388 k  619 k  619 k  619 k  1321 k  738 k  626 k 
mapped reads  83.5%  88.1%  95.6%  98.8%  85.5%  95.5%  97.5%  97.1%  97.6%  98.7%  99.2%  85.5%  98.9% 
mean length  2011  2174  1926  1378  2097  1953  816  2212  1901  1931  776  2117  1796 
nb bases  1313 M  1469 M  1334 M  1245 M  1394 M  1289 M  1106 M  1332 M  1151 M  1179 M  1015 M  1400 M  1112 M 
mapped bases (a)  89.0%  90.3%  96.6%  99.2%  90.6%  95.9%  99.1%  90.9%  97.7%  97.5%  99.2%  92.4%  99.5% 
error rate (b)  13.72%  1.85%  1.32%  0.44%  4.5%  3.73%  1.59%  5.45%  4.36%  0.38%  0.68%  2.65%  0.33% 
nb detected genes  16.8 k  17.0 k  16.8 k  16.6 k  16.8 k  16.6 k  16.5 k  16.5 k  16.2 k  15.0 k  15.6 k  16.6 k  14.6 k 
(B) Non-hybrid tools 
  Raw  Canu  daccord  daccord(t)  LoRMA(s)  MECAT  pbdagcon  pbdagcon(t) 
nb reads  741 k  519 k  675 k  840 k  1540 k  495 k  778 k  775 k 
mapped reads  83.5%  99.1%  92.5%  94.0%  99.4%  99.4%  98.2%  97.9% 
mean length  2011  2192  2002  1476  497  1992  1473  1484 
nb of bases  1313 M  1125 M  1350 M  1212 M  760 M  980 M  1136 M  1141 M 
mapped bases (a)  89.0%  92.0%  92.5%  94.7%  99.2%  96.9%  97.0%  96.7% 
error rate (b)  13.72%  6.43%  5.2%  4.12%  2.91%  4.57%  5.65%  5.71% 
nb detected genes  16.8 k  12.4 k  15.5 k  13.9 k  6.8 k  10.4 k  13.2 k  13.2 k 

(a) As reported by AlignQC: percentage of bases aligned among mapped reads, computed by counting the M parts of the CIGAR strings in the BAM file; bases in unmapped reads are not counted. (b) As reported by AlignQC, using a sample of 1 million bases from aligned read segments.
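
The ‘mapped bases’ definition in the footnote can be read, in simplified stand-alone form, as a ratio of query-consuming CIGAR operations. The sketch below parses plain CIGAR strings with the standard library only; `mapped_base_fraction` is an illustrative name, not AlignQC's code, and AlignQC itself works directly on BAM records.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def mapped_base_fraction(cigars):
    """Fraction of aligned bases (CIGAR M, =, X) among all read bases of
    the mapped reads. Read bases are the query-consuming operations:
    M, I, S, = and X; D, N, H and P consume no query bases."""
    aligned = total = 0
    for cigar in cigars:
        for length, op in CIGAR_OP.findall(cigar):
            length = int(length)
            if op in "M=X":
                aligned += length
            if op in "MIS=X":
                total += length
    return aligned / total if total else 0.0
```

For example, the three CIGARs `10M`, `5M2I3M` and `4S6M` give 24 aligned bases out of 30 read bases, i.e. 80% mapped bases.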

In order to obtain a more comprehensive understanding of the evaluated tools, we further analysed the correction of the methods on a human Nanopore direct RNA-sequencing dataset from the Nanopore-WGS-Consortium (dataset from centre Bham, run#1, sample type RNA, kit SQK-RNA001, pore R9.4, available at https://github.com/nanopore-wgs-consortium/NA12878/blob/master/nanopore-human-transcriptome/fastq_fast5_bulk.md). We concatenated the fail and pass RNA-direct reads from this dataset, obtaining 894 289 reads. Further, to correctly run all tools, we converted U bases into T.

Error correction improves base accuracy and splits, trims or entirely removes reads

Table 2 shows an evaluation of error correction based on AlignQC results, for the hybrid and non-hybrid tools. The error rate is 13.72% in raw reads, 0.33–5.45% for reads corrected using hybrid methods and 2.91–6.43% with self-correctors. Notably, the hybrid tools NaS(μ), Proovread(t) and HALC(s) output micro-assembled, trimmed and split error-corrected reads, respectively, with the lowest error rates (<0.5%). We observe that HALC produced the full-length error-corrected reads with the lowest error rate (1.85%), which is still significantly higher than the error rate of the three aforementioned methods. This is expected, as micro-assembling, trimming or splitting reads usually discards badly corrected regions of the reads, lowering the error rate. LoRMA(s), the only split self-correction tool, decreased the error rate the most among non-hybrid tools, but still only reached 2.91%, one order of magnitude higher than the best hybrid correctors. If we look at non-split outputs among the self-correctors, MECAT and daccord(t) obtained the lowest error rates for full-length and trimmed error-corrected reads, respectively, but still with error rates above 4%. It is no surprise that the best error correctors are hybrid when looking at error rates, given their use of additional high-quality Illumina reads. As expected, trimming and splitting error-corrected reads greatly reduces the error rate, e.g. LoRDEC reduced the error rate from 4.5% to 3.73% by trimming, and to 1.59% by splitting. As such, the split outputs consistently outperformed trimmed and full-length outputs regarding the error rate. A detailed error rate analysis is carried out in the ‘Detailed error-rate analysis’ section.

In terms of throughput after the correction step, tools that neither trim nor split reads tend to return a number of reads similar to that of the uncorrected (raw) reads. Notably, HALC and LoRDEC returned exactly the same number of reads, and Proovread returned just 3 k fewer reads. On the other hand, Canu and MECAT greatly decreased the number of output reads, probably due to post-filtering procedures. Moreover, many of the full-length outputs (HALC, LoRDEC, LSC, Proovread and daccord) increased the mean length of the raw reads while also increasing the number of output bases, showing that they tend to extend the information contained in the long reads.

Trimming almost always decreased the number of output reads, as in HALC(t), LoRDEC(t), Proovread(t) and pbdagcon(t), probably due to post-filtering procedures. However, in LSC(t), trimming had no effect on the number of reads, and in daccord(t), trimming actually increased the number of reads. In half of the trimmed outputs (HALC(t), LoRDEC(t) and LSC(t)), the mean length of the reads was largely preserved, decreasing only by around 100 bp. In the other half (Proovread(t), daccord(t) and pbdagcon(t)), read lengths were, on average, reduced by 200–500 bp.

Splitting reads significantly increased the number of output reads, as expected. LoRDEC(s), PBcR(s) and LoRMA(s) tend to split reads into two or more shorter reads during the correction step, as they return ∼2× more reads after correction that are also shorter (mean lengths of, respectively, 816, 776 and 497 bp, versus 2011 bp in raw reads) and overall have significantly fewer bases in total (losses of, respectively, 207, 298 and 553 Mbp). HALC(s), on the other hand, increased the number of reads by only 23%, with no significant loss of bases, but still with a significant reduction in mean length (1378 bp). We observe that NaS(μ), based on micro-assembly, obtained a mean length similar to the trimmed outputs (HALC(t), LoRDEC(t) and LSC(t)). This suggests that NaS(μ) has trouble either getting seed short reads, recruiting short reads mapping to the ends of long reads, or assembling the reads mapped to those ends.

These observations indicate that care should be taken when considering which type of output should be used. For example, all split and half of the trimmed outputs should not be used in applications trying to describe the full transcript structure, or distant exons coupling, as the long read connectivity is lost in many cases in these types of outputs.

Overall, no correction tool outperforms all the others across the metrics analysed in this section. However, hybrid correctors are systematically better than self-correctors at decreasing the error rate (and preserving the transcriptome diversity, as we will discuss in the next section). Trimming and splitting usually increase the read accuracy (and also mapping rate, as we see next), but decrease the total amount of bases in the read set and the mean read length, which can lead to loss of long-range information that was present in the raw reads.

Error correction facilitates mapping yet generally lowers the number of detected genes

Apart from HALC, LoRDEC, Proovread and daccord, for which only 85–92% of reads were mapped, corrected reads from all the other tools were mapped at a rate of 94–99%, a significant improvement over raw reads (mapping rate of 83.5%). We observe that these four tools with the lowest percentages of mapped reads had high mean read lengths, indicating that trimming, splitting or discarding reads seems necessary in order to obtain shorter but overall less error-prone reads. In general, for all tools (except pbdagcon), trimming and splitting increased the proportion of mapped reads and bases, sometimes significantly (e.g. Proovread). However, some tools that do not offer the option to trim or split reads, such as Canu and MECAT, showed very high mapping rates along with high mean read lengths and error rates. This is related to their aggressive post-filtering measures, which removed a significant portion of the reads (29–33%).

When verifying whether error correctors are able to preserve transcriptome diversity, we see a striking difference between hybrid and self-correctors: in general, hybrid correctors detect far more genes than the self-correcting ones. Interestingly, HALC was even able to increase the number of detected genes by 221 relative to the raw reads, indicating that some genes may previously have gone undetected due to imperfect mapping caused by the high error rate. We also found that 72 genes were detected in the raw reads but not in any of the error-corrected outputs. Furthermore, 354 genes are absent from the results of nearly all correction methods (≥16 out of 19).

Overall, all hybrid tools presented a satisfactory number of detected genes, except for NaS(μ), PBcR(s) and Proovread(t), while self-correctors did not present any satisfactory results, with Canu, LoRMA(s) and MECAT reducing the number of detected genes reported in raw reads by 35–59%. We can also note that trimming and splitting systematically resulted in a loss of the sensitivity to detect new genes. Moreover, except for HALC(s), the tools with a very high percentage of mapped reads (NaS(μ), PBcR(s), Proovread(t), Canu, LoRMA(s), MECAT, pbdagcon and pbdagcon(t)) had the largest losses in number of detected genes, hinting that error correction can reduce gene diversity in favour of a lower error rate, and/or that clusters of similar genes (e.g. paralogs) are corrected toward a single gene. Therefore, if preserving transcriptome diversity is required for the downstream application, self-correctors should be avoided altogether, along with some hybrid correctors (NaS(μ), PBcR(s) and Proovread(t)).


Detailed error rate analysis

The high error rate of transcriptomic long reads significantly complicates their primary analysis [49]. While the 'Error correction improves base accuracy and splits, trims, or entirely removes reads' section presented a general per-base error rate, this section breaks down sequencing errors into several types and examines how each error correction tool deals with them. A general, and expected, trend found across all tools and all types of errors is that trimming and splitting the reads result in fewer substitution, deletion and insertion errors. We will therefore focus on other aspects in this analysis. The data presented here is a compilation of AlignQC results. Note that AlignQC computed the following metrics only on reads that could be aligned; unaligned reads are thus not counted, even though they may well be the most erroneous ones. AlignQC also subsampled the aligned reads to around 1 million bases to calculate the presented values.


Deletions are the most problematic sequencing errors

Table 3 shows the error rate in the raw reads and in the corrected reads for each tool. In raw reads, deletions are the most prevalent type of error (7.41% of bases), closely followed by substitutions (5.11%), then insertions (1.2%). LoRDEC, LSC and LSC(t) are the least capable of correcting mismatches (>2% of them remaining), even though they are all hybrid tools. For LoRDEC, we were able to verify that this is related to the large amount of uncorrected reads in its output (90 k totally uncorrected reads out of 741 k, i.e. 12%), as computed by exactly matching raw reads to its corrected output. For LSC and LSC(t), we were unable to pinpoint a reason. The majority of the other hybrid tools (HALC, HALC(t), HALC(s), NaS(μ), PBcR(s), Proovread and Proovread(t)) result in less than 1% of substitution errors. Surprisingly, the non-hybrid tools also presented very low mismatch rates; all of them showed rates lower than 1%, except for Canu (1.33%) and daccord (1.1%). This suggests that the rate of systematic substitution errors in ONT data is low, as self-correctors were able to achieve results comparable to the hybrid ones, even without access to Illumina reads. Still, the three best performing tools were all hybrid (NaS(μ), PBcR(s) and Proovread(t)), which should therefore be preferred for applications that require very low mismatch rates.
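
The "exactly matching raw reads to its corrected output" check used for LoRDEC above can be sketched as a simple set membership test (the sequences here are hypothetical; a real run would stream the FASTA/FASTQ files):

```python
def count_uncorrected(raw_reads, corrected_reads):
    """Number of raw read sequences that appear verbatim in the corrected
    output, i.e. reads the tool left completely untouched."""
    corrected = set(corrected_reads)
    return sum(1 for seq in raw_reads if seq in corrected)

raw = ["ACGTACGT", "TTTTCCCC", "GGGAAATT"]
out = ["ACGTACGT", "TTTTCACC", "GGGAAAT"]  # only the first read is unchanged
print(count_uncorrected(raw, out))  # 1
```

On the real dataset this test reported 90 k of 741 k reads returned unchanged by LoRDEC, explaining its residual mismatch rate.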


  
Table 3

Error rate in the raw reads and in the corrected reads for each tool, on the 1D run RNA-seq dataset, computed from 1M random aligned bases.

(A) Hybrid tools
  Raw  HALC  HALC(t)  HALC(s)  LoRDEC  LoRDEC(t)  LoRDEC(s)  LSC  LSC(t)  NaS(μ)  PBcR(s)  Proovread  Proovread(t) 
Error rate  13.72%  1.85%  1.32%  0.44%  4.5%  3.73%  1.59%  5.45%  4.36%  0.38%  0.68%  2.65%  0.33% 
Mismatch  5.11%  0.79%  0.54%  0.22%  2.04%  1.76%  1.13%  2.35%  2.01%  0.2%  0.18%  0.93%  0.13% 
Deletion  7.41%  0.85%  0.64%  0.17%  2.15%  1.73%  0.39%  2.64%  1.94%  0.09%  0.3%  1.51%  0.18% 
Insertion  1.2%  0.21%  0.14%  0.05%  0.32%  0.24%  0.07%  0.47%  0.4%  0.08%  0.19%  0.22%  0.03% 
(B) Non-hybrid tools 
  Raw  Canu  daccord  daccord(t)  LoRMA(s)  MECAT  pbdagcon  pbdagcon(t) 
Error rate  13.72%  6.43%  5.2%  4.12%  2.91%  4.57%  5.65%  5.71% 
Mismatch  5.11%  1.33%  1.1%  0.67%  0.37%  0.33%  0.49%  0.49% 
Deletion  7.40%  4.82%  3.82%  3.27%  2.51%  4.18%  5.06%  5.17% 
Insertion  1.20%  0.28%  0.28%  0.19%  0.04%  0.06%  0.09%  0.05% 

The contrast between self and hybrid tools is more visible on deletion errors. In general, all hybrid tools outperformed the non-hybrid ones (the only exception is LSC (2.64%), with a higher deletion error rate than LoRMA(s) (2.51%)). Although among the hybrid ones LoRDEC (2.15%), LSC (2.64%), LSC(t) (1.94%) and Proovread (1.51%) still showed moderate rates of deletions, all the other seven corrected outputs were able to lower the deletion error rate from 7.4% to less than 1%; notably, HALC(s) and Proovread(t) lowered it to less than 0.2%, and NaS(μ) to less than 0.1%. All non-hybrid tools presented a high rate (3% or more) of deletion errors, except LoRMA(s) (2.51%). This comparison suggests that ONT reads exhibit systematic deletions, which cannot be well corrected without the help of Illumina data. The contribution of homopolymer errors will be specifically analysed in the 'Homopolymer insertions are overall better corrected than deletions' section. Considering insertion errors, all tools performed equally well. It is worth noting that several hybrid (HALC(s), LoRDEC(s), NaS(μ) and Proovread(t)) and non-hybrid tools (LoRMA(s), MECAT, pbdagcon and pbdagcon(t)) achieved sub-0.1% insertion error rates.

Overall, hybrid tools outperformed non-hybrid ones in terms of error rate reduction. However, the similar results obtained by both types of tools when correcting mismatches and insertions, and the contrast in correcting deletions, seem to indicate that the main advantage of hybrid correctors over self-correctors is the removal of systematic errors using Illumina data.

Homopolymer insertions are overall better corrected than deletions

In this section we further analyse homopolymer indels, i.e. insertion or deletion errors consisting of a stretch of the same nucleotide. Table 4 shows that homopolymer deletions are an order of magnitude more abundant in raw reads than homopolymer insertions. It is worth noting that, comparing the values for the raw reads in Tables 3 and 4, homopolymers are involved in 40% of all deletions and 31% of all insertions.
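
A minimal sketch of one way to flag an indel as homopolymeric (a simplified heuristic, not the exact criterion used by AlignQC): the inserted or deleted bases must be a run of a single nucleotide that extends an identical run adjacent to the event.

```python
def is_homopolymer_indel(indel_seq, left_flank, right_flank):
    """True if the indel is a stretch of one nucleotide that extends an
    identical run next to it (e.g. deleting 'AA' right after '...GA')."""
    bases = set(indel_seq)
    if len(bases) != 1:            # mixed bases: not a homopolymer event
        return False
    b = bases.pop()
    return left_flank.endswith(b) or right_flank.startswith(b)

print(is_homopolymer_indel("AA", "CCGA", "TTG"))  # True: extends the 'A' run
print(is_homopolymer_indel("AG", "CCA", "TTG"))   # False: two distinct bases
print(is_homopolymer_indel("T", "CCA", "GGG"))    # False: no adjacent 'T'
```

Under this kind of criterion, the deletion of one base inside `AAAA` counts as a homopolymer deletion, whereas deleting an isolated `T` between `C` and `G` does not.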

 
Table 4

Homopolymer error rate in the raw reads and in the corrected reads for each tool, on the 1D run RNA-seq dataset, computed from 1M random aligned bases.

(A) Hybrid tools
  Raw  HALC  HALC(t)  HALC(s)  LoRDEC  LoRDEC(t)  LoRDEC(s)  LSC  LSC(t)  NaS(μ)  PBcR(s)  Proovread  Proovread(t) 
Homop. deletion  2.96%  0.28%  0.19%  0.03%  0.77%  0.63%  0.19%  0.62%  0.42%  0.02%  0.1%  0.46%  0.04% 
Homop. insertion  0.38%  0.05%  0.03%  0.01%  0.09%  0.07%  0.02%  0.11%  0.09%  0.01%  0.02%  0.06%  0.01% 
(B) Non-hybrid tools 
  Raw  Canu  daccord  daccord(t)  LoRMA(s)  MECAT  pbdagcon  pbdagcon(t) 
Homop. deletion  2.96%  2.46%  2.14%  2.05%  1.82%  2.09%  2.26%  2.27% 
Homop. insertion  0.38%  0.08%  0.06%  0.03%  0.01%  0.01%  0.02%  0.01% 

A closer look at Table 4 reveals that hybrid error correctors outperform non-hybrid ones, as expected, mainly because homopolymer indels are likely systematic errors in ONT sequencing; hybrid correctors fix them using Illumina reads, which do not contain such biases. Moreover, all tools performed well on correcting homopolymer insertions, reducing the rate from 0.38% to less than 0.11%. In particular, the hybrid tools HALC(s), NaS(μ) and Proovread(t), as well as the non-hybrid ones LoRMA(s), MECAT and pbdagcon(t), reached a 0.01% homopolymer insertion error rate. Regarding homopolymer deletions, the majority of hybrid tools returned less than 0.5% of them, except LoRDEC (0.77%), LoRDEC(t) (0.63%) and LSC (0.62%). Notably, HALC(s), NaS(μ) and Proovread(t) presented less than 0.05% homopolymer deletion error rates. Non-hybrid tools performed more poorly, returning 1.8–2.4% of homopolymer deletion errors, only a small improvement over the raw reads.

HALC(s), NaS(μ) and Proovread(t) showed the best reduction of homopolymer indels. It is also worth noting that hybrid correctors are able to correct homopolymer deletions even better than non-homopolymer deletions. For instance, the ratio of homopolymer deletions over all deletions is 39.9% in raw reads; it decreases for all hybrid correctors except LoRDEC(s), dropping to 17.6% for HALC(s) and 22.2% for NaS(μ) and Proovread(t), but increases to at least 43.9% (pbdagcon(t)) and up to 72.5% (LoRMA(s)) in non-hybrid tools (see Supplementary Data Section S2).

Error correction perturbs the number of reads mapping to the genes and transcripts

Downstream RNA-sequencing analyses typically rely on the number of reads mapping to each gene and transcript for quantification, differential expression analysis, etc. In the rest of the paper, we define the coverage of a gene or a transcript as the number of reads mapping to it. For short, we will refer to those coverages as C_G and C_T, respectively. In this section we investigate whether the process of error correction can perturb C_G and C_T, which in turn would affect downstream analyses. Note that error correction could potentially slightly increase coverage, as uncorrected reads that were unmapped can become mappable after correction. Figure 1 shows C_G before and after correction for each tool.
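
Given primary read-to-gene assignments from an aligner, C_G is a plain tally; a minimal sketch with hypothetical read and gene names:

```python
from collections import Counter

def gene_coverage(read_to_gene):
    """C_G: number of reads mapping to each gene, from primary assignments."""
    return Counter(read_to_gene.values())

before = {"r1": "geneA", "r2": "geneA", "r3": "geneB"}
# after correction, r4 (previously unmappable) now maps to geneB
after = {"r1": "geneA", "r2": "geneA", "r3": "geneB", "r4": "geneB"}
print(gene_coverage(before))  # geneA: 2, geneB: 1
print(gene_coverage(after))   # geneA: 2, geneB: 2
```

Note that if a corrector splits one read into several fragments, each fragment gets its own entry and the tally is inflated, which is exactly the distortion discussed below for the split outputs.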

Figure 1

Number of reads mapping to genes (C_G) before and after correction for each tool. The genes taken into account here were expressed in either the raw dataset or after the correction by the given tool.

 

We can note, as expected, that splitting a long read into several well-corrected fragments generates multiple counts, skewing up the observed coverages (see HALC(s), LoRDEC(s), PBcR(s) and LoRMA(s) in Figure 1). Therefore, users are advised not to use this type of output for gene coverage estimation. Furthermore, apart from the split outputs, all the other tools presented good correlation and the expected slight increase in C_G due to better mapping, except for MECAT, which presented the lowest correlation and a significant drop in C_G.

All tools systematically presented a similar trend and lower correlation values on C_T (see Supplementary Data Section S3) in comparison to C_G. This is expected, as it is harder for a tool to correct a read into its true isoform than into its true gene. The behaviour of the tools at the isoform level is consistent with their behaviour at the gene level (C_G): split outputs inflate C_T, MECAT deflates it, and all the others present a slight increase.

Error correction perturbs gene family sizes

Table 2 indicates that error correction generally results in a lower number of detected genes. In this section we explore the impact of error correction on paralogous genes. By paralogous gene family, we denote a set of paralogs computed from Ensembl (see the 'Reference-based evaluation of long read error correction' section). Figure 2 represents the changes in size of paralogous gene families before and after correction for each tool, in terms of the number of genes expressed within a given family. Overall, error correctors do not strictly preserve the sizes of gene families. Correction more often shrinks families of paralogous genes than it expands them, likely due to erroneous correction at locations that differ between paralogs. In summary, 36–87% of gene families are kept at the same size by correctors, 1–17% are expanded and 6–61% are shrunk. Supplementary Data Figure S2 shows the magnitude of expansion/shrinkage for each gene family.
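
The expanded/same/shrunk classification above can be sketched as a set comparison per family (family and gene names below are hypothetical):

```python
def classify_families(families, expressed_before, expressed_after):
    """For each paralogous family, compare how many of its genes are
    expressed before vs after correction."""
    status = {}
    for name, genes in families.items():
        n_before = len(genes & expressed_before)
        n_after = len(genes & expressed_after)
        if n_after > n_before:
            status[name] = "expanded"
        elif n_after < n_before:
            status[name] = "shrunk"
        else:
            status[name] = "same"
    return status

families = {"famX": {"g1", "g2", "g3"}, "famY": {"g4", "g5"}}
before = {"g1", "g2", "g4", "g5"}   # genes with mapped reads in raw data
after = {"g1", "g4", "g5"}          # g2 no longer detected after correction
print(classify_families(families, before, after))
# {'famX': 'shrunk', 'famY': 'same'}
```

Here famX shrinks because reads from g2 were presumably corrected toward its paralog g1, the scenario hypothesised in the text.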

Figure 2

Summary of gene family size changes across error correction tools.

 

Error correction perturbs isoform diversity

We further investigated whether error correction introduces a bias toward the major isoform of each gene. Note that AlignQC does not directly address this question. To answer it, we computed the following metrics: number of isoforms detected in each gene before and after correction by alignment of reads to genes, coverage of lost isoforms in genes having at least two expressed isoforms and coverage of the major isoform before and after correction.

The number of isoforms varies before and after correction

Figure 3 shows the number of genes that have the same number of isoforms after correction, or a different number of isoforms (−3, −2, −1, +1, +2, +3). In this figure, only the genes that are expressed in both the raw and the corrected reads (for each tool) are taken into consideration. Negative (resp. positive) values indicate that isoforms were lost (resp. gained). We observe that a considerable number of genes (∼1.9 k for LoRDEC(s), LSC and PBcR(s), and ∼5.4 k for MECAT) lose at least one isoform with all tools, which suggests that current methods reduce isoform diversity during correction. NaS(μ), Proovread(t), Canu and MECAT tend to lose isoforms the most, while HALC(s), LoRDEC(s) and PBcR(s) identify the highest numbers of new isoforms after correction. It is however unclear whether these lost and new isoforms are real (present in the sample) or due to mapping ambiguity, as these three latter tools split corrected reads into shorter sequences that may map better to other isoforms. The effect of trimming, on the other hand, is generally slight. Overall, the number of isoforms is mostly unchanged with LoRDEC, LoRDEC(t) and LSC.

Figure 3

Histogram of genes having more or less isoforms after error correction.

 
Figure 4

Histogram of isoforms that are lost after correction, in relation to their relative transcript coverage, in genes that have two or more isoforms. The y-axis reflects the percentage of isoforms lost in each bin. Absolute values can be found in Supplementary Data Figure S3.

 
Figure 5

Coverage of the major isoform of each gene before and after error correction. The x-axis reflects the number of reads mapping to the major isoform of a gene before correction, and the y-axis after correction. Blue line, regression; black line, x = y.

 

Multi-isoform genes tend to lose lowly expressed isoforms after correction

Figure 4 explores the relative coverage of isoforms that were possibly lost after correction, in genes having two or more expressed isoforms. The relative coverage of a transcript is the number of raw reads mapping to it divided by the total number of raw reads mapping to its gene. Only the genes that are expressed in both the raw and the error-corrected reads (for each tool) are taken into consideration here. We anticipated that raw reads mapping to a minor isoform are typically either discarded by the corrector, or modified in such a way that they now map to a different isoform, possibly the major one. The effect is indeed relatively similar across all correctors, except for MECAT, which tends to remove a higher fraction of minor isoforms, and LoRDEC and LSC, which tend to be the most conservative. We can also note that trimming and splitting reads increase isoform losses even further in all tools, except for pbdagcon. This can be explained by the fact that lowly expressed isoforms possibly share regions (e.g. common exons) with highly expressed isoforms, and these shared regions are usually better corrected than regions that are unique to the lowly expressed isoforms. If read splitting then takes place, such unique regions will be removed from the output. Even if there are variations between a highly expressed isoform I and a lowly expressed isoform i, if these variations are relatively small (e.g. a small exon skipping) and are flanked by long shared regions, it is probable that the methods will truncate the variation, correcting the unique fragment of i into I and potentially losing the signal that i is expressed (this is explored in detail in the 'Correction towards the major isoform is more prevalent when the alternative exon is small' section). This result suggests that current error correction tools overall do not conservatively handle reads that belong to low-expression isoforms.
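
The relative coverage defined above can be sketched as follows, with hypothetical per-(gene, transcript) read counts:

```python
def relative_coverage(transcript_counts):
    """Relative coverage of each transcript: reads mapping to the transcript
    divided by the total reads mapping to its gene.
    Input: {(gene, transcript): read_count}."""
    gene_totals = {}
    for (gene, _), n in transcript_counts.items():
        gene_totals[gene] = gene_totals.get(gene, 0) + n
    return {t: n / gene_totals[g] for (g, t), n in transcript_counts.items()}

raw = {("geneA", "iso1"): 90, ("geneA", "iso2"): 10}
rel = relative_coverage(raw)
print(rel)  # iso1: 0.9, iso2: 0.1 -> iso2 is a minor isoform, at risk of loss
```

Binning lost isoforms by this relative coverage is what produces the histogram in Figure 4.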

Minor isoforms are corrected toward major isoforms

We define the major isoform of a gene as the isoform with the highest coverage for that gene in the raw dataset; all other isoforms are considered minor. To follow up on the previous subsection, we investigate whether correctors tend to correct minor isoforms toward major isoforms. We do so by comparing the difference in coverage of the major and the minor isoforms before and after correction. In Figure 5, we observe that the coverage of the major isoform generally slightly increases after correction. The exceptions are the tools that split reads (HALC(s), LoRDEC(s), PBcR(s) and LoRMA(s)), where the coverage increases significantly, and MECAT, where the coverage decreases significantly, likely due to a feature of MECAT’s own correction algorithm. Since these five tools seem to heavily distort the coverage of isoforms due to aggressive splitting or filtering steps, we focus now on the 14 other results. The slight increase of transcript coverage after correction is expected, as already discussed in the 'Error correction perturbs the number of reads mapping to the genes and transcripts' section: uncorrected reads that were unmapped can become mappable after correction. Therefore, the effect presented in Figure 5 could simply be due to reads being corrected to their original respective isoforms, rather than correction inducing a switch from a minor isoform to the major isoform. To verify this hypothesis, Supplementary Data Figure S4 shows that the coverage of the minor isoforms usually ‘decreases’ after correction (R² ∈ [0.5; 0.8]), except for (i) tools that split reads (HALC(s), LoRDEC(s), PBcR(s) and LoRMA(s)), which skew the coverage of minor isoforms even more, and (ii) both HALC and HALC(t). This indicates that error correction tools tend to correct reads toward the major isoforms. It is worth noting that the increase in coverage of the major isoform is not pronounced. This is expected, as the sum of the expression of the minor isoforms is, by nature, a small fraction of the total gene expression. On the other hand, the correlations of the coverage of the minor isoforms before and after correction are far weaker, suggesting a stronger effect. It is noteworthy that correction biases with respect to the major isoform do not appear to be specific to self-correctors or to hybrid correctors, but rather occur in both types of correctors.

Figure 6

Mapping of simulated raw and error-corrected reads to two simulated isoforms, and measurements of the percentage of reads mapping to the major isoform. The two isoforms represent an alternatively skipped exon of variable size 10, 50 and 100 bp. Left: isoform structure conservation using 100× short-read coverage and 10× long reads, using three error correction programs, one per row: LoRDEC, PBcR and Proovread. Right: same with three self-correctors and 100× long reads: daccord, LoRMA and pbdagcon. Columns are alternative exon sizes. Bars are plotted for each isoform ratio (50%, 75% and 90%) on the x-axis. On the y-axis, the closer a bar is to its corresponding ratio value on the x-axis, the better. For instance, the bottom left light blue bar corresponds to a 50% isoform ratio with an exon of size 10, and we do not retrieve a 50% ratio after correction with Proovread (the bar does not go up to 50% on the vertical axis, but to around 75% instead). The same layout applies to the right plot, where self-correctors are presented.

 

Correction towards the major isoform is more prevalent when the alternative exon is small

In order to observe whether particular features of alternative splicing have an impact on error correction methods, we designed a simulation over two controlled parameters: skipped exon length and isoform relative expression ratio. Using a single gene, we created a mixture of two simulated alternative transcripts: one constitutive, one exon-skipping. Several simulated read datasets were created with various relative abundances between the major and minor isoform (in order to model a local differential in splicing isoform expression) and various sizes of the skipped exon. Due to the artificial nature and small size of the datasets, many of the error correction methods could not be run. We thus tested these scenarios on a subset of the correction methods.
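
The setup above can be sketched as follows (a simplified illustration, not the paper's actual simulator: flank lengths are arbitrary, reads are full-length and error-free, whereas the real simulation layers sequencing errors on top):

```python
import random

def make_isoforms(exon_size, rng):
    """Two transcripts from one gene: a constitutive form containing the
    middle exon and an exon-skipping form without it (300 bp flanking exons)."""
    flank1 = "".join(rng.choice("ACGT") for _ in range(300))
    exon = "".join(rng.choice("ACGT") for _ in range(exon_size))
    flank2 = "".join(rng.choice("ACGT") for _ in range(300))
    return flank1 + exon + flank2, flank1 + flank2

def sample_reads(major, minor, major_ratio, n, rng):
    """Draw n reads at the requested isoform mixture ratio."""
    return [major if rng.random() < major_ratio else minor for _ in range(n)]

rng = random.Random(42)
inclusion, skipping = make_isoforms(50, rng)
reads = sample_reads(inclusion, skipping, 0.9, 1000, rng)
print(len(inclusion) - len(skipping))                    # 50: the skipped exon
print(sum(r == inclusion for r in reads) / len(reads))   # close to 0.9
```

Varying `exon_size` over {10, 50, 100} and `major_ratio` over {0.5, 0.75, 0.9} reproduces the grid of conditions shown in Figure 6; the evaluation then asks whether the post-correction ratio still matches the simulated one.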

In Figure 6, we distinguish results from hybrid and self-correctors, presented with, respectively, 100× coverage of short reads and 100× coverage of long reads, and with only 100× coverage of long reads. Results on shallower coverage (10×) and the impact of simulation parameters on corrected read sizes are presented in Supplementary Data Sections S7 and S8. Overall, hybrid correctors are less impacted by isoform collapsing than self-correctors. LoRDEC shows the best capacity to preserve isoforms in the presence of alternatively skipped exons. Thus, regardless of the abundance of inclusion reads in the dataset to be corrected, 99% of reads from the inclusion form are corrected to the inclusion form for an exon size of 10, and 100% for exon sizes of 50 and 100. However, with less coverage, e.g. due to lowly expressed genes and rare transcripts, all tools tend to mis-estimate the expression of isoforms (see Supplementary Data Sections S7 and S8). Self-correctors generally have a minimum coverage threshold (only daccord could be run on the 10× coverage dataset of long reads, with rather erratic results; see Supplementary Data Section S8). Even with higher coverage, not all correctors manage to correct this simple instance. Among all correctors, only LoRDEC seems to report the expected number of each isoform consistently in all scenarios. We could not derive any clear trend concerning the relative isoform ratios, even if the 90% ratio seems to favour overcorrection toward the major isoform. Skipped exon length seems to impact both hybrid and self-correctors, with small exons being the harder challenge for correctors.

Error correction affects splice site detection

The identification of splice sites from RNA-seq data is an important but challenging task [50]. When mapping reads to a (possibly annotated) reference genome, mapping algorithms typically guide spliced alignments using either a custom scoring function that takes into account common splice site patterns (e.g. GT-AG), and/or a database of known junctions. With long reads, the high error rate makes precise splice site detection even more challenging, as indels (see the 'Detailed error rate analysis' section) confuse aligners, shifting predicted spliced alignments away from the true splice sites.

Figure 7

Statistics on the correctly and incorrectly mapped splice sites (abbreviated SSs) for the uncorrected (raw) and corrected reads.

Table 5

Running time and memory usage of error correction tools on the 1D run RNA-seq dataset

  HALC  LoRDEC  LSC  NaS¹  PBcR²  Proovread  Canu  daccord³  daccord trimmed³  LoRMA⁴  MECAT  pbdagcon³  pbdagcon trimmed³ 
Running time  3.6 h  2.4 h  99.5 h  63.2 h  116 h  107.1 h  0.7 h  6.9 h (7.4 h)  6.6 h (7.1 h)  3.4 h  0.3 h  5.7 h (6.2 h)  5.6h (6.1h) 
Memory usage  60 GB  5.6 GB  2.7 GB  3 GB  166.5 GB  53.6 GB  2.2 GB  6.9 GB (27.2 GB)  6.8 GB (27.2 GB)  79 GB  9.9 GB  6.4 GB (27.2 GB)  6.4 GB (27.2 GB) 

¹NaS was run in batches on a different system (the TGCC cluster) than the other tools; its total running time was estimated based on a subset of batches. ²PBcR was run on a different machine from the others. ³daccord and pbdagcon need DAZZ_DB and DALIGNER to be run before performing their correction. DAZZ_DB execution time and memory usage were disregarded as negligible. DALIGNER, however, took 0.5 h and 27.2 GB of RAM. The runtime in parentheses denotes the runtime of the tool plus DALIGNER. The memory usage in parentheses denotes the maximum memory usage between the tool and DALIGNER. ⁴LoRMA used more than its allocated 32 cores during some (short) periods of the run.

In this section, we evaluate how well splice sites are detected before and after error correction. Figure 7 shows the number of correctly and incorrectly mapped splice sites for the raw and corrected reads, as computed by AlignQC. One would expect a splice site to be correctly detected when little to no errors are present in the reads mapping around it. Thus, as expected, the hybrid error correction tools present a clear advantage over the non-hybrid ones, as they better decrease the per-base error rate. In the uncorrected reads, 27% of the splice sites were incorrectly mapped, which is brought down to less than 1.2% in eight hybrid corrected outputs: HALC, HALC(t), HALC(s), LoRDEC(s), NaS(μ), PBcR(s), Proovread and Proovread(t). Notably, Proovread(t) presented only 0.28% incorrectly mapped splice sites. LoRDEC (2.43%) and LoRDEC(t) (2.12%) presented higher, but still manageable, rates, while LSC (7.27%) and LSC(t) (5.68%) underperformed among the hybrid correctors. Among self-correction tools, LoRMA presented the lowest proportion of incorrectly detected splice sites (3.04%); however, it detects ∼6.7 times fewer splice sites (∼280 k) than the raw reads (∼1.9 M), due to read splitting. The other non-hybrid tools incorrectly detected splice sites at rates between 5.61% (daccord(t)) and 11.95% (Canu). It is worth noting that trimming usually decreased the proportion of incorrectly mapped splice sites, with only a very slight impact on the total number of identified splice sites. On the other hand, the three tools with the lowest numbers of identified splice sites are those that output split reads (LoRDEC(s), PBcR(s) and LoRMA(s)), identifying fewer than ∼1.1 M splice sites, compared to ∼1.9 M in the raw reads, and are thus not adequate for splice site analyses. Additional detailed plots on incorrectly mapped splice sites can be found in Supplementary Data Section S9.
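
The correct/incorrect classification of splice sites can be sketched as a comparison of predicted junction coordinates against annotated ones (a simplified stand-in for AlignQC's logic; the optional `tolerance` parameter is an assumption added for illustration):

```python
def classify_splice_sites(predicted, annotated, tolerance=0):
    """Count predicted splice sites matching an annotated site exactly
    (or within `tolerance` bp); the rest count as incorrectly mapped.
    Sites are (chromosome, position) pairs."""
    correct = 0
    for chrom, pos in predicted:
        if any(c == chrom and abs(pos - p) <= tolerance for c, p in annotated):
            correct += 1
    return correct, len(predicted) - correct

annotated = [("chr1", 1000), ("chr1", 2500)]
predicted = [("chr1", 1000), ("chr1", 2497), ("chr2", 300)]
print(classify_splice_sites(predicted, annotated))      # (1, 2)
print(classify_splice_sites(predicted, annotated, 5))   # (2, 1): 2497 is within 5 bp
```

The 3 bp offset of the second predicted site illustrates how residual indels near a junction shift the spliced alignment slightly away from the true splice site.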

Running time and memory usage of error correction tools

Table 5 shows the running time and memory usage of all evaluated tools, measured using GNU time. The running time shown is the elapsed wall-clock time (in hours) and the memory usage is the maximum resident set size (in gigabytes). All tools were run with 32 threads. Overall, all tools were able to correct the dataset within 7 hours, except for LSC, NaS, PBcR and Proovread, which took 63–116 hours but also achieved some of the lowest post-correction error rates in Table 2 (except for LSC). In terms of memory usage, all tools required less than 10 GB of memory except for HALC, PBcR, Proovread and LoRMA, which required 53–166 GB. It is worth noting, however, that hybrid error correctors have to process massive Illumina datasets, which contributes to their higher CPU and memory usage during correction.
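
The two quantities reported in Table 5 (wall-clock time and maximum resident set size, as reported by GNU time) can also be captured programmatically; a minimal sketch using Python's standard library (note that on Linux `ru_maxrss` is reported in kilobytes, and `RUSAGE_CHILDREN` covers all waited-for child processes, not just the last one):

```python
import resource
import subprocess
import sys
import time

def run_and_measure(cmd):
    """Run a child command; return its wall-clock time (seconds) and the
    peak resident set size of waited-for children (kB on Linux)."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall, peak_kb

# measure a small child process that allocates ~a million integers
wall, peak_kb = run_and_measure([sys.executable, "-c", "x = list(range(10**6))"])
print(f"wall: {wall:.2f} s, peak RSS: {peak_kb} kB")
```

This mirrors what `/usr/bin/time -v tool args...` reports as "Elapsed (wall clock) time" and "Maximum resident set size".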


Using a different read aligner mildly but not significantly affects the evaluation

We chose GMAP (version 2017-05-08 with parameters -n 10) [51] to map the long reads to the Ensembl r87 Mus musculus unmasked reference genome in our analysis, since [49] shows that it produces the best alignment results among five alignment tools. The GMAP parameters are those from the original AlignQC publication [15]. However, Minimap2 [52] is not evaluated in [49], and it is also widely used, being the default long-read mapper in several studies. In this subsection, we verify to what extent the differences between GMAP and Minimap2 can influence our evaluation. To highlight such differences, we chose correctors with the worst and best performance on some of the measures presented in the previous analysis (made with GMAP). We thus further mapped with Minimap2 (version 2.14-r883 with parameters -ax splice) the following read datasets: (i) raw reads; (ii) LoRDEC, the corrector with the lowest proportion of mapped reads, but which preserves the transcriptome diversity well; (iii) LoRMA(s), the tool with the highest ratio of mapped reads and highest number of reads, but with the lowest mean read length, number of detected genes and number of bases; (iv) Proovread(t), the method with the lowest error rate and highest rate of correctly mapped splice sites; and (v) Canu, the corrector with the highest error rate and lowest rate of correctly mapped splice sites.


Table 6

GMAP and Minimap2 results across a selection of metrics for the raw, LoRDEC, LoRMA(s), Proovread(t), and Canu reads. (G) indicates GMAP results, and (M2) Minimap2 results.

Metric  raw(G)  raw(M2)  LoRDEC(G)  LoRDEC(M2)  LoRMA(s)(G)  LoRMA(s)(M2)  Proovread(t)(G)  Proovread(t)(M2)  Canu(G)  Canu(M2)
mapped reads  83.5%  86.8%  85.5%  87.4%  98.9%  99.8%  99.4%  99.1%  99.1%  100.0%d
mapped basesa 89.0%  91.0%  90.6%  92.6%  99.5%  99.8%  99.2%  99.3%  92.0%  93.2% 
error rateb 14.11%  13.79%  4.67%  4.65%  0.33%  0.31%  2.9%  2.73%  6.2%  5.97% 
mismatches  5.26%  3.98%  2.15%  1.89%  0.15%  0.11%  0.37%  0.22%  1.22%  0.85% 
deletions  7.58%  8.09%  2.17%  2.29%  0.14%  0.17%  2.51%  2.46%  4.72%  4.79% 
insertions  1.26%  1.73%  0.34%  0.47%  0.04%  0.03%  0.03%  0.05%  0.26%  0.32% 
nb detected genes  16.8 k  16.2 k  16.8 k  16.0 k  14.6 k  14.2 k  6.8 k  6.3 k  12.4 k  11.7 k 
correctly mapped ssc 72.97%  91.46%  97.56%  98.69%  99.71%  99.58%  96.92%  98.07%  88.03%  94.67% 

a As reported by AlignQC: percentage of bases aligned among mapped reads, obtained by counting the M parts of the CIGAR strings in the BAM file; bases in unmapped reads are not counted. b As reported by AlignQC, using a sample of 1 million bases from aligned read segments. c ss stands for splice sites. d 100.0% is obtained due to rounding; a more precise mapped reads rate is 99.988%.

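
Footnote a above defines the mapped-bases rate via the M parts of CIGAR strings. A minimal sketch of that counting step (illustrative; not AlignQC's actual implementation):

```python
import re

# Each CIGAR element is a length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def mapped_bases(cigar):
    """Sum the lengths of the M (alignment match) operations of a
    CIGAR string; clipped bases, introns (N) and indels contribute
    nothing to the count."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "M")

# A spliced alignment: 50 + 30 aligned bases; the 200 nt intron (N)
# and the 5 soft-clipped bases are not counted.
assert mapped_bases("50M200N30M5S") == 80
```

Summing this quantity over all mapped reads and dividing by their total length gives the mapped-bases rate shown in Table 6.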

We present in Table 6 the main differences between Minimap2 and GMAP on the aforementioned correctors. We can note that Minimap2 is able to map more reads than GMAP across all tools but Proovread(t). However, when both mappers are able to map almost all reads (≥98.9%), the mapping rate difference between them is low (<1%). The largest differences occur when both mappers do not perform well, mapping less than 90% of the reads. In this case, Minimap2 does a better job, mapping 3.3% and 1.9% more reads than GMAP in the raw and LoRDEC reads, respectively, which is noticeable but not large. A similar conclusion can be derived by looking at the mapped bases rate. The reported error rate is very similar between both mappers, with the largest difference being 0.32% in the raw reads. There is a noticeable difference, however, when we break down the errors by type: GMAP reported more mismatches, while Minimap2 reported more deletions and insertions. Nevertheless, these differences are only noticeable in the raw reads. The measure in which GMAP clearly outperformed Minimap2 is the number of detected genes, identifying between 453 (Proovread(t)) and 723 (LoRDEC) more genes, which can be considered significant. On the other hand, Minimap2 was considerably better than GMAP at mapping splice sites correctly. The difference between both mappers when they performed well, correctly mapping more than 96.9% of the splice sites, was not significant (≤1.15%). The largest discrepancies can be seen when mapping the splice sites of the raw and Canu reads, with Minimap2 correctly aligning 18.49% and 6.64% more splice sites than GMAP, respectively.

We can thus conclude that Minimap2 was slightly better at mapping reads and bases, and significantly better at correctly mapping splice sites. On the other hand, GMAP was considerably better at detecting genes. The error rates of the reads mapped by both tools are very similar, with GMAP reporting slightly more mismatches, and Minimap2 reporting slightly more deletions and insertions. However, these differences are mainly concentrated in the raw reads dataset, which is the least accurate.

The data shown and discussed in this section suggest that the main takeaway points from the results presented in the previous sections should not change when GMAP is replaced by Minimap2. The full report of the analysis of both mappers, containing further details, can be found in the support page of our method: https://leoisl.gitlab.io/LR_EC_analyser_support/.


Analysing human Nanopore direct RNA-sequencing data

We further analysed a human Nanopore ‘direct RNA’-sequencing dataset from the Nanopore-WGS-Consortium (see Section Evaluation datasets for details). Since there was no corresponding Illumina sequencing for this dataset, we were able to evaluate only the non-hybrid error correction tools. Moreover, although LoRMA could be successfully executed, AlignQC could not process its output, so we removed LoRMA from the evaluation. Table 7 presents the main statistics of the non-hybrid error correction tools on this dataset. In the rest of this section, we highlight the major differences and similarities between the cDNA and direct RNA datasets, keeping in mind that they come from two different species (mouse and human, respectively).


Table 7

Statistics of non-hybrid error correction tools on one human Nanopore direct RNA-sequencing data from the Nanopore-WGS-Consortium

Metric  Raw  Canu  daccord  daccord(t)  MECAT  pbdagcon  pbdagcon(t)
nb reads  894 k  533 k  744 k  792 k  224 k  772 k  771 k 
mapped reads  76.0%  99.2%  97.5%  98.2%  99.2%  98.6%  98.5% 
mean length  739  917  849  755  1095  666  675 
nb of bases  587 M  484 M  620 M  591 M  243 M  510 M  516 M 
error ratea 14.61%  7.61%  6.03%  5.47%  6.36%  8.1%  8.26% 
mismatches  5.79%  2.12%  1.56%  1.3%  0.7%  1.19%  1.18% 
deletions  5.95%  4.77%  4.13%  3.9%  5.5%  6.79%  6.94% 
insertions  2.87%  0.72%  0.34%  0.27%  0.17%  0.12%  0.14% 
homop. deletions  2.18%  2.01%  1.84%  1.81%  2.24%  2.18%  2.22% 
homop. insertions  1.13%  0.3%  0.14%  0.11%  0.09%  0.06%  0.07% 
nb detected genes  14.1 k  8.8 k  12.3 k  10.8 k  6.3 k  10.0 k  10.1 k 
correctly mapped ssb 76.95%  87.97%  92.52%  93.94%  92.86%  91.16%  91.14% 

a As reported by AlignQC, using a sample of 1 million bases from aligned read segments. b ss stands for splice sites.

In general, self-correctors discarded more reads on the direct RNA dataset than on the 1D cDNA dataset. Daccord(t) discarded the fewest reads (102 k), while Canu and MECAT discarded a considerable number of reads (361 k and 670 k, respectively), due to post-correction filtering. Due to our choice of parameters, the shortest reads in the Canu and MECAT outputs were of lengths 101 and 100 bases, respectively. However, the removal of shorter reads explains only a fraction of the read losses, as the raw dataset contains only ∼59 k reads shorter than 101 bases. GMAP mapped a smaller percentage of raw reads in the direct RNA dataset (76% versus 83.5% for the 1D cDNA dataset), but on the other hand, the mapping rate of error-corrected reads was generally higher. Notably, 97.5% of the daccord reads (resp. 98.2% for daccord(t)) were mapped, as opposed to 92.5% (resp. 94%) in the 1D cDNA dataset. The mean length of the corrected reads, relative to the mean length of the raw reads, was also generally higher in the direct RNA dataset; as a result, all tools except MECAT output a number of bases closer to the number of bases in the raw reads.

The error rate of the raw direct RNA reads was 14.61%, 0.89% higher than in the raw 1D cDNA reads. As expected, the error rates of all tools were also higher on the direct RNA dataset, leading to worse results. The largest difference is with pbdagcon(t), whose error rate after correction of the direct RNA reads is 2.55% higher than on the 1D cDNA dataset. The distribution of errors in the direct RNA dataset was more balanced, with mismatches and deletions having almost the same representation (around 5.85%), but insertions still being less represented (2.87%). The correction behaviour of the tools is similar across both datasets: insertions are the best-corrected type of error, followed by mismatches, with satisfactory results, and then deletions, for which the methods overall did a poor job. In particular, pbdagcon and pbdagcon(t) even ‘increased’ the deletion error rate, by 0.84% and 0.99%, respectively. The behaviour was similar when correcting homopolymer errors: homopolymer deletions were poorly corrected, with MECAT, pbdagcon and pbdagcon(t) not reducing the homopolymer deletion error rate at all, while homopolymer insertions were well corrected. The number of detected genes in the raw direct RNA dataset (14.1 k) is lower than in the raw 1D cDNA dataset (16.8 k), although this is consistent with the difference in human/mouse gene counts. Moreover, the tools also lose more genes in the direct RNA dataset. In particular, MECAT loses 6.4 k genes in the 1D cDNA dataset, and 7.8 k in the direct RNA dataset. The rate of correctly mapped splice sites was slightly higher in the raw direct RNA reads (76.95% versus 72.94%), but in the error-corrected reads, this rate was highly similar (the largest difference being 0.44%, in the daccord(t) correction).

As a result of our evaluation, and in accordance with the cDNA analysis, care should be taken when applying self-correctors to remove errors from Nanopore direct RNA-seq data. For example, Canu and MECAT present the undesirable side effect of discarding many input reads, thus reducing the amount of information in the long reads and considerably decreasing the number of detected genes. Although the tools perform well at correcting mismatches and insertions, they have trouble correcting deletions. In particular, MECAT, pbdagcon and pbdagcon(t) perform rather poorly, with the last two even increasing the deletion error rate. Nonetheless, all tools manage to decrease the error rate so that the large majority of reads are mappable, increasing the proportion of mapped reads from 76% in the raw reads to at least 97.5%. If a very low error rate (<1%) must be achieved due to the requirements of a downstream application, then using hybrid correction tools seems to be a must.

The full report of the analysis output by our method on this dataset, containing further details, can be found at https://leoisl.gitlab.io/LR_EC_analyser_support/.


Discussion

This work sheds light on the versatility of long-read DNA error correction methods, which can also be successfully applied to the error correction of Nanopore RNA-sequencing data. In our tests, error rates can be reduced from 13.7% in the original reads down to as low as 0.3% in the corrected reads. This is perhaps an unsurprising realization, as the error correction of RNA-sequencing data presents similarities with that of DNA-sequencing data; however, a collection of caveats is described in the Results section. Most importantly, the number of genes detected by alignment of corrected reads to the genome was reduced significantly by several error correction methods, mainly the non-hybrid ones. Furthermore, depending on the method, error correction results have a more or less pronounced bias toward correction to the major isoform of each gene, jointly with a loss of the most lowly expressed isoforms. We provide software that enables automatic benchmarking of long-read RNA-sequencing error correction tools, in the hope that future error correction methods will take advantage of it to avoid biases.

Detailed error rate analysis showed that while hybrid correctors have lower error rates than self-correctors, the latter achieved performance comparable to the former in correcting substitutions and insertions. Deletions appear to be caused by systematic sequencing errors of the Nanopore technology, making them fundamentally hard (or even impossible) to address in a self-correction setting. Moreover, PBcR, NaS and Proovread are among the most resource-intensive error correction tools, but they are also among the few correctors able to reduce the base error rate below 0.7%. The only exception to this is HALC, which presents a low running time and a <0.5% error rate in its split output.

We observe that hybrid correctors were able to better preserve the number of detected genes than self-correctors. The large majority of the hybrid corrections (9/12) were able to identify a number of genes similar to the raw reads, with only NaS(μ), PBcR(s) and Proovread(t) being less sensitive, but still obtaining reasonable results. On the other hand, daccord was the only self-correction tool that reached the gene identification level of the three aforementioned hybrid tools, while the others heavily truncated the transcriptome diversity. HALC, LoRDEC, LSC, Proovread (only in full-length mode) and daccord (only in full-length mode) also appear to preserve the number of detected isoforms better than other correctors (Section The number of isoforms varies before and after correction). All tools tend to lose lowly expressed isoforms after correction (Section Multi-isoform genes tend to lose lowly expressed isoforms after correction). Several tools also tend to correct minor isoforms toward major isoforms (Section Minor isoforms are corrected toward major isoforms), mainly when the variation between them is small (Section Correction towards the major isoform is more prevalent when the alternative exon is small). These points are expected, as most tools were mainly tailored to process DNA data, where heterogeneous coverage is not expected. Furthermore, hybrid correctors outperformed self-correctors in the correction of errors near splice site junctions (Section Error correction affects splice site detection).

As a result, we conclude that no evaluated corrector outperforms all the others across all metrics and is the most suited in all situations; the choice should be guided by the downstream analysis, yet hybrid correction tools generally outperformed the self-correctors. For quantification, we have shown that error correction introduces undesirable coverage biases, as per Section Error correction perturbs the number of reads mapping to the genes and transcripts; therefore we would recommend avoiding this step altogether. For isoform detection, HALC, LoRDEC, LSC and Proovread (only in full-length mode) appear to be the methods of choice, as they result in the highest numbers of detected genes in Table 2 and also preserve the number of detected isoforms as per Section Error correction perturbs isoform diversity. If Illumina reads are not available, then daccord (only in full-length mode) seems to be the best choice. For splice site detection, we recommend using hybrid correctors, preferably HALC, NaS, PBcR or Proovread, as per Section Error correction affects splice site detection. The same four tools (with HALC in split mode and Proovread in trimmed mode) are also recommended if downstream analyses require a very low general error rate. Finally, for all other applications, we make some general recommendations. A reasonable balance appears to be achieved by the tools with no unsatisfactory values in Table 2: HALC(t), NaS(μ) and Proovread(t). If Illumina reads are unavailable, then the best overall self-correctors seem to be daccord(t), pbdagcon and pbdagcon(t). Moreover, trimming and splitting usually increase the mapping rate and the read accuracy, but can lead to loss of information that was present in the raw reads, complicating the correct identification of genes.

Our analyses relied on a single mapping software (GMAP [51]) to align raw and error-corrected reads, as in previous benchmarks [21, 49]. However, we were also able to verify that Minimap2 [52], another widely used mapper, produces similar results to GMAP (see Section Using a different read aligner mildly but not significantly affects the evaluation), and thus the main messages of the analyses presented in this paper should not change when replacing GMAP with Minimap2.

As a side note, AlignQC reports that the raw reads contained 1% of chimeric reads, i.e. reads with portions that align to different loci, or to overlapping loci. The proportion of chimeric reads after error correction remains in the 0.7–1.3% range, except for LoRDEC(s) (0.2%), PBcR(s) (0.1%), Proovread(t) (0.1%), LoRMA(s) (0.04%) and MECAT (0.2%), which either correctly split reads or discarded chimeric ones. We observe that HALC (4.2%), HALC(t) (3.9%) and daccord (1.7%) considerably increased the proportion of chimeric reads in the output.

Furthermore, we focused our evaluation on a single technology: Nanopore. We did an extensive analysis of 1D cDNA Nanopore data, using Illumina data for hybrid correction. We also performed a brief analysis of Nanopore direct RNA-seq data. While it would be natural to also evaluate PacBio data, we note that data from the PacBio Iso-Seq protocol is of a different nature, as the reads are pre-corrected by circular consensus.

In the evaluation of tools, we did not record the disk space used by each method, yet we note that it may be a critical factor for some tools (e.g. Canu) on larger datasets. We note also that genes with low Illumina coverage are unlikely to be well corrected by hybrid correctors. Therefore, our comparison does not take into account differences in coverage biases between Illumina and Nanopore data, which may benefit self-correctors. Finally, transcript and gene coverages are derived from the number of long reads aligning to a given gene/transcript. This method makes it possible to directly relate the results of error correction to transcript/gene counts, but we note that in current RNA-seq analysis protocols, transcript/gene expression is still generally evaluated using short reads.


Methods

Nanopore library preparation and sequencing

For RNA MinION sequencing, cDNA was prepared from four aliquots (250 ng each) of commercial mouse total RNA (brain; Clontech, Cat# 636601), according to the Oxford Nanopore Technologies (Oxford Nanopore Technologies Ltd, Oxford, UK) protocol ‘1D cDNA by ligation (SQK-LSK108)’. The data generated by the MinION software (MinKNOW 1.1.21, Metrichor 2.43.1) were stored and organized using a Hierarchical Data Format. FASTA reads were extracted from the MinION HDF files using poretools [53]. We obtained 1 256 967 Nanopore 1D reads representing around 2 Gbp of data, with an average size of 1650 bp and an N50 of 1885 bp. These reads were then filtered in silico to remove mtRNA and rRNA using BLAT [54] and est2genome [55], yielding 740 776 long reads.


Illumina library preparation and sequencing

RNA-Seq library preparations were carried out from 500 ng total RNA using the TruSeq Stranded mRNA kit (Illumina, San Diego, CA, USA), which allows mRNA strand orientation (sequence reads occur in the same orientation as anti-sense RNA). After quantification by qPCR, each library was sequenced using 151 bp paired end reads chemistry on a HiSeq4000 Illumina sequencer.


Reference-based evaluation of long read error correction

A tool named LR_EC_analyser, available at https://gitlab.com/leoisl/LR_EC_analyser, was developed in Python to analyse the output of long-read error correctors. The required arguments are the BAM files of the raw and corrected reads aligned to a reference annotated genome, as well as the reference genome in FASTA format and the reference annotation in GTF format. A file specifying the paralogous gene families can also be provided if plots on gene families are to be created. In our main analysis, gene families were computed by selecting all paralogs from Ensembl r87 mouse genes with 80%+ identity. Note that paralogs from the same family may have significantly different lengths, and no threshold was applied with respect to coverage. The complete selection procedure is reported here: https://gitlab.com/leoisl/LR_EC_analyser/blob/master/GettingParalogs.txt. The main processing of our method involves running the AlignQC software [21] (https://github.com/jason-weirather/AlignQC) on the input BAMs and parsing its output to create custom plots. It then aggregates the information into an HTML report. For example, Tables 2–4 are compilations from AlignQC results, as are Tables 6 and 7 and Figure 7; Figures 1–5 were created by processing text files built by AlignQC, called ‘Raw data’ in its output. In addition, an in-depth gene and transcript analysis can be performed using the IGV.js library (https://github.com/igvteam/igv.js) [56, 57]. In this paper, we did not include all plots and tables created by the tool. To visualize the full latest reports, visit https://leoisl.gitlab.io/LR_EC_analyser_support/.


Simulation framework for biases evaluation

In the simulation framework of Section Correction towards the major isoform is more prevalent when the alternative exon is small, exon length and number were chosen to resemble what is reported in eukaryotes [58] (8 exons of 200 nucleotides). A skipped exon, whose size can vary, was introduced in the middle of the inclusion isoform. The skipped exon can have a size of 10, 50 or 100 nt. We also allowed the ratio of major/minor isoforms (M/m) to vary. For a coverage of C and a ratio M/m, the number of reads coming from the major isoform is MC and the number of minor isoform reads is mC. We chose the following relative abundance ratios for the inclusion isoform: 90/10, 75/25 and 50/50. All reads are supposed to represent the full-length isoform. Finally, for hybrid correction input, short reads of length 150 were simulated along each isoform, with 10× and 100× coverage.
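
The setup above can be sketched as follows; this is an illustrative reconstruction (the sequence content and function names are our own), not the actual simulation code:

```python
import random

random.seed(1)

EXON_LEN, N_EXONS = 200, 8  # constitutive exon size and count, as in [58]

def make_isoforms(skipped_len):
    """Build an (inclusion, exclusion) isoform pair: 8 constitutive
    exons of 200 nt, with a skipped exon of variable size (10, 50 or
    100 nt in our setup) inserted in the middle of the inclusion."""
    exons = ["".join(random.choice("ACGT") for _ in range(EXON_LEN))
             for _ in range(N_EXONS)]
    skipped = "".join(random.choice("ACGT") for _ in range(skipped_len))
    exclusion = "".join(exons)
    inclusion = ("".join(exons[:N_EXONS // 2]) + skipped
                 + "".join(exons[N_EXONS // 2:]))
    return inclusion, exclusion

def read_counts(coverage, major, minor):
    """For coverage C and an abundance ratio M/m, the major isoform
    yields M*C reads and the minor isoform m*C reads."""
    return round(coverage * major), round(coverage * minor)

inclusion, exclusion = make_isoforms(50)
```

For instance, at coverage 100 and a 90/10 ratio, `read_counts(100, 0.9, 0.1)` gives 90 major-isoform reads and 10 minor-isoform reads.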

During the simulation, we produced two versions of each read. The reference read is the read that represents exactly its isoform, without errors. The uncorrected read is the one in which we introduced errors. We used an error rate and profile that mimic the errors observed in R9.4 ONT reads (total error rate of ∼13%, broken down as ∼5% substitutions, ∼1% insertions and ∼7% deletions). After each corrector was applied to the read set, we obtained a (reference, uncorrected, corrected) triplet for each read, which we used to assess the quality of the correction under several criteria.
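
The error-injection step can be sketched with the stated R9.4-like profile (∼5% substitutions, ∼1% insertions, ∼7% deletions). Drawing errors independently per base is an assumption of ours; the actual simulator's error model may differ:

```python
import random

random.seed(0)

def add_errors(seq, p_sub=0.05, p_ins=0.01, p_del=0.07):
    """Corrupt a sequence with substitutions, insertions and deletions
    drawn independently at each base, mimicking an R9.4-like ONT
    profile (~13% total error rate)."""
    out = []
    for base in seq:
        r = random.random()
        if r < p_del:
            continue  # deletion: drop the base
        if r < p_del + p_sub:
            out.append(random.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
        if random.random() < p_ins:
            out.append(random.choice("ACGT"))
    return "".join(out)

reference = "".join(random.choice("ACGT") for _ in range(10000))
uncorrected = add_errors(reference)
```

Since deletions outnumber insertions, the uncorrected read ends up slightly shorter than its reference, as observed in real ONT reads.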

We mapped the corrected reads to both the exclusion and inclusion reference sequences using a fast Smith–Waterman implementation [59], from which we obtained a SAM file. It is expected that corrected exclusion reads will map to the exclusion reference with no gaps, and that a deletion of the size of the skipped exon will be reported when mapping them to the inclusion reference. For each read, if it could be aligned to one of the two reference sequences in one block (according to the CIGAR string), then we assigned it to this reference. If more blocks were needed, we assigned the read to the reference sequence for which the cumulative length of gaps is the lowest. We also reported the ratio between the corrected read size for each isoform kind and the expected size of the corresponding reference isoform.
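
The assignment rule can be sketched from the CIGAR strings of the two alignments. This is an illustrative simplification: summing the I, D and N operation lengths as the ‘cumulative gap length’ and breaking ties toward the inclusion isoform are both assumptions of this sketch:

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def gap_length(cigar):
    """Cumulative length of gaps (insertion, deletion and skip
    operations) in an alignment's CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "IDN")

def assign(cigar_vs_inclusion, cigar_vs_exclusion):
    """Assign a corrected read to the reference isoform whose alignment
    requires the smallest cumulative gap length; a gap-free one-block
    alignment therefore always wins."""
    if gap_length(cigar_vs_inclusion) <= gap_length(cigar_vs_exclusion):
        return "inclusion"
    return "exclusion"

# An exclusion read aligned to the inclusion isoform shows a deletion
# the size of the skipped exon (here 50 nt), so it is assigned to the
# exclusion reference:
assert assign("800M50D800M", "1600M") == "exclusion"
```

Conversely, an inclusion read aligned to the exclusion reference shows an insertion of the skipped exon's size and is assigned to the inclusion reference.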


Key Points
  • Long-read transcriptome sequencing is hindered by high error rates that affect analyses such as the identification of isoforms, exon boundaries, open reading frames and the creation of gene catalogues.

  • This review evaluates the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads.

  • Existing tools significantly lower the error rate, but they also significantly perturb gene family sizes and isoform diversity.


Competing interests

JMA is one of the authors of the NaS error correction tool [29].

However, this study was designed and performed with no bias toward this particular tool.

JMA is part of the MinION Access Programme and received travel and accommodation expenses to speak at Oxford Nanopore Technologies conferences.


Acknowledgments

The authors thank the ASTER consortium, in particular Camille Sessegolo and Vincent Lacroix, for fruitful discussions. L.L. acknowledges CNPq/Brazil for the support. This work was performed using the computing facilities of the CC LBBE/PRABI and the France Génomique e-infrastructure (ANR-10-INBS-09-08).

Funding

This work was supported by the French National Research Agency [ANR ASTER, ANR-16-CE23-0001]; the INCEPTION project [PIA/ANR-16-CONV-0005]; and the Brazilian Ministry of Science, Technology and Innovation (in Portuguese, Ministério da Ciência, Tecnologia e Inovação—MCTI) through the National Counsel of Technological and Scientific Development (in Portuguese, Conselho Nacional de Desenvolvimento Científico e Tecnológico—CNPq), under the Science Without Borders (in Portuguese, Ciências Sem Fronteiras) scholarship grant [process number 203362/2014-4 to L.L.].

All authors are part of the ASTER project (ANR ASTER) with the purpose of developing algorithms and software for analysing third-generation sequencing data.

References