2024-08-29-SEA-RAFT-中英对照

英文题目 SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow
中文名称 SEA-RAFT:简单、高效、准确的光流RAFT算法
发表时间 2024年5月23日
平台 ECCV 2024
作者 Yihan Wang, Lahav Lipson, and Jia Deng
邮箱 {yw7685, llipson, jiadeng}@princeton.edu
来源 普林斯顿大学计算机科学系
关键词 光流估计

Abstract

We introduce SEA-RAFT, a simpler, more efficient, and more accurate RAFT for optical flow. Compared with RAFT, SEA-RAFT is trained with a new loss (mixture of Laplace). It directly regresses an initial flow for faster convergence in iterative refinements and introduces rigid-motion pre-training to improve generalization. SEA-RAFT achieves state-of-the-art accuracy on the Spring benchmark with a 0.36 endpoint-error (EPE) and a 3.69 1-pixel outlier rate (1px), representing 22.9% and 17.8% error reduction from the best published results. In addition, SEA-RAFT obtains the best cross-dataset generalization on KITTI and Spring. With its high efficiency, SEA-RAFT operates at least 2.3× faster than existing methods while maintaining competitive performance.

我们介绍了SEA-RAFT,一种更简单、高效和准确的光流RAFT算法。与RAFT相比,SEA-RAFT采用了一种新的损失函数(拉普拉斯混合)进行训练。它直接回归初始流,以加快迭代细化中的收敛速度,并引入刚体运动预训练以提高泛化能力。SEA-RAFT在Spring基准测试中达到了最先进的准确性,端点误差(EPE)为0.36,1像素异常率(1px)为3.69,相对于已发表的最佳结果分别减少了22.9%和17.8%的误差。此外,SEA-RAFT在KITTI和Spring数据集上获得了最佳的跨数据集泛化性能。凭借其高效率,SEA-RAFT比现有方法至少快2.3×,同时保持了竞争性能。

1. Introduction

Optical flow is a fundamental task in low-level vision and aims to estimate per-pixel 2D motion between video frames. It is useful for various downstream tasks including action recognition [39,49,67], video in-painting [10,22,60], frame interpolation [15,27,61], 3D reconstruction and synthesis [33,69].

光流是低级视觉中的一个基本任务,旨在估计视频帧之间的每像素2D运动。它在包括动作识别 [39,49,67]、视频修复 [10,22,60]、帧插值 [15,27,61]、3D重建和合成 [33,69] 等多种下游任务中非常有用。

Although traditionally formulated as an optimization problem [5,13,62], almost all recent methods are based on deep learning [6,8,11,14,24,29,42-45,48,50,54-57,63,66,68]. In particular, many state-of-the-art methods [14,29,43,44,50,66] have adopted architectures based on RAFT [50], which uses a recurrent network to iteratively refine a flow field.

尽管传统上被表述为一个优化问题 [5,13,62],但几乎所有最近的方法都基于深度学习 [6,8,11,14,24,29,42-45,48,50,54-57,63,66,68]。特别是,许多最先进的方法 [14,29,43,44,50,66] 采用了基于 RAFT [50] 的架构,该架构使用循环网络来迭代细化流场。

In this paper, we introduce SEA-RAFT, a new variant of RAFT that is more efficient and accurate. When compared against all existing approaches, SEA-RAFT has the best accuracy-efficiency Pareto frontier (Fig. 1):

在本文中,我们介绍了 SEA-RAFT,这是一种 RAFT 的新变体,效率更高且更准确。与所有现有方法相比,SEA-RAFT 具有最佳的准确性-效率帕累托前沿(图 1):

  • Accuracy: On Spring [35], SEA-RAFT achieves a new state of the art, outperforming the next best by a large margin: 18% error reduction on the 1px-outlier rate (3.686 vs. 4.482) and 24% error reduction on endpoint-error (0.363 vs. 0.471). On Sintel [3] and KITTI [36], it outperforms all other methods that have similar computational costs.

  • 准确性:在 Spring [35] 上,SEA-RAFT 达到了新的技术水平,大幅超越了次优方法:在 1px 异常率上误差减少 18%(3.686 对比 4.482),在端点误差上误差减少 24%(0.363 对比 0.471)。在 Sintel [3] 和 KITTI [36] 上,它优于所有具有相似计算成本的其他方法。

  • Efficiency: On each benchmark tested, SEA-RAFT runs at least 2.3× faster than existing methods that have comparable accuracy. Our smallest model, which still outperforms all other methods on Spring, can run at 21fps when processing 1080p images on an RTX3090, 3× faster than the original RAFT.

  • 效率:在每个测试基准上,SEA-RAFT 的运行速度至少比具有可比准确性的现有方法快 2.3×。我们最小的模型在 Spring 上仍然优于所有其他方法,在 RTX3090 上处理 1080p 图像时可以达到 21fps,比原始 RAFT 快 3×。

Fig. 1: Zero-shot performance of SEA-RAFT and existing methods on the Spring [35] training split. Latency is measured on an RTX3090 with a batch size of 1 and input resolution 540×960. SEA-RAFT has an accuracy close to the best one achieved by MS-RAFT+ [19] but is 11× smaller and 24× faster.

图 1:SEA-RAFT 和现有方法在 Spring [35] 训练集上的零样本性能。延迟是在批量大小为 1、输入分辨率 540×960 的 RTX3090 上测量的。SEA-RAFT 的准确性接近 MS-RAFT+ [19] 所达到的最佳水平,但体积小 11×,速度快 24×。

We achieve this by introducing a combination of improvements over the original RAFT:

我们通过在原始 RAFT 基础上引入一系列改进来实现这一点:

  • Mixture of Laplace Loss: Instead of the standard \({L}_{1}\) loss,we train the network to predict parameters of a mixture of Laplace distributions to maximize the log-likelihood of the ground truth flow. As we will demonstrate, this new loss reduces overfitting to ambiguous cases and improves generalization.

  • 混合拉普拉斯损失:我们不是使用标准的 \({L}_{1}\) 损失,而是训练网络预测混合拉普拉斯分布的参数,以最大化真实光流的对数似然。正如我们将展示的,这种新损失减少了对模糊情况的过拟合,并提高了泛化能力。

  • Directly Regressed Initial Flow: Instead of initializing the flow field to zero before iterative refinement, we directly predict the initial flow by reusing the existing context encoder and feeding it the stacked input frames. This simple change introduces minimal overhead but is surprisingly effective in reducing the number of iterations and improving efficiency.

  • 直接回归初始流:在迭代细化之前,我们不是将流场初始化为零,而是通过重用现有的上下文编码器并将其输入堆叠的帧来直接预测初始流。这一简单的改变引入的额外开销最小,但令人惊讶地有效,可以减少迭代次数并提高效率。

  • Rigid-Flow Pre-Training: We find that pre-training on TartanAir [52] can significantly improve generalization, despite the limited diversity of its flow, which is induced purely by camera motion in static scenes.

  • 刚性流预训练:我们发现,尽管流的变化仅由静态场景中的相机运动引起,多样性有限,但在TartanAir [52]上进行预训练可以显著提高泛化能力。

These improvements are novel in the context of RAFT-style methods for optical flow. Moreover, they are orthogonal to the improvements proposed in existing RAFT-style methods, which focus on replacing certain blocks with newer designs, such as replacing convolutional blocks with transformers.

这些改进在RAFT风格的光流方法背景下是新颖的。此外,它们与现有RAFT风格方法中提出的改进是正交的,后者主要关注用更新的设计替换某些模块,例如用变换器替换卷积块。

Besides the main improvements above, SEA-RAFT also incorporates architectural changes that greatly simplify the original RAFT. In particular, we find that certain custom designs of the original RAFT are unnecessary and can be replaced with standard off-the-shelf modules. For example, the original feature encoder and context encoder were custom-designed and must use different normalization layers for stable training; we replace each with a standard ResNet. In addition, we replace the original convolutional GRU with a simple RNN consisting entirely of ConvNeXt blocks. Such simplifications make it easy for SEA-RAFT to incorporate new neural building blocks and scale to larger datasets.

除了上述主要改进之外,SEA-RAFT还结合了架构上的变化,这些变化极大地简化了原始的RAFT。特别是,我们发现原始RAFT的某些定制设计是不必要的,可以用标准的现成模块替换。例如,原始的特征编码器和上下文编码器是定制设计的,必须使用不同的归一化层以确保稳定训练;我们用标准的ResNet替换了它们。此外,我们用完全由ConvNext块组成的简单RNN替换了原始的卷积GRU。这些简化使得SEA-RAFT易于整合新的神经构建块,并扩展到更大的数据集。

We perform extensive experiments to evaluate SEA-RAFT on standard benchmarks including Spring, Sintel, and KITTI. We also validate the effectiveness of our improvements through ablation studies.

我们在包括Spring、Sintel和KITTI在内的标准基准上进行了广泛的实验,以评估SEA-RAFT。我们还通过消融研究验证了我们改进的有效性。

2 Related Work

Estimating Optical Flow Classical approaches treated optical flow as an optimization problem that maximizes visual similarity between corresponding pixels, with strong regularization [5,13,62]. Current methods [6,9,14,16-19,24,30,31,42-45,48,50,54-57,65,66,68] are mostly based on deep learning. FlowNets [9,17] regarded optical flow as a dense regression problem and used stacked convolution blocks for prediction. DCNet [58] and PWC-Net [45] introduced 4D cost-volume to explicitly model pixel correspondence. RAFT [50] further combined multi-scale 4D cost-volume with recurrent iterative refinements, achieving large improvements and spawning many follow-ups [14,19,30,31,43,44,48,66,68].

估计光流 传统方法将光流视为一个优化问题,在强正则化下最大化对应像素之间的视觉相似性 [5,13,62]。当前方法 [6,9,14,16-19,24,30,31,42-45,48,50,54-57,65,66,68] 主要基于深度学习。FlowNets [9,17] 将光流视为密集回归问题,并使用堆叠的卷积块进行预测。DCNet [58] 和 PWC-Net [45] 引入了4D成本体积以显式建模像素对应关系。RAFT [50] 进一步结合多尺度4D成本体积与循环迭代细化,实现了大幅改进并催生了许多后续工作 [14,19,30,31,43,44,48,66,68]。

Our method is a new variant of RAFT [50] with several improvements including a new loss function, direct regression of initial flow, rigid-flow pre-training, and architectural simplifications. All of these improvements are new compared to existing RAFT variants. In particular, our direct regression of initial flow is new compared to existing efficient RAFT variants \(\left\lbrack {6,{11},{37}}\right\rbrack\) ,which mainly focus on efficient implementations of RAFT modules. This direct regression is a simple change with minimal overhead, but substantially reduces the number of RAFT iterations needed.

我们的方法是RAFT[50]的一个新变体,包含多项改进,包括新的损失函数、初始流直接回归、刚性流预训练和架构简化。与现有RAFT变体相比,所有这些改进都是新的。特别是,我们的初始流直接回归与现有高效RAFT变体\(\left\lbrack {6,{11},{37}}\right\rbrack\)相比是新的,后者主要关注RAFT模块的高效实现。这种直接回归是一个简单的改变,开销最小,但显著减少了所需的RAFT迭代次数。

Data for Optical Flow FlyingChairs and FlyingThings3D [9,34] are commonly used datasets for optical flow. They provide a large amount of synthetic data but have limited realism. Sintel [3], VIPER [41], Infinigen [40], and Spring [35] are more realistic, using open-source 3D animations, games or procedurally generated scenes. Besides synthetic data, Middlebury, KITTI, and HD1K [1,12,23,36] provide annotations for real-world image pairs. These datasets are limited in both quantity and diversity due to the difficulty of accurately annotating optical flow in the real world. To leverage more data,several methods \(\left\lbrack {8,{42},{54},{55}}\right\rbrack\) pre-train their models on different tasks. MatchFlow [8] pre-trains on geometric image matching (GIM) using MegaDepth [26]. Croco-Flow [54, 55], DDVM [42], and Flowformer++ [43] pre-train on unlabeled data. We pre-train SEA-RAFT on rigid flow using TartanAir [52]. Though TartanAir [52] has been used in other methods such as DDVM [42] and CroCo-Flow [54,55], our adoption of rigid-flow pre-training is new in the context of RAFT-style methods.

光流数据集 FlyingChairs 和 FlyingThings3D [9,34] 是常用的数据集。它们提供了大量的合成数据,但真实性有限。Sintel [3]、VIPER [41]、Infinigen [40] 和 Spring [35] 使用开源的 3D 动画、游戏或程序生成的场景,更加真实。除了合成数据外,Middlebury、KITTI 和 HD1K [1,12,23,36] 为真实世界的图像对提供注释。由于在现实世界中准确标注光流的难度,这些数据集在数量和多样性上都有限。为了利用更多数据,几种方法 \(\left\lbrack {8,{42},{54},{55}}\right\rbrack\) 在不同任务上预训练其模型。MatchFlow [8] 使用 MegaDepth [26] 在几何图像匹配(GIM)上预训练。Croco-Flow [54, 55]、DDVM [42] 和 Flowformer++ [43] 在未标注数据上预训练。我们在 TartanAir [52] 上使用刚性流预训练 SEA-RAFT。尽管 TartanAir [52] 已被其他方法如 DDVM [42] 和 CroCo-Flow [54,55] 使用,但我们在 RAFT 风格方法的背景下采用刚性流预训练是新颖的。

Predicting Probability Distributions Predicting probability distributions is a common practice in computer vision \(\left\lbrack {2,4,{25},{32},{47},{51},{53},{64}}\right\rbrack\) . In tasks closely related to optical flow such as keypoint matching \(\left\lbrack {4,{47},{51},{64}}\right\rbrack\) ,the variance of the probability distribution reflects uncertainty of predictions and therefore is useful for many applications. For example, LoFTR [47] filters out uncertain matching pairs. Aspanformer [4] adjusts the look-up radius based on uncertainty.

预测概率分布 预测概率分布是计算机视觉中的一种常见做法 \(\left\lbrack {2,4,{25},{32},{47},{51},{53},{64}}\right\rbrack\)。在诸如关键点匹配 \(\left\lbrack {4,{47},{51},{64}}\right\rbrack\) 等与光流密切相关的任务中,概率分布的方差反映了预测的不确定性,因此对许多应用都很有用。例如,LoFTR [47] 过滤掉不确定的匹配对。Aspanformer [4] 根据不确定性调整查找半径。

Fig. 2: Compared with RAFT [50], SEA-RAFT introduces (1) rigid-flow pre-training, (2) mixture of Laplace loss, and (3) direct regression of initial flow.

图 2:与 RAFT [50] 相比,SEA-RAFT 引入了(1)刚性流预训练,(2)拉普拉斯混合损失,以及(3)初始流的直接回归。

To handle the ambiguity caused by heavy occlusion, SEA-RAFT predicts a mixture of Laplace (MoL) distribution. Although MoL has been used in keypoint matching methods such as PDC-Net + [51], our use of MoL is new in the context of RAFT-style methods. In addition, our formulation is different in that we require one mixture component to have a constant variance, making it equivalent to the \({L}_{1}\) loss that aligns better with the optical flow evaluation metrics. This difference is crucial for achieving competitive performance in optical flow, where every pixel needs accurate correspondence, unlike keypoint matching, where a subset of reliable matches suffices.

为了处理由严重遮挡引起的模糊性,SEA-RAFT 预测了一个拉普拉斯混合(MoL)分布。尽管 MoL 已在关键点匹配方法中使用,如 PDC-Net + [51],但我们在 RAFT 风格方法中使用 MoL 是新颖的。此外,我们的公式不同之处在于我们要求一个混合分量具有恒定方差,使其等同于与光流评估指标更好地对齐的 \({L}_{1}\) 损失。这一差异对于实现光流中的竞争性能至关重要,其中每个像素都需要精确对应,而不像关键点匹配,其中一组可靠匹配就足够了。

3 Method

In this section, we first describe the iterative refinement in RAFT and then introduce the improvements that lead to SEA-RAFT.

在本节中,我们首先描述 RAFT 中的迭代细化,然后介绍导致 SEA-RAFT 的改进。

3.1 Iterative refinement

Given two adjacent RGB frames, RAFT predicts a field of pixel-wise 2D vectors through iterative refinement that consists of two parts: (1) feature and context encoders, which transform images into lower-resolution dense features, and (2) an RNN unit, which iteratively refines the predictions.

给定两个相邻的 RGB 帧,RAFT 通过迭代细化预测像素级的 2D 向量场,该细化包括两部分:(1)特征和上下文编码器,将图像转换为低分辨率的密集特征,以及(2)一个 RNN 单元,迭代地细化预测。

Given two images \({I}_{1},{I}_{2} \in {\mathbb{R}}^{H \times W \times 3}\) ,the feature encoder \(F\) takes \({I}_{1},{I}_{2}\) as inputs separately and outputs a lower-resolution feature \(F\left( {I}_{1}\right) ,F\left( {I}_{2}\right) \in {\mathbb{R}}^{h \times w \times D}\) . The context encoder \(C\) takes source image \({I}_{1}\) as input and outputs a context feature \(C\left( {I}_{1}\right) \in {\mathbb{R}}^{h \times w \times D}\) . A multi-scale \(4\mathrm{D}\) correlation volume \(\left\{ {V}_{k}\right\}\) is then built with the features from feature encoder \(F\) :

给定两幅图像 \({I}_{1},{I}_{2} \in {\mathbb{R}}^{H \times W \times 3}\),特征编码器 \(F\) 分别以 \({I}_{1},{I}_{2}\) 作为输入并输出一个低分辨率特征 \(F\left( {I}_{1}\right) ,F\left( {I}_{2}\right) \in {\mathbb{R}}^{h \times w \times D}\)。上下文编码器 \(C\) 以源图像 \({I}_{1}\) 作为输入并输出一个上下文特征 \(C\left( {I}_{1}\right) \in {\mathbb{R}}^{h \times w \times D}\)。然后,使用来自特征编码器 \(F\) 的特征构建一个多尺度的 \(4\mathrm{D}\) 相关体积 \(\left\{ {V}_{k}\right\}\)

\[{V}_{k} = F\left( {I}_{1}\right) \circ \operatorname{AvgPool}{\left( F\left( {I}_{2}\right) ,{2}^{k}\right) }^{\top } \in {\mathbb{R}}^{h \times w \times \frac{h}{{2}^{k}} \times \frac{w}{{2}^{k}}}, \]

where \(\circ\) represents the correlation operator,which computes similarities (as dot products of feature vectors) between all pairs of pixels across two feature maps.

其中 \(\circ\) 表示相关运算符,该运算符计算两个特征图之间所有像素对之间的相似性(作为特征向量的点积)。

Several works \(\left\lbrack {{18},{19}}\right\rbrack\) have explored the optimal choices of the number of levels in the cost volume \(\left( k\right)\) and the feature resolution \(\left( {h,w}\right)\) . In SEA-RAFT, we simply follow the original setting in RAFT [50]: \(\left( {h,w}\right) = \frac{1}{8}\left( {H,W}\right) ,k = 4\) .

多项工作 \(\left\lbrack {{18},{19}}\right\rbrack\) 已经探讨了成本体积 \(\left( k\right)\) 中层数和特征分辨率 \(\left( {h,w}\right)\) 的最佳选择。在 SEA-RAFT 中,我们简单地遵循 RAFT [50] 中的原始设置:\(\left( {h,w}\right) = \frac{1}{8}\left( {H,W}\right) ,k = 4\)
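
As a concrete illustration, the all-pairs correlation pyramid can be built in a few lines of PyTorch. This is a minimal sketch under the shapes defined above (features at 1/8 resolution, \(k = 4\) levels); the function and variable names are ours rather than the authors' code, and the division by \(\sqrt{D}\) is an assumed RAFT-style scaling. Pooling the correlation volume over the \(I_2\) dimensions is equivalent to pooling \(F(I_2)\), since the dot product is linear.

```python
import torch
import torch.nn.functional as F_nn

def build_corr_pyramid(f1, f2, num_levels=4):
    """Sketch of the multi-scale 4D correlation volume {V_k}.
    f1, f2: (B, D, h, w) features from the feature encoder.
    Returns a list of num_levels tensors of shape (B*h*w, 1, h/2^k, w/2^k)."""
    B, D, h, w = f1.shape
    corr = torch.matmul(f1.flatten(2).transpose(1, 2), f2.flatten(2))  # (B, h*w, h*w) dot products
    corr = corr.view(B * h * w, 1, h, w) / D**0.5                      # assumed scaling
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F_nn.avg_pool2d(corr, kernel_size=2, stride=2)          # pool over the I2 spatial dims
        pyramid.append(corr)
    return pyramid
```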

RAFT iteratively refines a flow prediction \(\mu\) . Initially, \(\mu\) is set to be all zeros. Each refinement step uses the current flow prediction \(\mu\) to fetch a \({D}_{M}\) -dim motion feature \(M\) from the multi-scale correlation volume \(\left\{ {V}_{k}\right\}\) with a look-up radius \(r\) :

RAFT 迭代地细化流预测 \(\mu\)。最初,\(\mu\) 被设置为全零。每个细化步骤使用当前的流预测 \(\mu\) 从多尺度相关体积 \(\left\{ {V}_{k}\right\}\) 中获取一个 \({D}_{M}\) 维的运动特征 \(M\),查找半径为 \(r\)

\[M = \operatorname{MotionEncoder}\left( {\operatorname{LookUp}\left( {\left\{ {V}_{k}\right\} ,\mu ,r}\right) }\right) \in {\mathbb{R}}^{h \times w \times {D}_{M}}, \]

where the Lookup operator returns a motion feature vector for each pixel in \({I}_{1}\) , consisting of similarities between the pixel in \({I}_{1}\) and its current correspondence’s neighboring pixels in \({I}_{2}\) within the radius \(r\) . The motion feature vector is further transformed by a motion encoder.

其中查找操作符为 \({I}_{1}\) 中的每个像素返回一个运动特征向量,该向量由 \({I}_{1}\) 中的像素与其在 \({I}_{2}\) 中当前对应像素的邻近像素在半径 \(r\) 内的相似性组成。运动特征向量进一步通过运动编码器进行变换。
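
A possible implementation of the LookUp operator, using bilinear sampling via `grid_sample`, is sketched below. The helper name and exact tensor layout are illustrative assumptions: it samples a \((2r+1)\times(2r+1)\) window of correlation values around the current correspondence at every pyramid level and concatenates the results into the input of the motion encoder.

```python
import torch
import torch.nn.functional as F_nn

def lookup(pyramid, flow, radius=4):
    """Sketch of LookUp({V_k}, mu, r). pyramid[k]: (B*h*w, 1, h/2^k, w/2^k),
    flow: (B, 2, h, w) with channels (x, y). Returns (B, L*(2r+1)^2, h, w)."""
    B, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=flow.device),
                            torch.arange(w, device=flow.device), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).float() + flow.permute(0, 2, 3, 1)  # (B, h, w, 2)
    d = torch.linspace(-radius, radius, 2 * radius + 1, device=flow.device)
    dy, dx = torch.meshgrid(d, d, indexing="ij")
    window = torch.stack([dx, dy], dim=-1)                       # (2r+1, 2r+1, 2) offsets
    out = []
    for k, corr in enumerate(pyramid):
        centers = coords.reshape(B * h * w, 1, 1, 2) / 2**k + window   # sample locations at level k
        _, _, hk, wk = corr.shape
        grid = torch.stack([2 * centers[..., 0] / (wk - 1) - 1,        # normalize to [-1, 1]
                            2 * centers[..., 1] / (hk - 1) - 1], dim=-1)
        sampled = F_nn.grid_sample(corr, grid, align_corners=True)     # (B*h*w, 1, 2r+1, 2r+1)
        out.append(sampled.view(B, h, w, -1))
    return torch.cat(out, dim=-1).permute(0, 3, 1, 2)
```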

Existing works \(\left\lbrack {4,{11},{21}}\right\rbrack\) have explored dynamic radius and look-up when obtaining the motion features from \(\left\{ {V}_{k}\right\}\) . For simplicity of design,SEA-RAFT follows the original RAFT and sets the look-up radius \(r = 4\) to a fixed constant. The motion feature \(M\) is fed into the RNN cell along with hidden state \(h\) and context feature \(C\left( {I}_{1}\right)\) . From the new hidden state \({h}^{\prime }\) ,the residual flow \({\Delta \mu }\) is regressed by a 2-layer FlowHead:

现有工作 \(\left\lbrack {4,{11},{21}}\right\rbrack\) 已经探讨了在从 \(\left\{ {V}_{k}\right\}\) 获取运动特征时使用动态半径和查找。为了设计的简单性,SEA-RAFT 遵循原始的 RAFT 并将查找半径 \(r = 4\) 设置为一个固定的常数。运动特征 \(M\) 与隐藏状态 \(h\) 和上下文特征 \(C\left( {I}_{1}\right)\) 一起输入到 RNN 单元中。从新的隐藏状态 \({h}^{\prime }\) 中,残差流 \({\Delta \mu }\) 通过一个两层的 FlowHead 进行回归:

\[{h}^{\prime } = \operatorname{RNN}\left( {h, M, C\left( {I}_{1}\right) }\right) \]

\[{\Delta \mu } = \operatorname{FlowHead}\left( {h}^{\prime }\right) \]

Methods using RAFT-Style iterative refinement [14,50] usually need many iterations: 12 in training and as many as 32 in inference. As a result, RNN-based iterative refinement is a significant bottleneck in latency. Though there have been attempts \(\left\lbrack {6,{11}}\right\rbrack\) to reduce the number of iterations,the performance drastically drops with fewer iterations. In contrast, SEA-RAFT only needs 4 iterations in training and up to 12 iterations in inference to achieve competitive performance.

使用 RAFT 风格迭代细化的方法 [14,50] 通常需要多次迭代:训练中需要 12 次,推理中多达 32 次。因此,基于 RNN 的迭代细化是延迟的一个显著瓶颈。尽管有尝试 \(\left\lbrack {6,{11}}\right\rbrack\) 减少迭代次数,但性能会随着迭代次数减少而急剧下降。相比之下,SEA-RAFT 在训练中仅需要 4 次迭代,在推理中最多需要 12 次迭代即可达到竞争性性能。
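
Putting the pieces together, the refinement loop itself is short. The sketch below reuses the `lookup` helper sketched above and treats the motion encoder, RNN cell, and flow head as callables supplied by the caller; it also includes the gradient-stopping trick on \(\mu\) mentioned later in Sec. 4, but the module interfaces are illustrative rather than the authors' exact code.

```python
def refine(motion_encoder, rnn_cell, flow_head, pyramid, context, flow, hidden,
           num_iters=4, radius=4):
    """Sketch of SEA-RAFT-style iterative refinement (4 iterations by default)."""
    predictions = []
    for _ in range(num_iters):
        motion = motion_encoder(lookup(pyramid, flow, radius), flow)  # motion feature M
        hidden = rnn_cell(hidden, motion, context)                    # h' = RNN(h, M, C(I1))
        delta = flow_head(hidden)                                     # residual flow
        flow = flow.detach() + delta     # stop gradient through mu, keep it for delta
        predictions.append(flow)
    return predictions                   # one prediction per iteration, for the sequence loss
```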

3.2 Mixture-of-Laplace Loss 拉普拉斯混合损失

Most prior works are supervised using an endpoint-error loss on all pixels. However, optical flow training data often contains ambiguous, unpredictable samples, which can dominate this loss empirically.

大多数先前的工作使用所有像素的端点误差损失进行监督。然而,光流训练数据通常包含模糊、不可预测的样本,这些样本在经验上可以主导这种损失。

Ambiguous Cases Ambiguous cases of optical flow can arise with heavy occlusion (Fig. 3). While in many cases the motion of occluded pixels can be predicted, sometimes the ambiguity can be too large to predict a single outcome. We examined 10 samples with the highest endpoint-error in the training and validation sets of FlyingChairs [9] and found that ambiguous cases dominate the error.

模糊情况 光流的模糊情况可能因严重遮挡而出现,如图3所示。虽然在许多情况下被遮挡像素的运动可以预测,但有时模糊性可能太大以至于无法预测单一结果。我们检查了FlyingChairs [9]训练和验证集中端点误差最高的10个样本,发现模糊情况主导了误差。

Review of Probabilistic Regression Prior works for image-matching have proposed probabilistic losses to enable their model to express aleatoric or epistemic uncertainty [4,47,51,53,55,64]. These approaches regress the parameters of the probabilistic model and maximize the log-likelihood of the ground truth during training.

概率回归的回顾 先前用于图像匹配的工作提出了概率损失,以使模型能够表达偶然或认知不确定性 [4,47,51,53,55,64]。这些方法回归概率模型的参数,并在训练过程中最大化真实值的对数似然。

Fig. 3: Ambiguous cases can occur frequently in training data where flow is unpredictable due to occlusion. Such cases can dominate the \({L}_{1}\) loss (shown as an error map) used by current methods [50,56]. Our new training loss allows the model to account for such uncertainty.

图3:模糊情况在训练数据中可能经常发生,由于遮挡导致流不可预测。此类情况可以主导当前方法 [50,56] 使用的 \({L}_{1}\) 损失(显示为误差图)。我们的新训练损失允许模型考虑这种不确定性。

Given an image pair \(\left\{ {{I}_{1},{I}_{2}}\right\}\) and the flow ground truth \({\mu }_{gt}\) ,the training loss is

给定一对图像 \(\left\{ {{I}_{1},{I}_{2}}\right\}\) 和流的真实值 \({\mu }_{gt}\),训练损失为

\[{\mathcal{L}}_{\text{prob }} = - \log {p}_{\theta }\left( {\mu = {\mu }_{gt} \mid {I}_{1},{I}_{2}}\right) \]

where the probability density function \({p}_{\theta }\) is parameterized by the network. Prior work has formulated \({p}_{\theta }\) as a Gaussian or a Laplace distribution with a predicted mean and variance. For example, we can formulate a naive version of probabilistic regression by assuming: (1) \({p}_{\theta }\) is Laplace with mean \(\mu \in {\mathbb{R}}^{H \times W \times 2}\) and scale \(b \in {\mathbb{R}}^{H \times W \times 1}\) predicted by the network, (2) the flow distribution is pixel-wise independent, and (3) the x-direction flow and the y-direction flow are independent but share the same scale parameter \(b\):

其中概率密度函数 \({p}_{\theta }\) 由网络参数化。先前的工作将 \({p}_{\theta }\) 形式化为具有预测均值和方差的高斯分布或拉普拉斯分布。例如,我们可以通过假设:(1) \({p}_{\theta }\) 是具有网络预测的均值 \(\mu \in {\mathbb{R}}^{H \times W \times 2}\) 和尺度 \(b \in {\mathbb{R}}^{H \times W \times 1}\) 的拉普拉斯分布,(2) 流分布是像素独立的,以及 (3) x方向流和y方向流独立但共享相同的尺度参数 \(b\),来形式化概率回归的简单版本:

\[{\mathcal{L}}_{Lap} = \frac{1}{HW}\mathop{\sum }\limits_{u}\mathop{\sum }\limits_{v}\left( {\log {2b}\left( {u,v}\right) + \frac{{\begin{Vmatrix}{\mu }_{gt}\left( u,v\right) - \mu \left( u,v\right) \end{Vmatrix}}_{1}}{{2b}\left( {u,v}\right) }}\right) \tag{1} \]

where \(u,v\) are indices to the pixels. The Laplace loss can be regarded as an extended version of \({L}_{1}\) loss with an extra penalty term \(b\) . During inference, \(\mu\) represents the flow prediction,and the scale factor \(b\) provides an estimation of uncertainty. However, we find this naive probabilistic regression does not work well on optical flow, which has also been pointed out by prior work [64].

其中 \(u,v\) 是像素的索引。拉普拉斯损失可以视为 \({L}_{1}\) 损失的扩展版本,带有一个额外的惩罚项 \(b\)。在推理过程中,\(\mu\) 表示流预测,而尺度因子 \(b\) 提供了不确定性的估计。然而,我们发现这种朴素概率回归在光流上效果不佳,这一点也已被先前的工作 [64] 指出。

Mixture of Laplace One reason that naive probabilistic regression performs poorly is numerical instability as the loss contains a log term. To address this issue,we regress \(b\left( {u,v}\right)\) directly in log-space. This approach makes training more stable compared to previous approaches which clamp \(b\) to \(\lbrack \epsilon ,\infty )\) ,where \(\epsilon\) is a small positive number.

拉普拉斯混合分布 朴素概率回归表现不佳的一个原因是数值不稳定性,因为损失包含一个对数项。为了解决这个问题,我们在对数空间中直接回归 \(b\left( {u,v}\right)\)。与之前将 \(b\) 限制在 \(\lbrack \epsilon ,\infty )\) 的方法相比,这种方法使得训练更加稳定,其中 \(\epsilon\) 是一个小的正数。

Fig. 4: Visualization on Spring [35] test set.

图 4:Spring [35] 测试集的可视化。

Another reason that naive probabilistic regression performs poorly is that it deviates from the standard endpoint-error metric, which only cares about the \({L}_{1}\) difference, but not the uncertainty estimation. Thus, we propose to use a mixture of two Laplace distributions: one for ordinary cases, and the other for ambiguous cases, with mixing coefficient \(\alpha \in \left\lbrack {0,1}\right\rbrack\):

朴素概率回归表现不佳的另一个原因是它偏离了标准的端点误差度量,该度量只关心 \({L}_{1}\) 差异,但不关心不确定性估计。因此,我们提出使用两个拉普拉斯分布的混合:一个用于普通情况,另一个用于模糊情况,混合系数为 \(\alpha \in \left\lbrack {0,1}\right\rbrack\):

\[\operatorname{MixLap}\left( {x;\alpha ,{\beta }_{1},{\beta }_{2},\mu }\right) = \alpha \cdot \frac{{e}^{-\frac{\left| x - \mu \right| }{{e}^{{\beta }_{1}}}}}{2{e}^{{\beta }_{1}}} + \left( {1 - \alpha }\right) \cdot \frac{{e}^{-\frac{\left| x - \mu \right| }{{e}^{{\beta }_{2}}}}}{2{e}^{{\beta }_{2}}} \]

Intuitively, at each pixel, we want the first component of the mixture to be aligned with the endpoint-error metric, and the second component to account for ambiguous cases. To explicitly enforce this,we fix \({\beta }_{1} = 0\) ,such that the network is encouraged to optimize for the L1 loss when possible. This leads to the following Mixture-of-Laplace (MoL) loss:

直观上,在每个像素上,我们希望混合的第一个分量与端点误差度量对齐,第二个分量用于模糊情况。为了明确强化这一点,我们固定 \({\beta }_{1} = 0\),使得网络在可能的情况下鼓励优化 L1 损失。这导致了以下混合拉普拉斯(MoL)损失:

\[{\mathcal{L}}_{MoL} = - \frac{1}{2HW}\mathop{\sum }\limits_{u}\mathop{\sum }\limits_{v}\mathop{\sum }\limits_{{d \in \{ x,y\} }}\log \left\lbrack {\operatorname{MixLap}\left( {{\mu }_{gt}{\left( u,v\right) }_{d};\alpha \left( {u,v}\right) ,0,{\beta }_{2}\left( {u,v}\right) ,\mu {\left( u,v\right) }_{d}}\right) }\right\rbrack \tag{2} \]

where \(d\) indexes the axis of the flow vector (the \(x\) direction or the \(y\) direction).

其中 \(d\) 索引了流矢量的轴(\(x\) 方向或 \(y\) 方向)。

The free parameters \(\alpha ,{\beta }_{2},\mu\) of \({\mathcal{L}}_{MoL}\) are predicted by the network. Intuitively, a higher \(\alpha\) means the flow prediction of this pixel is more "ordinary" instead of "ambiguous". Mathematically, a higher \(\alpha\) makes \({\mathcal{L}}_{MoL}\) behave like an \({L}_{1}\) loss. In Sec. 4.3, we show that this property leads to better accuracy.

网络预测了 \({\mathcal{L}}_{MoL}\) 的自由参数 \(\alpha ,{\beta }_{2},\mu\)。直观上,较高的 \(\alpha\) 意味着该像素的流预测更"普通"而非"模糊"。从数学上讲,较高的 \(\alpha\) 使得 \({\mathcal{L}}_{MoL}\) 表现得像一个 \({L}_{1}\) 损失。在第 4.3 节中,我们展示了这一特性带来更高的准确性。
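
For concreteness, the per-pixel Mixture-of-Laplace negative log-likelihood of Eq. (2) can be written as follows. This is a sketch, not the authors' code: tensor names are assumptions, the clamping range for \(\beta_2\) follows the implementation details below and Tab. 4, and \(\alpha\) is produced here through a sigmoid so that it stays in \([0,1]\).

```python
import math
import torch

def mol_loss(flow_pred, alpha_logit, beta2, flow_gt):
    """Sketch of Eq. (2): Mixture-of-Laplace NLL with beta_1 fixed to 0 and
    beta_2 regressed in log-space. flow_pred/flow_gt: (B, 2, H, W),
    alpha_logit/beta2: (B, 1, H, W)."""
    alpha = torch.sigmoid(alpha_logit)           # mixing coefficient in [0, 1]
    beta2 = beta2.clamp(0.0, 10.0)               # bounds used for stability (Sec. 3.2, Tab. 4)
    err = (flow_gt - flow_pred).abs()            # per-axis |x - mu|, broadcasts over both axes
    log_lap1 = -err - math.log(2.0)                        # log Laplace density, scale e^0 = 1
    log_lap2 = -err / beta2.exp() - beta2 - math.log(2.0)  # log Laplace density, scale e^beta2
    log_mix = torch.logsumexp(
        torch.stack([alpha.clamp_min(1e-6).log() + log_lap1,
                     (1 - alpha).clamp_min(1e-6).log() + log_lap2]), dim=0)
    return -log_mix.mean()                       # averaged over pixels and both flow axes
```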

Note that though the mixture model has been used in keypoint matching [4,47,51], its application to optical flow requires a different formulation because the goal is substantially different. In keypoint matching, the goal is to identify a subset of reliable matches for downstream applications such as camera pose estimation. Predicting uncertainty serves to filter out unreliable matches, and there is no explicit penalty for predicting few correspondences. As a result, it is not essential for them to align a mixture component to the \({L}_{1}\) loss. In optical flow, we are evaluated on the flow prediction for every pixel.

请注意,尽管混合模型已用于关键点匹配 [4,47,51],但其应用于光流需要不同的表述,因为目标本质上不同。在关键点匹配中,目标是识别一组可靠的匹配,用于下游应用,如相机姿态估计。预测不确定性用于过滤不可靠的匹配,而对于预测较少的对应关系没有明确的惩罚。因此,将混合分量对齐到 \({L}_{1}\) 损失并非必要。在光流中,我们针对每个像素的流预测进行评估。

Implementation Details We set an upper bound for \(\beta\) to 10 in the loss to make the training more stable. We also re-predict \(\alpha\) and \(\beta\) every update iteration. We can similarly define the probabilistic sequence loss as:

实现细节 我们在损失中为 \(\beta\) 设置了一个上限 10,以使训练更加稳定。我们还每更新迭代重新预测 \(\alpha\)\(\beta\)。我们可以类似地定义概率序列损失为:

\[{\mathcal{L}}_{all} = \mathop{\sum }\limits_{{i = 1}}^{N}{\gamma }^{N - i}{\mathcal{L}}_{MoL}^{i} \tag{3} \]

Fig. 5: Visualization on Sintel [3], KITTI [36], and Middlebury [1].

图 5:在 Sintel [3]、KITTI [36] 和 Middlebury [1] 上的可视化。

where \({\mathcal{L}}_{MoL}^{i}\) denotes the probabilistic loss in iteration \(i\), \(N\) denotes the number of iterations, and \(\gamma < 1\) exponentially downweights the early iterations. We empirically observe that our method significantly reduces the number of update iterations needed in inference. In fact, \(N = 4\) is sufficient for SEA-RAFT to take first place on the Spring [35] benchmark. We provide detailed ablations in Tab. 4.

其中 \({\mathcal{L}}_{MoL}^{i}\) 表示第 \(i\) 次迭代中的概率损失,\(N\) 表示迭代次数,而 \(\gamma < 1\) 指数级地降低早期迭代的权重。我们实证观察到,我们的方法显著减少了推理中所需的更新迭代次数。事实上,\(N = 4\) 足以使 SEA-RAFT 在 Spring [35] 基准测试中名列第一。我们在表 4 中提供了详细的消融研究。
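
A sketch of the probabilistic sequence loss of Eq. (3), assuming the per-iteration MoL losses have already been computed; the default value of \(\gamma\) here is a placeholder assumption (RAFT's 0.8), since this excerpt does not restate it.

```python
def sequence_loss(per_iter_losses, gamma=0.8):
    """Sketch of Eq. (3): exponentially down-weight early iterations.
    per_iter_losses: [L_MoL^1, ..., L_MoL^N]; the gamma value is assumed."""
    N = len(per_iter_losses)
    return sum(gamma ** (N - i) * loss for i, loss in enumerate(per_iter_losses, start=1))
```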

3.3 Direct Regression of Initial Flow

RAFT-style iterative refinements \(\left\lbrack {8,{14},{31},{37},{48},{66}}\right\rbrack\) typically zero-initialize the flow field. However, zero-initialization may deviate substantially from the ground truth, thus needing many iterations. In SEA-RAFT, we borrow an idea from the FlowNet family of methods \(\left\lbrack {9,{17}}\right\rbrack\) to predict an initial estimate of optical flow from the context encoder, given both frames as input. We also predict an associated MoL (see Sec. 3.2).

RAFT 风格的迭代改进 \(\left\lbrack {8,{14},{31},{37},{48},{66}}\right\rbrack\) 通常将流场初始化为零。然而,零初始化可能与真实情况相差甚远,因此需要多次迭代。在 SEA-RAFT 中,我们从 FlowNet 系列方法 \(\left\lbrack {9,{17}}\right\rbrack\) 中借鉴了一个想法,即从上下文编码器中预测光流的初始估计值,给定两个帧作为输入。我们还预测了相关的 MoL(见第 3.2 节)。

This simple modification also significantly improves the convergence speed of the iterative refinement framework, allowing one to use fewer iterations during inference. Detailed ablations are shown in Tab. 4.

这一简单的修改也显著提高了迭代改进框架的收敛速度,使得在推理过程中可以使用更少的迭代次数。详细的消融实验结果见表 4。
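
To make the idea concrete, a minimal version of such an initialization head is sketched below: the context encoder sees the two frames stacked along the channel dimension, and a small convolutional head regresses the initial flow together with the MoL parameters at 1/8 resolution. The module name, channel widths, and head layout are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class InitHead(nn.Module):
    """Illustrative head on top of the context features of the stacked frames:
    regresses the initial flow (2 ch) plus MoL parameters alpha and beta_2 (1 ch each)."""
    def __init__(self, ctx_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(ctx_dim, 4, kernel_size=3, padding=1)

    def forward(self, ctx):
        flow_init, alpha_logit, beta2 = self.proj(ctx).split([2, 1, 1], dim=1)
        return flow_init, alpha_logit, beta2
```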

3.4 Large-Scale Rigid-Flow Pre-Training

Most prior works train on a small number of datasets with limited size, diversity and realism [9,34]. To improve generalization, we pre-train SEA-RAFT on TartanAir [52], which provides optical flow annotations between a pair of (non-rectified) stereo cameras. This type of motion field is a special case of optical flow induced by viewpoint change in a rigid static scene. Despite its limited motion diversity, it enables SEA-RAFT to train on data with higher realism and scene diversity, leading to better generalization.

大多数先前的工作在数量、多样性和真实性有限的数据集 [9,34] 上进行训练。为了提高泛化能力,我们在 TartanAir [52] 上预训练 SEA-RAFT,该数据集提供了一对(未校正的)立体相机之间的光流标注。这种运动场由刚性静态场景中的视角变化引起,是光流的一个特例。尽管其运动多样性有限,但它使 SEA-RAFT 能够在更具真实性和场景多样性的数据上进行训练,从而提高泛化能力。

3.5 Simplifications

We also provide a few architecture changes that greatly simplify the original RAFT [50]. First, we adopt truncated, ImageNet [7] pre-trained ResNets for the backbones. We also substitute the ConvGRU in RAFT with two ConvNeXt [28] blocks, which we show provides better efficiency and training stability. The detailed ablations of these changes are shown in Tab. 4.

我们还提供了一些架构上的改变,这些改变极大地简化了原始的 RAFT [50]。首先,我们采用截断的、在 ImageNet [7] 上预训练的 ResNets 作为主干网络。我们还用两个 ConvNeXt [28] 块替换了 RAFT 中的 ConvGRU,我们表明这提供了更好的效率和训练稳定性。这些改变的详细消融实验结果见表 4。
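
For example, a truncated, ImageNet-pretrained ResNet from torchvision can serve as a stride-8 encoder as sketched below; the exact truncation point (which layers count as the "first 6 layers" of ResNet-18) is our assumption for illustration.

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

def truncated_resnet18_encoder():
    """Sketch: ImageNet-pretrained ResNet-18 cut after layer2, giving features
    at 1/8 of the input resolution (128 channels)."""
    net = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
    return nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,  # stride 4
                         net.layer1, net.layer2)                     # stride 8 overall
```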

4 Experiments

We evaluate SEA-RAFT on Spring [35], KITTI [12], and Sintel [3]. Following previous works, we also incorporate FlyingChairs [9], FlyingThings [34], and HD1K [23] into our training pipeline. To verify the effectiveness of TartanAir [52] rigid-flow pre-training, we provide the performance gain from it in different settings.

我们在Spring [35]、KITTI [12]和Sintel [3]上评估SEA-RAFT。按照先前的工作,我们还将在训练流程中加入FlyingChairs [9]、FlyingThings [34]和HD1K [23]。为了验证TartanAir [52]刚性流预训练的有效性,我们在不同设置下提供了其带来的性能提升。

Model Details SEA-RAFT is implemented in PyTorch [38]. There are three different types of SEA-RAFT and we denote them as SEA-RAFT(S/M/L). The only differences among them are the backbone choices and the number of iterations in inference. Specifically, SEA-RAFT(S) uses the first 6 layers of ResNet-18 as the feature/context encoder, and SEA-RAFT(M) uses the first 13 layers of ResNet-34. The pre-trained weights we use are downloaded from torchvision. SEA-RAFT(S) and SEA-RAFT(M) use the same architecture for the recurrent units and keep the number of iterations \(N = 4\) in both training and inference. SEA-RAFT(L) can be regarded as an extension based on SEA-RAFT(M): they share the same weights,but SEA-RAFT(L) uses \(N = {12}\) iterations in inference. Following RAFT [50],we stop the gradient for \(\mu\) when computing \({\mu }^{\prime } = \mu + {\Delta \mu }\) and only propagate the gradient for residual flow \({\Delta \mu }\) .

模型细节 SEA-RAFT在PyTorch [38]中实现。有三种不同类型的SEA-RAFT,我们分别表示为SEA-RAFT(S/M/L)。它们之间的唯一区别在于骨干选择和推理中的迭代次数。具体来说,SEA-RAFT(S)使用ResNet-18的前6层作为特征/上下文编码器,而SEA-RAFT(M)使用ResNet-34的前13层。我们使用的预训练权重从torchvision下载。SEA-RAFT(S)和SEA-RAFT(M)在循环单元中使用相同的架构,并在训练和推理中保持迭代次数\(N = 4\)。SEA-RAFT(L)可以视为基于SEA-RAFT(M)的扩展:它们共享相同的权重,但SEA-RAFT(L)在推理中使用\(N = {12}\)次迭代。按照RAFT [50],我们在计算\({\mu }^{\prime } = \mu + {\Delta \mu }\)时停止\(\mu\)的梯度,并且仅传播残余流\({\Delta \mu }\)的梯度。

Training Details As mentioned in Sec. 3.4, we pre-train SEA-RAFT on TartanAir [52] for 300k steps with a batch size of 32, input resolution \(480 \times 640\) and learning rate \(4 \times 10^{-4}\). Similar to RAFT [50], MaskFlowNet [65] and PWC-Net+ [45], we then train our models on FlyingChairs [9] for 100k steps with a batch size of 16, input resolution \(368 \times 496\), learning rate \(2.5 \times 10^{-4}\) and FlyingThings3D [34] for 120k steps with a batch size of 32, input resolution \(432 \times 960\), learning rate \(4 \times 10^{-4}\) (denoted as "C+T" following previous works). For the submissions on the Sintel [3] benchmark, we fine-tune the model from "C+T" on a mixture of Sintel [3], FlyingThings3D clean pass [34], KITTI [12] and HD1K [23] for 300k steps with a batch size of 32, input resolution \(432 \times 960\) and learning rate \(4 \times 10^{-4}\) (denoted as "C+T+S+K+H" following previous works). Different from previous methods, we reduce the percentage of Sintel [3] in the mixture dataset, which is usually more than 70% in previous papers. Details will be mentioned in the supplementary material. For KITTI [12] submissions, we fine-tune our models from "C+T+S+K+H" on the KITTI training set for an extra 10k steps with a batch size of 16, input resolution \(432 \times 960\) and learning rate \(10^{-4}\). For Spring [35] submissions, we fine-tune our models from "C+T+S+K+H" on the Spring training set for an extra 120k steps with a batch size of 32, input resolution \(540 \times 960\) and learning rate \(4 \times 10^{-4}\).

训练细节 如第3.4节所述,我们在TartanAir [52]上对SEA-RAFT进行30万步的预训练,批量大小为32,输入分辨率\(480 \times 640\),学习率\(4 \times 10^{-4}\)。与RAFT [50]、MaskFlowNet [65]和PWC-Net+ [45]类似,我们随后在FlyingChairs [9]上对我们的模型进行10万步的训练,批量大小为16,输入分辨率\(368 \times 496\),学习率\(2.5 \times 10^{-4}\),并在FlyingThings3D [34]上进行12万步的训练,批量大小为32,输入分辨率\(432 \times 960\),学习率\(4 \times 10^{-4}\)(按照先前的工作表示为"C+T")。对于Sintel [3]基准的提交,我们从"C+T"出发,在Sintel [3]、FlyingThings3D clean pass [34]、KITTI [12]和HD1K [23]的混合数据集上进行30万步的微调,批量大小为32,输入分辨率\(432 \times 960\),学习率\(4 \times 10^{-4}\)(按照先前的工作表示为"C+T+S+K+H")。与先前的方法不同,我们减少了混合数据集中Sintel [3]的比例,这在先前的论文中通常超过70%。细节将在补充材料中提及。对于KITTI [12]的提交,我们从"C+T+S+K+H"出发,在KITTI训练集上进行额外的1万步微调,批量大小为16,输入分辨率\(432 \times 960\),学习率\(10^{-4}\)。对于Spring [35]的提交,我们从"C+T+S+K+H"出发,在Spring训练集上进行额外的12万步微调,批量大小为32,输入分辨率\(540 \times 960\),学习率\(4 \times 10^{-4}\)。

| Extra Data | Method | Fine-tune | Spring(train) 1px↓ | Spring(train) EPE↓ | Spring(test) 1px↓ | Spring(test) EPE↓ | Spring(test) Fl↓ | Spring(test) WAUC↑ |
|---|---|---|---|---|---|---|---|---|
| – | PWC-Net [45] | ✗ | – | – | 82.27* | 2.288* | 4.889* | 45.670* |
| – | FlowNet2 [17] | ✗ | – | – | 6.710* | 1.040* | 2.823* | 90.907* |
| – | RAFT [50] | ✗ | 4.788 | 0.448 | 6.790* | 1.476* | 3.198* | 90.920* |
| – | GMA [20] | ✗ | 4.763 | 0.443 | 7.074* | 0.914* | 3.079* | 90.722* |
| – | RPKNet [37] | ✗ | 4.472 | 0.416 | 4.809 | 0.657 | 1.756 | 92.638 |
| – | DIP [68] | ✗ | 4.273 | 0.463 | – | – | – | – |
| – | SKFlow [48] | ✗ | 4.521 | 0.408 | – | – | – | – |
| – | GMFlow [56] | ✗ | 29.49 | 0.930 | 10.355* | 0.945* | 2.952* | 82.337* |
| – | GMFlow+ [57] | ✗ | 4.292 | 0.433 | – | – | – | – |
| – | Flowformer [14] | ✗ | 4.508 | 0.470 | 6.510* | 0.723* | 2.384* | 91.679* |
| – | CRAFT [44] | ✗ | 4.803 | 0.448 | – | – | – | – |
| – | SEA-RAFT(S) | ✗ | 4.077 | 0.415 | – | – | – | – |
| – | SEA-RAFT(M) | ✗ | 4.060 | 0.406 | – | – | – | – |
| MegaDepth [26] | MatchFlow(G) [8] | ✗ | 4.504 | 0.407 | – | – | – | – |
| YouTube-VOS [59] | Flowformer++ [43] | ✗ | 4.482 | 0.447 | – | – | – | – |
| VIPER [41] | MS-RAFT+ [19] | ✗ | 3.577 | 0.397 | 5.724* | 0.643* | 2.189* | 92.888* |
| TartanAir [52] | SEA-RAFT(S) | ✗ | 4.161 | 0.410 | – | – | – | – |
| TartanAir [52] | SEA-RAFT(M) | ✗ | 3.888 | 0.406 | – | – | – | – |
| CroCo-Pretrain | CroCoFlow [55] | ✓ | – | – | 4.565 | 0.498 | 1.508 | 93.660 |
| CroCo-Pretrain | Win-Win [24] | ✓ | – | – | 5.371 | 0.475 | 1.621 | 92.270 |
| TartanAir [52] | SEA-RAFT(S) | ✓ | – | – | 3.904 | 0.377 | 1.389 | 94.182 |
| TartanAir [52] | SEA-RAFT(M) | ✓ | – | – | 3.686 | 0.363 | 1.347 | 94.534 |

Table 1: SEA-RAFT outperforms existing methods on Spring [35] in different settings. * denotes results submitted by the Spring [35] team. By default, all methods have undergone "C+T+S+K+H" training. We list the data used by each method beyond the default in the "Extra Data" column. On Spring(test), even our smallest model SEA-RAFT(S) surpasses existing methods by a significant margin. Without fine-tuning on Spring(train), SEA-RAFT outperforms all other methods that do not use extra data.

表1:在不同设置下,SEA-RAFT在Spring [35]上优于现有方法。*表示由Spring [35]团队提交的结果。默认情况下,所有方法都经过了"C+T+S+K+H"训练。我们在"额外数据"列中列出了每种方法使用的超出默认设置的数据。在Spring(测试)上,即使是我们最小的模型SEA-RAFT(S)也以显著优势超过了现有方法。在没有对Spring(训练)进行微调的情况下,SEA-RAFT优于所有不使用额外数据的其他方法。


Metrics We adopt the widely used metrics in this study: endpoint-error (EPE), 1-pixel outlier rate (1px), Fl-score and WAUC error. Definitions can be found in \(\left\lbrack {{12},{35},{41}}\right\rbrack\) .

我们在这项研究中采用了广泛使用的指标:终点误差(EPE)、1像素离群率(1px)、Fl-score和WAUC误差。定义可以在\(\left\lbrack {{12},{35},{41}}\right\rbrack\)中找到。
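
As a reference, the two main metrics can be computed as below (a sketch; benchmark-specific conventions such as reporting 1px and Fl as percentages, or KITTI's combined threshold for Fl, are omitted here):

```python
import torch

def epe_and_1px(flow_pred, flow_gt):
    """Endpoint-error and 1-pixel outlier rate (fraction of pixels with EPE > 1px)."""
    err = torch.norm(flow_pred - flow_gt, p=2, dim=1)   # per-pixel Euclidean error, (B, H, W)
    return err.mean().item(), (err > 1.0).float().mean().item()
```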

4.1 Results on Spring

Zero-Shot Evaluation We compare several representative existing methods with SEA-RAFT on the Spring [35] training split, using the checkpoints and configurations of their Sintel [3] submissions. For fair comparisons, we remove test-time optimizations such as tiling in this setting, since they significantly slow down inference. All experiments follow the same downsample-upsample protocol: we first downsample the 1080p images by 2×, do inference, and then bilinearly upsample the flow field back to 1080p, which ensures the input resolution in inference is similar to the training resolution in "C+T+S+K+H".

零样本评估 我们使用Sintel [3]提交的检查点和配置,在Spring [35]训练分割上将几个具有代表性的现有方法与SEA-RAFT进行比较。为了公平比较,我们在这个设置中移除了测试时优化,如平铺,这将显著降低推理速度。所有实验都遵循相同的下采样-上采样协议:我们首先将1080p图像下采样2倍,进行推理,然后将流场双线性上采样回1080p,确保推理中的输入分辨率与其在“C+T+S+K+H”中的训练分辨率相似。
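
The protocol itself is simple; a sketch is given below. Note that when the flow field is bilinearly upsampled back to full resolution, the flow vectors are also rescaled by the same factor (an implementation detail we assume here, since displacements grow with image size).

```python
import torch.nn.functional as F_nn

def downsample_upsample(model, img1, img2, factor=2):
    """Sketch of the downsample-upsample protocol used for 1080p inference."""
    small1 = F_nn.interpolate(img1, scale_factor=1 / factor, mode="bilinear", align_corners=False)
    small2 = F_nn.interpolate(img2, scale_factor=1 / factor, mode="bilinear", align_corners=False)
    flow = model(small1, small2)                          # (B, 2, H/factor, W/factor)
    flow = F_nn.interpolate(flow, scale_factor=factor, mode="bilinear", align_corners=False)
    return flow * factor                                  # rescale displacements to full resolution
```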


| Extra Data | Method | Sintel Clean↓ | Sintel Final↓ | KITTI Fl-epe↓ | KITTI Fl-all↓ |
|---|---|---|---|---|---|
| – | PWC-Net [45] | 2.55 | 3.93 | 10.4 | 33.7 |
| – | RAFT [50] | 1.43 | 2.71 | 5.04 | 17.4 |
| – | GMA [20] | 1.30 | 2.74 | 4.69 | 17.1 |
| – | SKFlow [48] | 1.22 | 2.46 | 4.27 | 15.5 |
| – | FlowFormer [14] | 1.01 | 2.40 | 4.09† | 14.7† |
| – | DIP [68] | 1.30 | 2.82 | 4.29 | 13.7 |
| – | EMD-L [6] | 0.88 | 2.55 | 4.12 | 13.5 |
| – | CRAFT [44] | 1.27 | 2.79 | 4.88 | 17.5 |
| – | RPKNet [37] | 1.12 | 2.45 | – | 13.0 |
| – | GMFlowNet [66] | 1.14 | 2.71 | 4.24 | 15.4 |
| – | SEA-RAFT(M) | 1.21 | 4.04 | 4.29 | 14.2 |
| – | SEA-RAFT(L) | 1.19 | 4.11 | 3.62 | 12.9 |
| – | GMFlow [56] | 1.08 | 2.48 | 11.2* | 28.7* |
| Tartan | GMFlow [56] | – | – | 8.70 (-22%)* | 24.4 (-15%)* |
| – | SEA-RAFT(S) | 1.27 | 4.32 | 4.61 | 15.8 |
| Tartan | SEA-RAFT(S) | 1.27 | 3.74 (-13%) | 4.43 | 15.1 |
| K+H | SEA-RAFT(S) | 1.32 | 2.95 (-32%) | – | – |
| Tartan+K+H | SEA-RAFT(S) | 1.30 | 2.79 (-35%) | – | – |

Table 2: SEA-RAFT achieves the best zero-shot performance on KITTI(train). By default, all methods are trained with "C+T". We list the extra data in the first column. † denotes the method uses tiling in inference. * denotes the GMFlow [56] ablation with 200k training steps.

表2:SEA-RAFT在KITTI(训练集)上实现了最佳的零样本性能。默认情况下,所有方法都使用"C+T"进行训练。我们在第一列列出了额外数据。†表示该方法在推理中使用了平铺技术。*表示使用20万训练步数的GMFlow [56]消融实验。

As shown in Tab. 1, SEA-RAFT achieves the best results among representative existing methods without using extra data, which demonstrates the superiority of our mixture loss and architecture design. When allowed to use extra data, SEA-RAFT falls slightly behind MS-RAFT+ [19] but is \({24} \times\) faster and \({11} \times\) smaller as mentioned in Fig. 1.

如表1所示,SEA-RAFT在不使用额外数据的情况下,在现有代表性方法中取得了最佳结果,这证明了我们的混合损失和架构设计的优越性。当允许使用额外数据时,SEA-RAFT略逊于MS-RAFT+ [19],但正如图1所述,速度更快且模型更小。

Fine-Tuning Test SEA-RAFT ranks 1st on the public test benchmark: SEA-RAFT(M) outperforms all other methods by at least 22.9% on average EPE(endpoint-error) and 17.8% on 1px (1-pixel outlier rate), and SEA-RAFT(S) outperforms other methods by at least \({20.0}\%\) on EPE and \({12.8}\%\) on \(1\mathrm{{px}}\) . Besides the strong performance,our method is notably fast. SEA-RAFT(S) is at least \({2.3} \times\) faster than existing methods which can achieve similar performance. As we still follow the downsample-upsample protocol without using any test-time optimizations in submissions, the inference latency directly reflects our speed in handling 1080p images, which means over 20fps on a single RTX3090.

微调测试 SEA-RAFT在公共测试基准上排名第一:SEA-RAFT(M)在平均端点误差(EPE)和1像素异常率(1px)上分别比所有其他方法至少高出22.9%和17.8%,而SEA-RAFT(S)在EPE和\(1\mathrm{{px}}\)上分别比其他方法至少高出\({20.0}\%\)\({12.8}\%\)。除了强大的性能外,我们的方法还非常快速。SEA-RAFT(S)比现有能够达到类似性能的方法至少快\({2.3} \times\)。由于我们仍然遵循下采样-上采样协议,且在提交中未使用任何测试时优化,因此推理延迟直接反映了我们在处理1080p图像时的速度,这意味着在单个RTX3090上超过20fps。

4.2 Results on Sintel and KITTI

Zero-Shot Evaluation Following previous works, we evaluate the zero-shot performance of SEA-RAFT under the "C+T" training schedule on Sintel(train) [3] and KITTI(train) [36]. The results are provided in Tab. 2.

零样本评估 遵循先前的工作,我们在Sintel(训练集)[3]和KITTI(训练集)[36]上,使用训练计划"C+T"评估SEA-RAFT的零样本性能。结果在表2中提供。

| Extra Data | Method | Sintel Clean↓ | Sintel Final↓ | KITTI Fl-all↓ | KITTI Fl-bg↓ | KITTI Fl-fg↓ | #MACs | Latency |
|---|---|---|---|---|---|---|---|---|
| – | PWC-Net+ [46] | 3.45 | 4.60 | 7.72 | 7.69 | 7.88 | 101.3G | 23.82ms |
| – | RAFT [50] | 1.61* | 2.86* | 5.10 | 4.74 | 6.87 | 938.2G | 140.7ms |
| – | GMA [20] | 1.39* | 2.47* | 5.15 | – | – | 1352G | 183.3ms |
| – | DIP [68] | 1.44* | 2.83* | 4.21 | 3.86 | 5.96 | 3068G | 498.9ms |
| – | GMFlowNet [66] | 1.39 | 2.65 | 4.79 | 4.39 | 6.84 | 1094G | 244.3ms |
| – | GMFlow [56] | 1.74 | 2.90 | 9.32 | 9.67 | 7.57 | 602.6G | 138.5ms |
| – | CRAFT [44] | 1.45* | 2.42* | 4.79 | 4.58 | 5.85 | 2274G | 483.4ms |
| – | FlowFormer [14] | 1.20 | 2.12 | 4.68† | 4.37† | 6.18† | 1715G | 335.6ms |
| – | SKFlow [48] | 1.28* | 2.23* | 4.85 | 4.55 | 6.39 | 1453G | 331.9ms |
| – | GMFlow+ [57] | 1.03 | 2.37 | 4.49 | 4.27 | 5.60 | 1177G | 249.6ms |
| – | EMD-L [6] | 1.32 | 2.51 | 4.49 | 4.16 | 6.15 | 1755G | OOM |
| – | RPKNet [37] | 1.31 | 2.65 | 4.64 | 4.63 | 4.69 | 137.0G | 183.3ms |
| VIPER [41] | CCMR+ [18] | 1.07 | 2.10 | 3.86 | 3.39 | 6.21 | 12653G | OOM |
| MegaDepth [26] | MatchFlow(G) [8] | 1.16* | 2.37* | 4.63† | 4.33† | 6.11† | 1669G | 290.6ms |
| YouTube-VOS [59] | Flowformer++ [43] | 1.07 | 1.94 | 4.52† | – | – | 1713G | 373.4ms |
| CroCo-Pretrain | CroCoFlow [55] | 1.09† | 2.44† | 3.64† | 3.18† | 5.94† | 57343G† | 6422ms† |
| DDVM-Pretrain | DDVM [42] | 1.75† | 2.48† | 3.26† | 2.90† | 5.05† | – | – |
| TartanAir [52] | SEA-RAFT(M) | 1.44 | 2.86 | 4.64 | 4.47 | 5.49 | 486.9G | 70.96ms |
| TartanAir [52] | SEA-RAFT(L) | 1.31 | 2.60 | 4.30 | 4.08 | 5.37 | 655.1G | 108.0ms |

Table 3: Compared with other methods that achieve competitive performance, SEA-RAFT is at least 1.8× faster on Sintel(test) [3] and 4.6× faster on KITTI(test) [36]. All methods have "C+T+S+K+H" training by default and we list the extra data each method uses in the first column. We measure latency on an RTX3090 with a batch size of 1 and input resolution 540×960. * denotes the method uses the warm-start [50] strategy. † denotes that the corresponding methods use tiling-based test-time optimizations.

表3:与其他达到相近性能的方法相比,SEA-RAFT在Sintel(测试集)[3]上至少快1.8×,在KITTI(测试集)[36]上至少快4.6×。所有方法默认采用"C+T+S+K+H"训练,我们在第一列列出了每种方法使用的额外数据。我们在RTX3090上以批量大小为1和输入分辨率540×960测量延迟。*表示该方法使用了热启动 [50] 策略。†表示相应的方法使用了基于平铺的测试时优化。

On KITTI(train), SEA-RAFT outperforms all prior works by a large margin, improving Fl-epe from 4.09 to 3.62 and Fl-all from 13.7 to 12.9. On Sintel(train), SEA-RAFT achieves competitive results on the clean pass but, for reasons unclear to us, underperforms existing methods on the final pass. Note that although this "C+T" zero-shot setting is standard, it is of limited relevance to real-world applications, which do not need to restrict the training data to only C+T. Indeed, we show that by adding a small amount of high-quality real-world data (KITTI + HD1K, about 1.2k image pairs compared with 80k image pairs in FlyingThings3D [34]), the performance gap on the Sintel(train) final pass can be remarkably reduced.

在KITTI(训练集)上,SEA-RAFT大幅超越所有先前的工作,将Fl-epe从4.09降低到3.62,将Fl-all从13.7降低到12.9。在Sintel(训练集)上,SEA-RAFT在clean通道上取得了有竞争力的结果,但由于某些我们不清楚的原因,在final通道上表现不如现有方法。请注意,尽管这种"C+T"零样本设置是标准的,但它与实际应用的相关性有限,因为实际应用不需要将训练数据限制为仅C+T。事实上,我们表明,通过添加少量高质量的现实世界数据(KITTI + HD1K,约1.2k图像对,而FlyingThings3D [34]中有80k图像对),可以显著缩小Sintel(训练集)final通道上的性能差距。

Fine-Tuning Test Results are shown in Tab. 3. Compared with RAFT [50], SEA-RAFT achieves 19.9% improvement on the Sintel clean pass, 4.2% improvement on the Sintel final pass, and 15.7% improvement on the KITTI Fl-all score. SEA-RAFT is also competitive among all existing methods in terms of the performance-speed trade-off: it is the only method that can achieve results better than RAFT [50] with latency around 70ms. On Sintel(test), methods with similar performance are at least 1.8× slower than us. On KITTI(test), methods with similar performance are at least 4.6× slower than us.

微调测试结果如表 3 所示。与 RAFT [50] 相比,SEA-RAFT 在 Sintel 干净通道上实现了 \({19.9}\%\) 的改进,在 Sintel 最终通道上实现了 \({4.2}\%\) 的改进,在 KITTI F1-all 评分上实现了 \({15.7}\%\) 的改进。在性能-速度权衡方面,SEA-RAFT 在所有现有方法中也具有竞争力:它是唯一一种能够在约 70ms 延迟下实现比 RAFT [50] 更好结果的方法。在 Sintel(测试集)上,性能相似的方法至少比我们慢 \({1.8} \times\)。在 KITTI(测试集)上,性能相似的方法至少比我们慢 \({4.6} \times\)


| Experiment | Init. | Pre-train Img [7] | Pre-train Tar [52] | GRU | #blocks | Loss Type | Loss Params | #MACs | EPE |
|---|---|---|---|---|---|---|---|---|---|
| SEA-RAFT (w/o Tar.) | ✓ | ✓ | – | – | 2 | Mixture-of-Laplace | \(\beta_1 = 0, \beta_2 \in [0,10]\) | 284.7G | 0.187 |
| SEA-RAFT (w/ Tar.) | ✓ | ✓ | ✓ | – | 2 | Mixture-of-Laplace | \(\beta_1 = 0, \beta_2 \in [0,10]\) | 284.7G | 0.179 |
| w/o Img. | ✓ | – | – | – | 2 | Mixture-of-Laplace | \(\beta_1 = 0, \beta_2 \in [0,10]\) | 284.7G | 0.194 |
| w/o Direct Reg. | ✗ | ✓ | – | – | 2 | Mixture-of-Laplace | \(\beta_1 = 0, \beta_2 \in [0,10]\) | 277.3G | 0.201 |
| RAFT GRU | ✓ | ✓ | – | ✓ | – | Mixture-of-Laplace | \(\beta_1 = 0, \beta_2 \in [0,10]\) | 297.9G | 0.189 |
| More ConvNeXt Blocks | ✓ | ✓ | – | – | 4 | Mixture-of-Laplace | \(\beta_1 = 0, \beta_2 \in [0,10]\) | 314.7G | 0.189 |
| Naive Laplace | ✓ | ✓ | – | – | 2 | Naive Single Laplace | \(\beta \in [-10,10]\) | 284.7G | 0.217 |
| Naive Laplace | ✓ | ✓ | – | – | 2 | Naive Mixture-of-Laplace | \(\beta_1, \beta_2 \in [-10,10]\) | 284.7G | 0.248 |
| L1 | ✓ | ✓ | – | – | 2 | \(L_1\) | – | 284.7G | 0.206 |
| Gaussian | ✓ | ✓ | – | – | 2 | Mixture-of-Gaussian | \(\sigma_1 = 1, \sigma_2 = e^{\beta_2}, \beta_2 \in [0,10]\) | 284.7G | 0.210 |

Table 4: We ablate pretraining, direct regression, RNN design, and loss designs on Spring [35] subval. The effect of changes can be identified through comparisons with the first row. See Sec. 4.3 for details.

表4:我们在 Spring [35] 子验证集上对预训练、直接回归、RNN设计和损失设计进行了消融实验。通过与第一行的比较可以识别出各项改动的影响。详见第4.3节。

Fig. 6: More iterations produce lower variance in the Mixture of Laplace, indicating that the model becomes more confident after each iteration.

图 6:更多迭代次数会降低拉普拉斯混合分布的方差,表明模型在每次迭代后变得更加自信。

4.3 Ablations and Analysis

Ablation experiments are conducted on the Spring [35] dataset based on SEA-RAFT(S). We separate a subval set (sequence 0045 and 0047) from the original training set, train our model on the remaining training data and evaluate the performance on subval. The model is trained with a batch size of 32 , input resolution \({540} \times {960}\) ,and tested following "downsample-upsample" protocol mentioned in Sec. 4.1. We describe the details of ablation studies in the following and show the results in Tab. 4:

基于 SEA-RAFT(S) 在 Spring [35] 数据集上进行了消融实验。我们从原始训练集中分离出一个子验证集(序列 0045 和 0047),使用剩余的训练数据训练我们的模型,并在子验证集上评估性能。模型以 32 的批量大小、输入分辨率 \({540} \times {960}\) 进行训练,并按照第 4.1 节提到的“下采样-上采样”协议进行测试。我们在下文中详细描述了消融研究的细节,并在表 4 中展示了结果:

Pretraining We test the effect of TartanAir [52] rigid-flow pre-training in different settings (see Tabs. 1, 2 and 4 for details). Without TartanAir, SEA-RAFT already provides strong performance, and the rigid-flow pre-training makes it better. We also show that ImageNet pre-trained weights are effective.

预训练 我们测试了TartanAir [52] 刚性流预训练在不同数据集上的表现(详情见表1、2和4)。没有TartanAir,SEA-RAFT已经提供了强大的性能,而刚性流预训练使其更上一层楼。我们还展示了ImageNet预训练权重是有效的。

RNN Design Our new RNN designs can reduce the computation without performance loss compared with the GRU used in RAFT [50]. We also show that on Spring subval, 4 ConvNeXt blocks do not work better than 2 ConvNeXt blocks.

RNN设计 我们的新RNN设计可以在不损失性能的情况下减少计算量,相比于RAFT [50]中使用的GRU。我们还展示了在Spring子集上,4个ConvNeXt块并不比2个ConvNeXt块效果更好。

Fig. 7: Iterative refinements are not hardware-friendly: The latency almost linearly increases with the number of iterations.

图7:迭代细化对硬件不友好:延迟几乎线性地随着迭代次数增加。

| Method | #Iters | Total Latency (ms) | Iter. Latency (ms) |
|---|---|---|---|
| RAFT [50] | 24 (K) | 111 | 90.3 (82%) |
| RAFT [50] | 32 (S) | 141 | 120 (86%) |
| SEA-RAFT | 4 (S) | 47.5 | 18.5 (39%) |
| SEA-RAFT | 4 (M) | 70.9 | 18.5 (26%) |
| SEA-RAFT | 12 (L) | 108 | 55.5 (51%) |

Table 5: Compared with RAFT, SEA-RAFT significantly reduces the cost of iterative refinements, which allows larger backbones while still being faster. We use \(\mathrm{K}\) and \(\mathrm{S}\) to denote RAFT submissions on KITTI and Sintel respectively.

表5:与RAFT相比,SEA-RAFT显著降低了迭代细化的成本,这使得在保持更快速度的同时可以使用更大的主干网络。我们用\(\mathrm{K}\)\(\mathrm{S}\)分别表示在KITTI和Sintel上的RAFT提交。

Loss Design We see that naive Laplace regression does worse than the original \({L}_{1}\) loss. We also see that it is important to set \({\beta }_{1}\) to 0 in the MoL loss,which aligns the MoL loss to \({L}_{1}\) for ordinary cases. Besides,we find that the mixture of Gaussian loss does not work well for optical flow, even though it has been found to be useful for image matching [4].

损失设计 我们发现朴素的拉普拉斯回归不如原始的\({L}_{1}\)损失。我们还发现,在MoL损失中将\({\beta }_{1}\)设置为0很重要,这使得MoL损失在普通情况下与\({L}_{1}\)对齐。此外,我们发现高斯混合损失对于光流效果不佳,尽管它已被发现对图像匹配有用[4]。

Direct Regression of Initial Flow We see that the regressed flow initialization significantly improves accuracy without introducing much overhead.

初始流直接回归 我们发现回归的流初始化显著提高了准确性,而没有引入太多开销。

Inference Time Breakdown In Fig. 7, we show how the computational cost increases when we add more refinements. The cost bottleneck for SEA-RAFT is no longer iterative refinements ( Tab. 5), which allows us to use larger backbones given the same computational cost constraint as RAFT [50].

推理时间分解 在图7中,我们展示了当我们增加更多细化时计算成本的增加情况。SEA-RAFT的成本瓶颈不再是迭代细化(见表5),这使得在相同的计算成本约束下可以使用更大的主干网络,如同RAFT [50]。

5 Conclusion

We have introduced SEA-RAFT, a simpler, more efficient and accurate variant of RAFT. It achieves high accuracy across a diverse range of datasets, strong cross-dataset generalization, and state-of-the-art accuracy-speed trade-offs, making it useful for real-world high-resolution optical flow.

我们介绍了 SEA-RAFT,这是 RAFT 的一个更简单、更高效且更准确的变体。它在多种数据集上实现了高精度,具有强大的跨数据集泛化能力,并且在精度和速度之间达到了最先进的平衡,使其适用于现实世界的高分辨率光流计算。

Acknowledgements

This work was partially supported by the National Science Foundation.

这项工作部分得到了国家科学基金会的支持。

References

  1. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. International journal of computer vision 92, 1-31 (2011) 3, 8

  2. Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural network. In: International conference on machine learning. pp. 1613-1622. PMLR (2015) 3

  3. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: European conference on computer vision. pp. 611- 625. Springer (2012) 1, 3, 8, 9, 10, 11, 12

  4. Chen, H., Luo, Z., Zhou, L., Tian, Y., Zhen, M., Fang, T., Mckinnon, D., Tsin, Y., Quan, L.: Aspanformer: Detector-free image matching with adaptive span transformer. In: European Conference on Computer Vision. pp. 20-36. Springer (2022) 3, 4, 5, 7, 14

  5. Chen, Q., Koltun, V.: Full flow: Optical flow estimation by global optimization over regular grids. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4706-4714 (2016) 1,3

  6. Deng, C., Luo, A., Huang, H., Ma, S., Liu, J., Liu, S.: Explicit motion disentangling for efficient optical flow estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9521-9530 (2023) 1, 3, 5, 11, 12

  7. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248-255. Ieee (2009) 9, 13

  8. Dong, Q., Cao, C., Fu, Y.: Rethinking optical flow from geometric matching consistent perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1337-1347 (2023) 1, 3, 8, 10, 12

  9. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2758-2766 (2015) 3, 5, 8, 9

  10. Gao, C., Saraf, A., Huang, J.B., Kopf, J.: Flow-edge guided video completion. In: European Conference on Computer Vision. pp. 713-729. Springer (2020) 1

  11. Garrepalli, R., Jeong, J., Ravindran, R.C., Lin, J.M., Porikli, F.: Dift: Dynamic iterative field transforms for memory efficient optical flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2219-2228 (2023) 1, 3, 5

  12. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32(11), 1231-1237 (2013) 3, 9, 10

  13. Horn, B.K., Schunck, B.G.: Determining optical flow. Artificial intelligence 17(1-3), 185-203 (1981) 1,3

  14. Huang, Z., Shi, X., Zhang, C., Wang, Q., Cheung, K.C., Qin, H., Dai, J., Li, H.: Flowformer: A transformer architecture for optical flow. In: European Conference on Computer Vision. pp. 668-685. Springer (2022) 1, 3, 5, 8, 10, 11, 12

  15. Huang, Z., Zhang, T., Heng, W., Shi, B., Zhou, S.: Rife: Real-time intermediate flow estimation for video frame interpolation. arXiv preprint arXiv:2011.06294 (2020) 1

  16. Hui, T.W., Tang, X., Loy, C.C.: Liteflownet: A lightweight convolutional neural network for optical flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8981-8989 (2018) 3

  17. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2462-2470 (2017) 3, 8, 10

  18. Jahedi, A., Luz, M., Rivinius, M., Bruhn, A.: Ccmr: High resolution optical flow estimation via coarse-to-fine context-guided motion reasoning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6899- 6908 (2024) 3, 5, 12

  19. Jahedi, A., Luz, M., Rivinius, M., Mehl, L., Bruhn, A.: Ms-raft+: High resolution multi-scale raft. International Journal of Computer Vision pp. 1-22 (2023) 2, 3, 5, 10,11

  20. Jiang, S., Campbell, D., Lu, Y., Li, H., Hartley, R.: Learning to estimate hidden motions with global motion aggregation. arXiv preprint arXiv:2104.02409 (2021) 10,11,12

  21. Jung, H., Hui, Z., Luo, L., Yang, H., Liu, F., Yoo, S., Ranjan, R., Demandolx, D.: Anyflow: Arbitrary scale optical flow with implicit neural representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5455-5465 (2023) 5

  22. Kim, D., Woo, S., Lee, J.Y., Kweon, I.S.: Deep video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5792-5801 (2019) 1

  23. Kondermann, D., Nair, R., Honauer, K., Krispin, K., Andrulis, J., Brock, A., Gusse-feld, B., Rahimimoghaddam, M., Hofmann, S., Brenner, C., et al.: The hci benchmark suite: Stereo and flow ground truth with uncertainties for urban autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 19-28 (2016) 3, 9

  24. Leroy, V., Revaud, J., Lucas, T., Weinzaepfel, P.: Win-win: Training high-resolution vision transformers from two windows. arXiv preprint arXiv:2310.00632 (2023) 1, 3, 10

  25. Li, J., Bian, S., Zeng, A., Wang, C., Pang, B., Liu, W., Lu, C.: Human pose regression with residual log-likelihood estimation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11025-11034 (2021) 3

  26. Li, Z., Snavely, N.: Megadepth: Learning single-view depth prediction from internet photos. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2041-2050 (2018) 3, 10, 12

  27. Liu, X., Liu, H., Lin, Y.: Video frame interpolation via optical flow estimation with image inpainting. International Journal of Intelligent Systems 35(12), 2087-2102 (2020) 1

  28. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976-11986 (2022) 9

  29. Lu, Y., Wang, Q., Ma, S., Geng, T., Chen, Y.V., Chen, H., Liu, D.: Transflow: Transformer as flow learner. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18063-18073 (2023) 1

  30. Luo, A., Yang, F., Li, X., Liu, S.: Learning optical flow with kernel patch attention. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8906-8915 (2022) 3

  31. Luo, A., Yang, F., Li, X., Nie, L., Lin, C., Fan, H., Liu, S.: Gaflow: Incorporating gaussian attention into optical flow. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9642-9651 (2023) 3, 8

  32. Luo, A., Yang, F., Luo, K., Li, X., Fan, H., Liu, S.: Learning optical flow with adaptive graph reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) (2022) 3

  33. Ma, Z., Teed, Z., Deng, J.: Multiview stereo with cascaded epipolar raft. In: European Conference on Computer Vision. pp. 734-750. Springer (2022) 1

  34. Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4040-4048 (2016) 3, 8, 9, 12

  35. Mehl, L., Schmalfuss, J., Jahedi, A., Nalivayko, Y., Bruhn, A.: Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4981-4991 (2023) 1, 2, 3, 7, 8, 9, 10, 13

  36. Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3061-3070 (2015) 1, 3, 8, 11, 12

  37. Morimitsu, H., Zhu, X., Ji, X., Yin, X.C.: Recurrent partial kernel network for efficient optical flow estimation (2024) 3, 8, 10, 11, 12

  38. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019) 9

  39. Piergiovanni, A., Ryoo, M.S.: Representation flow for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9945-9953 (2019) 1

  40. Raistrick, A., Lipson, L., Ma, Z., Mei, L., Wang, M., Zuo, Y., Kayan, K., Wen, H., Han, B., Wang, Y., et al.: Infinite photorealistic worlds using procedural generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12630-12641 (2023) 3

  41. Richter, S.R., Hayder, Z., Koltun, V.: Playing for benchmarks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2213-2222 (2017) 3, 10, 12

  42. Saxena, S., Herrmann, C., Hur, J., Kar, A., Norouzi, M., Sun, D., Fleet, D.J.: The surprising effectiveness of diffusion models for optical flow and monocular depth estimation. Advances in Neural Information Processing Systems 36 (2024) 1, 3, 12

  43. Shi, X., Huang, Z., Li, D., Zhang, M., Cheung, K.C., See, S., Qin, H., Dai, J., Li, H.: Flowformer++: Masked cost volume autoencoding for pretraining optical flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1599-1610 (2023) 1, 3, 10, 12

  44. Sui, X., Li, S., Geng, X., Wu, Y., Xu, X., Liu, Y., Goh, R., Zhu, H.: Craft: Cross-attentional flow transformer for robust optical flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17602-17611 (2022) 1, 3, 10, 11, 12

  45. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8934-8943 (2018) 1, 3, 9, 10, 11

  46. Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Models matter, so does training: An empirical study of cnns for optical flow estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42(6), 1408-1423 (2019) 12

  47. Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: Loftr: Detector-free local feature matching with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8922-8931 (2021) 3, 4, 5, 7

  48. Sun, S., Chen, Y., Zhu, Y., Guo, G., Li, G.: Skflow: Learning optical flow with super kernels. Advances in Neural Information Processing Systems 35, 11313-11326 (2022) 1, 3, 8, 10, 11, 12

  49. Sun, S., Kuang, Z., Sheng, L., Ouyang, W., Zhang, W.: Optical flow guided feature: A fast and robust motion representation for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1390-1399 (2018) 1

  50. Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: European Conference on Computer Vision. pp. 402-419. Springer (2020) 1, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14

  51. Truong, P., Danelljan, M., Timofte, R., Van Gool, L.: Pdc-net+: Enhanced probabilistic dense correspondence network. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) 3, 4, 5, 7

  52. Wang, W., Zhu, D., Wang, X., Hu, Y., Qiu, Y., Wang, C., Hu, Y., Kapoor, A., Scherer, S.: Tartanair: A dataset to push the limits of visual slam. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 4909-4916. IEEE (2020) 2, 3, 8, 9, 10, 12, 13

  53. Wannenwetsch, A.S., Keuper, M., Roth, S.: Probflow: Joint optical flow and uncertainty estimation. In: Proceedings of the IEEE international conference on computer vision. pp. 1173-1182 (2017) 3, 5

  54. Weinzaepfel, P., Leroy, V., Lucas, T., Brégier, R., Cabon, Y., Arora, V., Antsfeld, L., Chidlovskii, B., Csurka, G., Revaud, J.: Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion. Advances in Neural Information Processing Systems 35, 3502-3516 (2022) 1, 3

  55. Weinzaepfel, P., Lucas, T., Leroy, V., Cabon, Y., Arora, V., Brégier, R., Csurka, G., Antsfeld, L., Chidlovskii, B., Revaud, J.: Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17969-17980 (2023) 1, 3, 5, 10, 12

  56. Xu, H., Zhang, J., Cai, J., Rezatofighi, H., Tao, D.: Gmflow: Learning optical flow via global matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8121-8130 (2022) 1, 3, 6, 10, 11, 12

  57. Xu, H., Zhang, J., Cai, J., Rezatofighi, H., Yu, F., Tao, D., Geiger, A.: Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023) 1, 3, 10, 12

  58. Xu, J., Ranftl, R., Koltun, V.: Accurate optical flow via direct cost volume processing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1289-1297 (2017) 3

  59. Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., Huang, T.: Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327 (2018) 10, 12

  60. Xu, R., Li, X., Zhou, B., Loy, C.C.: Deep flow-guided video inpainting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3723-3732 (2019) 1

  61. Xu, X., Siyao, L., Sun, W., Yin, Q., Yang, M.H.: Quadratic video interpolation. Advances in Neural Information Processing Systems 32 (2019) 1

  62. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Pattern Recognition: 29th DAGM Symposium, Heidelberg, Germany, September 12-14, 2007. Proceedings 29. pp. 214-223. Springer (2007) 1, 3

  63. Zhai, M., Xiang, X., Lv, N., Ali, S.M., El Saddik, A.: Skflow: Optical flow estimation using selective kernel networks. IEEE Access 7, 98854-98865 (2019) 1

  64. Zhang, S., Sun, X., Chen, H., Li, B., Shen, C.: Rgm: A robust generalist matching model. arXiv preprint arXiv:2310.11755 (2023) 3, 5, 6

  65. Zhao, S., Sheng, Y., Dong, Y., Chang, E.I., Xu, Y., et al.: Maskflownet: Asymmetric feature matching with learnable occlusion mask. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6278-6287 (2020) 3, 9

  66. Zhao, S., Zhao, L., Zhang, Z., Zhou, E., Metaxas, D.: Global matching with overlapping attention for optical flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17592-17601 (2022) 1, 3, 8, 11, 12

  67. Zhao, Y., Man, K.L., Smith, J., Siddique, K., Guan, S.U.: Improved two-stream model for human action recognition. EURASIP Journal on Image and Video Processing 2020(1), 1-9 (2020) 1

  68. Zheng, Z., Nie, N., Ling, Z., Xiong, P., Liu, J., Wang, H., Li, J.: Dip: Deep inverse patchmatch for high-resolution optical flow. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8925-8934 (2022) 1, 3, 10, 11, 12

  69. Zuo, Y., Deng, J.: View synthesis with sculpted neural points. arXiv preprint arXiv:2205.05869 (2022) 1
