Delving into Deep Imbalanced Regression (translation)
A free, somewhat abridged translation; errors are possible, and corrections are welcome in the comments.
Author's explanation and paper
Abstract
Real-world data often exhibit imbalanced distributions, where certain target values have significantly fewer observations. Existing techniques for dealing with imbalanced data focus on targets with categorical indices, i.e., different classes. However, many tasks involve continuous targets, where hard boundaries between classes do not exist. We define Deep Imbalanced Regression (DIR) as learning from such imbalanced data with continuous targets, dealing with potential missing data for certain target values, and generalizing to the entire target range. Motivated by the intrinsic difference between categorical and continuous label spaces, we propose distribution smoothing for both labels and features, which explicitly acknowledges the effects of nearby targets and calibrates both the label and learned feature distributions. We curate and benchmark large-scale DIR datasets from common real-world tasks in computer vision, natural language processing, and healthcare domains. Extensive experiments verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for practical imbalanced regression problems. Code and data are available at: [https://github.com/YyzHarry/imbalanced-regression](https://github.com/YyzHarry/imbalanced-regression).
1. Introduction
Data imbalance is ubiquitous and inherent in the real world. Rather than preserving an ideal uniform distribution over each category, the data often exhibit skewed distributions with a long tail (Buda et al., 2018; Liu et al., 2019), where certain target values have significantly fewer observations. This phenomenon poses great challenges for deep recognition models, and has motivated many prior techniques for addressing data imbalance (Cao et al., 2019; Cui et al., 2019; Huang et al., 2019; Liu et al., 2019; Tang et al., 2020).
Existing solutions for learning from imbalanced data, however, focus on targets with categorical indices, i.e., the targets are different classes. Yet many real-world tasks involve continuous and even infinite target values. For example, in vision applications, one needs to infer the age of different people based on their visual appearances, where age is a continuous target and can be highly imbalanced. Treating different ages as distinct classes is unlikely to yield the best results, because it does not take advantage of the similarity between people with nearby ages. Similar issues arise in medical applications, since many health metrics, including heart rate, blood pressure, and oxygen saturation, are continuous and often have skewed distributions across patient populations.
In this work, we systematically investigate Deep Imbalanced Regression (DIR) arising in real-world settings (see Fig. 1). We define DIR as learning continuous targets from natural imbalanced data, dealing with potentially missing data for certain target values, and generalizing to a test set that is balanced over the entire range of continuous target values. This definition is analogous to the class imbalance problem (Liu et al., 2019), but focuses on the continuous setting.
DIR brings new challenges distinct from its classification counterpart. First, given continuous (potentially infinite) target values, the hard boundaries between classes no longer exist, causing ambiguity when directly applying traditional imbalanced classification methods such as re-sampling and re-weighting. Moreover, continuous labels inherently possess a meaningful distance between targets, which has implications for how we should interpret data imbalance. For example, say two target labels $t_1$ and $t_2$ have a small number of observations in the training data. However, $t_1$ is in a highly represented neighborhood (i.e., there are many samples in the range $[t_1 - \Delta, t_1 + \Delta]$), while $t_2$ is in a weakly represented neighborhood. In this case, $t_1$ does not suffer from the same level of imbalance as $t_2$. Finally, unlike classification, certain target values may have no data at all, which motivates the need for target extrapolation and interpolation.
In this paper, we propose two simple yet effective methods for addressing DIR: label distribution smoothing (LDS) and feature distribution smoothing (FDS). A key idea underlying both approaches is to leverage the similarity between nearby targets by employing a kernel distribution to perform explicit distribution smoothing in the label and feature spaces. Both techniques can be easily embedded into existing deep networks and allow optimization in an end-to-end fashion. We verify that our techniques not only successfully calibrate for the intrinsic underlying imbalance, but also provide large and consistent gains when combined with other methods. To support practical evaluation of imbalanced regression, we curate and benchmark large-scale DIR datasets for common real-world tasks in computer vision, natural language processing, and healthcare. They range from single-value prediction such as age, text similarity score, and health condition score, to dense-value prediction such as depth. We further set up benchmarks for proper DIR performance evaluation.
Our contributions are as follows:
- We formally define the DIR task as learning from imbalanced data with continuous targets, and generalizing to the entire target range. DIR provides thorough and unbiased evaluation of learning algorithms in practical settings.
- We develop two simple, effective, and interpretable algorithms for DIR, LDS and FDS, which exploit the similarity between nearby targets in both label and feature space.
- We curate benchmark DIR datasets in different domains: computer vision, natural language processing, and healthcare. We set up strong baselines as well as benchmarks for proper DIR performance evaluation.
- Extensive experiments on large-scale DIR datasets verify the consistent and superior performance of our strategies.
2. Related Work
Imbalanced Classification. Much prior work has focused on the imbalanced classification problem (also referred to as long-tailed recognition (Liu et al., 2019)). Past solutions can be divided into data-based and model-based solutions: data-based solutions either over-sample the minority class or under-sample the majority (Chawla et al., 2002; García & Herrera, 2009; He et al., 2008). For example, SMOTE generates synthetic samples for minority classes by linearly interpolating samples in the same class (Chawla et al., 2002). Model-based solutions include re-weighting or adjusting the loss function to compensate for class imbalance (Cao et al., 2019; Cui et al., 2019; Dong et al., 2019; Huang et al., 2016; 2019), and leveraging relevant learning paradigms, including transfer learning (Yin et al., 2019), metric learning (Zhang et al., 2017), meta-learning (Shu et al., 2019), and two-stage training (Kang et al., 2020). Recent studies have also discovered that semi-supervised learning and self-supervised learning lead to better imbalanced classification results (Yang & Xu, 2020). In contrast to this past work, we identify the limitations of applying class imbalance methods to regression problems, and introduce new techniques particularly suitable for learning continuous target values.
Imbalanced Regression. Regression over imbalanced data is not as well explored. Most of the work on this topic is a direct adaptation of the SMOTE algorithm to regression scenarios (Branco et al., 2017; 2018; Torgo et al., 2013). Synthetic samples are created for pre-defined rare target regions by either directly interpolating both inputs and targets (Torgo et al., 2013), or using Gaussian noise augmentation (Branco et al., 2017). A bagging-based ensemble method that incorporates multiple data pre-processing steps has also been introduced (Branco et al., 2018). However, these methods have several intrinsic drawbacks. First, they fail to take the distance between targets into account, and rather heuristically divide the dataset into rare and frequent sets, then plug in classification-based methods. Moreover, modern data is of extremely high dimension (e.g., images and physiological signals); linear interpolation of two such samples does not lead to meaningful new synthetic samples. Our methods are intrinsically different from past work in their approach. They can be combined with existing methods to improve their performance, as we show in Sec. 4. Further, our approaches are tested on large-scale real-world datasets in computer vision, NLP, and healthcare.
3. Methods
**Problem Setting.** Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a training set, where $x_i \in \mathbb{R}^d$ denotes the input and $y_i \in \mathbb{R}$ is the label, which is a continuous target. We introduce an additional structure for the label space $\mathcal{Y}$, where we divide $\mathcal{Y}$ into $B$ groups (bins) with equal intervals, i.e., $[y_0, y_1), [y_1, y_2), \ldots, [y_{B-1}, y_B)$. Throughout the paper, we use $b \in \mathcal{B}$ to denote the group index of the target value, where $\mathcal{B} = \{1, \ldots, B\}$ is the index space. In practice, the defined bins reflect a minimum resolution we care about when grouping data in a regression task. For instance, in age estimation, we could define $\delta y \triangleq y_{b+1} - y_b = 1$, showing that a minimum age difference of 1 is of interest. Finally, we denote $z = f(x; \theta)$ as the feature for $x$, where $f(\cdot\,; \theta)$ is parameterized by a deep neural network model with parameter $\theta$. The final prediction $\hat{y}$ is given by a regression function $g(\cdot)$ that operates over $z$.
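To make the binning concrete, here is a minimal sketch of assigning continuous targets to equal-interval bins (e.g., 1-year bins for age estimation). The function and parameter names are our own illustration, not the paper's code:

```python
import numpy as np

def assign_bins(labels, lo=0.0, hi=100.0, bin_width=1.0):
    """Map each continuous target to the index of its equal-interval group (bin)."""
    edges = np.arange(lo, hi + bin_width, bin_width)              # bin boundaries
    idx = np.clip(np.digitize(labels, edges) - 1, 0, len(edges) - 2)
    return idx, edges

ages = np.array([23.4, 25.0, 70.2, 3.8])
bin_idx, _ = assign_bins(ages)   # -> array([23, 25, 70, 3])
```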
3.1. Label Distribution Smoothing
We start by showing an example to demonstrate the difference between classification and regression when imbalance comes into the picture.
Motivating Example. We employ two datasets: (1) CIFAR-100 (Krizhevsky et al., 2009), which is a 100-class classification dataset, and (2) the IMDB-WIKI dataset (Rothe et al., 2018), which is a large-scale image dataset for age estimation from visual appearance. The two datasets have intrinsically different label spaces: CIFAR-100 exhibits a categorical label space where the target is the class index, while IMDB-WIKI has a continuous label space where the target is age. We limit the age range to 0 ∼ 99 so that the two datasets have the same label range, and subsample them to simulate data imbalance, while ensuring they have exactly the same label density distribution (Fig. 2). We make both test sets balanced. We then train a plain ResNet-50 model on the two datasets, and plot their test error distributions.
We observe from Fig. 2(a) that the error distribution correlates with the label density distribution. Specifically, the test error as a function of class index has a high negative Pearson correlation with the label density distribution (i.e., -0.76) in the categorical label space. The phenomenon is expected, as majority classes with more samples are better learned than minority classes. Interestingly however, as Fig. 2(b) shows, the error distribution is very different for IMDB-WIKI with its continuous label space, even when the label density distribution is the same as CIFAR-100. In particular, the error distribution is much smoother and no longer correlates well with the label density distribution (-0.47).
This example is interesting because all imbalanced learning methods, directly or indirectly, operate by compensating for the imbalance in the empirical label density distribution. This works well for class imbalance, but for continuous labels the empirical density does not accurately reflect the imbalance as seen by the neural network. Hence, compensating for data imbalance based on the empirical label density is inaccurate for the continuous label space.
An empirical distribution is one for which each possible event is assigned a probability derived from experimental observation. It is assumed that the events are independent and that the probabilities sum to 1. (Translator's note: in other words, this is simply the directly observed label density.)
LDS for Imbalanced Data Density Estimation. The above example shows that, in the continuous case, the empirical label distribution does not reflect the real label density distribution. This is because of the dependence between data samples at nearby labels (e.g., images of close ages). In fact, there is a significant literature in statistics on how to estimate the expected density in such cases (Parzen, 1962). Thus, Label Distribution Smoothing (LDS) advocates the use of kernel density estimation to learn the effective imbalance in datasets that correspond to continuous targets.
LDS convolves a symmetric kernel with the empirical density distribution to extract a kernel-smoothed version that accounts for the overlap in information of data samples of nearby labels. A symmetric kernel is any kernel $k(y, y')$ that satisfies $k(y, y') = k(y', y)$ and $\nabla_y k(y, y') + \nabla_{y'} k(y', y) = 0$ for all $y, y' \in \mathcal{Y}$. Note that a Gaussian or a Laplacian kernel is a symmetric kernel, while $k(y, y') = y y'$ is not. The symmetric kernel characterizes the similarity between target values $y'$ and any $y$ w.r.t. their distance in the target space. Thus, LDS computes the effective label density distribution as:
$$\tilde{p}(y') \triangleq \int_{\mathcal{Y}} k(y, y')\, p(y)\, dy \qquad (1)$$

where $p(y)$ is the number of appearances of label $y$ in the training data, and $\tilde{p}(y')$ is the effective density of label $y'$.
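As a rough illustration, here is a numpy-only sketch of Eq. (1) under the assumption of a Gaussian smoothing kernel window; it is not the authors' implementation, only the idea of convolving the per-bin label counts with a symmetric kernel:

```python
import numpy as np

def gaussian_kernel_window(ks=5, sigma=2.0):
    """Symmetric Gaussian kernel window of size ks, normalized to sum to 1."""
    half = (ks - 1) // 2
    x = np.arange(-half, half + 1)
    w = np.exp(-x ** 2 / (2 * sigma ** 2))
    return w / w.sum()

def lds_effective_density(bin_counts, ks=5, sigma=2.0):
    """Smooth the empirical per-bin label counts p(y) into the effective density (Eq. 1)."""
    kernel = gaussian_kernel_window(ks, sigma)
    return np.convolve(np.asarray(bin_counts, dtype=float), kernel, mode="same")

emp = np.array([0, 1, 3, 40, 120, 90, 10, 2, 0, 0])   # empirical label histogram
eff = lds_effective_density(emp)                       # kernel-smoothed effective density
```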
Fig. 3 illustrates LDS and how it smooths the label density distribution. Further, it shows that the resulting label density computed by LDS correlates well with the error distribution (-0.83). This demonstrates that LDS captures the real imbalance that affects regression problems.
Now that the effective label density is available, techniques for addressing class imbalance problems can be directly adapted to the DIR context. For example, a straightforward adaptation is the cost-sensitive re-weighting method, where we re-weight the loss function by multiplying it by the inverse of the LDS-estimated label density for each target. We show in Sec. 4 that LDS can be seamlessly incorporated with a wide range of techniques to boost DIR performance.
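For instance, a hedged sketch of that cost-sensitive adaptation, reusing an effective density `eff` and per-sample bin indices `bin_idx` as in the earlier sketches (illustrative names, not the paper's API):

```python
import numpy as np

eff = np.array([1.0, 4.0, 33.0, 74.0, 85.0, 63.0, 27.0, 3.0, 0.4, 0.2])  # example effective density per bin
bin_idx = np.array([4, 4, 3, 7])                                          # bins of a small batch

# Inverse of the LDS-estimated density as per-sample loss weights.
weights = 1.0 / np.clip(eff[bin_idx], 1e-6, None)
weights = weights / weights.mean()       # normalize so the average weight is 1

# Weighted L1 loss for predictions `preds` against targets `targets`:
# loss = np.mean(weights * np.abs(preds - targets))
```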
3.2. Feature Distribution Smoothing
We are motivated by the intuition that continuity in the target space should create a corresponding continuity in the feature space. That is, if the model works properly and the data is balanced, one expects the feature statistics corresponding to nearby targets to be close to each other.
Motivating Example. We use an illustrative example to highlight the impact of data imbalance on feature statistics in DIR. Again, we use a plain model trained on the images in the IMDB-WIKI dataset to infer a person's age from visual appearance. We focus on the learned feature space, i.e., $z$. We use a minimum bin size of 1, i.e., $y_{b+1} - y_b = 1$, and group features with the same target value into the same bin. We then compute the feature statistics (i.e., mean and variance) with respect to the data in each bin, which we denote as $\{\mu_b, \sigma_b\}$. To visualize the similarity between feature statistics, we select an anchor bin $b_0$, and calculate the cosine similarity of the feature statistics between $b_0$ and all other bins. The results are summarized in Fig. 4 for $b_0 = 30$. The figure also shows the regions with different data densities using the colors purple, yellow, and pink.
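A diagnostic of this kind could be reproduced roughly as follows (an assumed helper, not code from the paper), computing per-bin means and variances of the learned features and their cosine similarity to an anchor bin:

```python
import numpy as np

def stats_similarity(features, bin_idx, anchor=30):
    """features: (N, d) learned features z; bin_idx: (N,) bin index of each sample.
    Assumes the anchor bin contains at least one sample."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    bins = np.unique(bin_idx)
    mu = {b: features[bin_idx == b].mean(axis=0) for b in bins}
    var = {b: features[bin_idx == b].var(axis=0) for b in bins}
    # (similarity of means, similarity of variances) w.r.t. the anchor bin
    return {int(b): (cos(mu[anchor], mu[b]), cos(var[anchor], var[b])) for b in bins}
```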
Fig. 4 shows that the feature statistics around $b_0 = 30$ are highly similar to their values at $b_0$. Specifically, the cosine similarity of the feature mean and feature variance for all bins between age 25 and 35 is within a few percent of their values at age 30 (the anchor age). Further, the similarity gets higher for tighter ranges around the anchor. Note that bin 30 falls in the high-shot region; in fact, it is among the few bins with the most samples. So the figure confirms the intuition that when there is enough data, and for continuous targets, the feature statistics are similar for nearby bins. Interestingly, the figure also shows the problem with regions that have very few data samples, like the age range 0 to 6 years (shown in pink). Note that the mean and variance in this range show unexpectedly high similarity to age 30. In fact, it is striking that the feature statistics at age 30 are more similar to those at age 1 than at age 17. This unjustified similarity is due to data imbalance. Specifically, since there are not enough images for ages 0 to 6, this range inherits its priors from the range with the maximum amount of data, which is the range around age 30.
(Note: many-shot region = bins with over 100 training samples; medium-shot region = bins with 20∼100 training samples; few-shot region = bins with under 20 training samples.)
FDS Algorithm. Inspired by these observations, we propose feature distribution smoothing (FDS), which performs distribution smoothing on the feature space, i.e., transfers the feature statistics between nearby target bins. This procedure aims to calibrate the potentially biased estimates of the feature distribution, especially for underrepresented target values (e.g., medium- and few-shot groups) in the training data. FDS is performed by first estimating the statistics of each bin. Without loss of generality, we substitute the variance with the covariance to also reflect the relationship between the feature elements within $z$:

$$\mu_b = \frac{1}{N_b} \sum_{i=1}^{N_b} z_i \qquad (2)$$

$$\Sigma_b = \frac{1}{N_b - 1} \sum_{i=1}^{N_b} \left(z_i - \mu_b\right)\left(z_i - \mu_b\right)^{\top} \qquad (3)$$
where $N_b$ is the total number of samples in the $b$-th bin. Given the feature statistics, we again employ a symmetric kernel $k(y_b, y_{b'})$ to smooth the distribution of the feature mean and covariance over the target bins $\mathcal{B}$. This results in a smoothed version of the statistics:

$$\tilde{\mu}_b = \sum_{b' \in \mathcal{B}} k(y_b, y_{b'})\, \mu_{b'} \qquad (4)$$

$$\tilde{\Sigma}_b = \sum_{b' \in \mathcal{B}} k(y_b, y_{b'})\, \Sigma_{b'} \qquad (5)$$
With both $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$ and $\{\mu_b, \Sigma_b\}$, we then follow the standard whitening and re-coloring procedure (Sun et al., 2016) to calibrate the feature representation for each input sample:

$$\tilde{z} = \tilde{\Sigma}_b^{\frac{1}{2}}\, \Sigma_b^{-\frac{1}{2}} \left(z - \mu_b\right) + \tilde{\mu}_b \qquad (6)$$
We integrate FDS into deep networks by inserting a feature calibration layer after the final feature map. To train the model, we employ a momentum update of the running statistics $\{\mu_b, \Sigma_b\}$ across each epoch. Correspondingly, the smoothed statistics $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$ are updated across different epochs but fixed within each training epoch. The momentum update, which performs an exponential moving average (EMA) of the running statistics, results in more stable and accurate estimation of the feature statistics during training. The calibrated features $\tilde{z}$ are then passed to the final regression function and used to compute the loss.
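Below is a simplified, numpy-only sketch of the calibration step (Eqs. 2-6). For brevity it uses diagonal per-dimension variances instead of full covariance matrices and omits the EMA momentum update; names and shapes are our assumptions, not the paper's implementation, where this logic sits as a calibration layer after the final feature map.

```python
import numpy as np

def fds_calibrate(z, bin_idx, kernel):
    """z: (N, d) features; bin_idx: (N,) target bin per sample;
    kernel: (B, B) symmetric smoothing matrix with kernel[b, b'] = k(y_b, y_{b'})."""
    B, d = kernel.shape[0], z.shape[1]
    mu, var = np.zeros((B, d)), np.ones((B, d))
    for b in range(B):                       # per-bin statistics (Eqs. 2-3, diagonal form)
        zb = z[bin_idx == b]
        if len(zb) > 0:
            mu[b], var[b] = zb.mean(axis=0), zb.var(axis=0) + 1e-6
    mu_s = kernel @ mu                        # smoothed means (Eq. 4)
    var_s = kernel @ var                      # smoothed variances (Eq. 5)
    # Whiten each feature with its own bin's statistics, re-color with the smoothed ones (Eq. 6).
    return np.sqrt(var_s[bin_idx] / var[bin_idx]) * (z - mu[bin_idx]) + mu_s[bin_idx]
```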
We note that FDS can be integrated with any neural network model, as well as any past work on improving label imbalance. In Sec. 4, we integrate FDS with a variety of prior techniques for addressing data imbalance, and demonstrate that it consistently improves performance.
4. Benchmarking DIR
Datasets. We curate five DIR benchmarks that span computer vision, natural language processing, and healthcare. Fig. 6 shows the label density distribution of these datasets and their levels of imbalance.
- IMDB-WIKI-DIR (age): We construct IMDB-WIKI-DIR from the IMDB-WIKI dataset (Rothe et al., 2018), which contains 523.0K face images and the corresponding ages. We filter out unqualified images, and manually construct balanced validation and test sets over the supported ages. The length of each bin is 1 year, with a minimum age of 0 and a maximum age of 186. The number of images per bin varies between 1 and 7,149, exhibiting significant data imbalance. Overall, the curated dataset has 191.5K images for training and 11.0K images for validation and testing.
- AgeDB-DIR (age): AgeDB-DIR is constructed in a similar manner from the AgeDB dataset (Moschoglou et al., 2017). It contains 12.2K images for training, with a minimum age of 0 and a maximum age of 101, and a maximum bin density of 353 images and minimum bin density of 1. The validation and test sets are balanced, with 2.1K images.
- STS-B-DIR (text similarity score): We construct STS-B-DIR from the Semantic Textual Similarity Benchmark (Cer et al., 2017; Wang et al., 2018), which is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is annotated by multiple annotators with an averaged continuous similarity score from 0 to 5. From the original training set of 7.2K pairs, we create a training set with 5.2K pairs, and balanced validation and test sets of 1K pairs each. The length of each bin is 0.1.
- NYUD2-DIR (depth): We create NYUD2-DIR based on the NYU Depth Dataset V2 (Nathan Silberman & Fergus, 2012), which provides images and depth maps for different indoor scenes. The depth maps have an upper bound of 10 meters and we set the bin length to 0.1 meter. Following standard practices (Bhat et al., 2020; Hu et al., 2019), we use 50K images for training and 654 images for testing. We randomly select 9,357 test pixels for each bin to make the test set balanced.
- SHHS-DIR (health condition score): We create SHHS-DIR based on the SHHS dataset (Quan et al., 1997), which contains full-night Polysomnography (PSG) from 2,651 subjects. Available PSG signals include Electroencephalography (EEG), Electrocardiography (ECG), and breathing signals (airflow, abdomen, and thorax), which are used as inputs. The dataset also includes the 36-Item Short Form Health Survey (SF-36) (Ware Jr & Sherbourne, 1992) for each subject, from which a General Health score is extracted. The score is used as the target value, with a minimum score of 0 and a maximum of 100.
Network Architectures. We employ ResNet-50 (He et al., 2016) as our backbone network for IMDB-WIKI-DIR and AgeDB-DIR. Following (Wang et al., 2018), we adopt the same BiLSTM + GloVe word embeddings baseline for STS-B-DIR. For NYUD2-DIR, we use the ResNet-50-based encoder-decoder architecture introduced in (Hu et al., 2019). Finally, for SHHS-DIR, we use the same CNN-RNN architecture with ResNet blocks for PSG signals as in (Wang et al., 2019).
Baselines. Since the literature has only a few proposals for DIR, in addition to past work on imbalanced regression (Branco et al., 2017; Torgo et al., 2013), we adapt a few imbalanced classification methods for regression, and propose a strong set of baselines. Below, we describe the baselines and how LDS can be combined with each method. FDS can be directly integrated with any baseline as a calibration layer, as described in Sec. 3.2.
- **Vanilla model:** We use the term VANILLA to denote a model that does not include any technique for dealing with imbalanced data. To combine the vanilla model with LDS, we re-weight the loss function by multiplying it by the inverse of the LDS-estimated density for each target bin.
- **Synthetic samples:** We choose existing methods for imbalanced regression, including **SMOTER** (Torgo et al., 2013) and **SMOGN** (Branco et al., 2017). SMOTER first defines frequent and rare regions using the original label density, and creates synthetic samples for the pre-defined rare regions by linearly interpolating both inputs and targets. SMOGN further adds Gaussian noise to SMOTER. We note that LDS can be directly used for a better estimation of the label density when dividing the target space.
- **Error-aware loss:** Inspired by the Focal loss (Lin et al., 2017) for classification, we propose a regression version called Focal-R, where the scaling factor is replaced by a continuous function that maps the absolute error into [0, 1]. Precisely, the Focal-R loss based on the L1 distance can be written as $\frac{1}{n}\sum_{i=1}^{n}\sigma(|\beta e_i|)^{\gamma}\, e_i$, where $e_i$ is the L1 error for the $i$-th sample, $\sigma(\cdot)$ is the sigmoid function, and $\beta, \gamma$ are hyper-parameters (a minimal sketch is given after this list). To combine Focal-R with LDS, we multiply the loss by the inverse frequency of the estimated label density.
- **Two-stage training:** Following (Kang et al., 2020), where the feature extractor and classifier are decoupled and trained in two stages, we propose a regression version called regressor re-training (RRT): in the first stage we train the encoder normally, and in the second stage we freeze the encoder and re-train the regressor with inverse re-weighting. When adding LDS, the re-weighting in the second stage is based on the label density estimated through LDS.
- **Cost-sensitive re-weighting:** Since we divide the target space into finite bins, classic re-weighting methods can be directly plugged in. We adopt two re-weighting schemes based on the label distribution: inverse-frequency weighting (INV) and its square-root weighting variant (SQINV). When combining with LDS, instead of using the original label density, we use the LDS-estimated target density.
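A minimal PyTorch-style sketch of the Focal-R (L1) loss as written above; the default values of `beta` and `gamma` are illustrative assumptions rather than the paper's reference implementation:

```python
import torch

def focal_r_l1_loss(pred, target, beta=0.2, gamma=1.0):
    """Focal-R with L1 distance: mean of sigmoid(|beta * e_i|)^gamma * e_i."""
    e = torch.abs(pred - target)               # per-sample L1 error
    scale = torch.sigmoid(beta * e) ** gamma   # continuous scaling factor
    return (scale * e).mean()

# loss = focal_r_l1_loss(model(x), y)
```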
Evaluation Process and Metrics. Following (Liu et al., 2019), we divide the target space into three disjoint subsets: the many-shot region (bins with over 100 training samples), the medium-shot region (bins with 20∼100 training samples), and the few-shot region (bins with under 20 training samples), and report results on these subsets as well as overall performance. We also refer to regions with no training samples as zero-shot, and investigate the ability of our techniques to generalize to zero-shot regions in Sec. 4.2. For metrics, we use common metrics for regression, such as the mean absolute error (MAE), mean squared error (MSE), and Pearson correlation. We further propose another metric, called the error Geometric Mean (GM), defined as $\big(\prod_{i=1}^{n} e_i\big)^{1/n}$, for better prediction fairness.
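The per-region reporting and the GM metric could be computed along these lines (an illustrative sketch; the thresholds follow the definition above, while the helper names are ours):

```python
import numpy as np

def shot_region(train_counts, test_bin_idx):
    """Label each test sample's bin as many/medium/few-shot from its training count."""
    c = train_counts[test_bin_idx]
    return np.where(c > 100, "many", np.where(c >= 20, "medium", "few"))

def mae(err):
    return float(np.mean(err))

def gm(err, eps=1e-10):
    """Error geometric mean, (prod e_i)^(1/n), computed in log space for stability."""
    return float(np.exp(np.mean(np.log(err + eps))))

# errors = np.abs(preds - targets)
# regions = shot_region(train_counts, test_bin_idx)
# report = {r: (mae(errors[regions == r]), gm(errors[regions == r]))
#           for r in ("many", "medium", "few")}
```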
4.1. Main Results
We report the main results in this section for all DIR datasets. All training details, hyper-parameter settings, and additional results are provided in Appendix C and D.
Inferring Age from Images: IMDB-WIKI-DIR & AgeDB-DIR. We report the performance of different methods in Tables 1 and 2, respectively. For each dataset, we group the baselines into four sections to reflect their different strategies. First, as both tables indicate, when applied to modern high-dimensional data like images, SMOTER and SMOGN can actually degrade the performance in comparison to the vanilla model. Moreover, within each group, adding either LDS, FDS, or both leads to performance gains, while LDS + FDS often achieves the best results. Finally, when compared to the vanilla model, using our LDS and FDS maintains or slightly improves the performance overall and on the many-shot regions, while substantially boosting the performance for the medium-shot and few-shot regions.
Inferring Text Similarity Score: STS-B-DIR. Table 3 shows the results, where similar observations can be made on STS-B-DIR. Again, both SMOTER and SMOGN perform worse than the vanilla model. In contrast, both LDS and FDS consistently and substantially improve the results for various methods, especially in the medium- and few-shot regions. The advantage is even more profound under Pearson correlation, which is commonly used for this NLP task.
Inferring Depth: NYUD2-DIR. For NYUD2-DIR, which is a dense regression task, we verify from Table 4 that adding LDS and FDS significantly improves the results. We note that the vanilla model can inevitably overfit to the many-shot regions during training. FDS and LDS help alleviate this effect and generalize better to all regions, with minor degradation in the many-shot region but significant boosts for the other regions.
Inferring Health Score: SHHS-DIR. Table 5 reports the results on SHHS-DIR. Since SMOTER and SMOGN are not directly applicable to this medical data, we skip them for this dataset. The results again confirm the effectiveness of both FDS and LDS when applied to real-world imbalanced regression tasks, where by combining FDS and LDS we often get the highest gains over all tested regions.
4.2. Further Analysis
Extrapolation & Interpolation. In real-world DIR tasks, certain target values can have no data at all (e.g., see SHHS-DIR and STS-B-DIR in Fig. 6). This motivates the need for target extrapolation and interpolation. We curate a subset of the training set of IMDB-WIKI-DIR that has no training data in certain regions (Fig. 7), but evaluate on the original test set for zero-shot generalization analysis.
As Table 6 shows, compared to the vanilla model, LDS and FDS not only improve the results on regions that have data, but also achieve larger gains on those without data. Specifically, substantial improvements are established for both target interpolation and extrapolation, where interpolation enjoys larger boosts.
We further visualize the absolute MAE gains of our method over the vanilla model in Fig. 7. Our method provides a comprehensive treatment to the many-, medium-, few-, as well as zero-shot regions, achieving remarkable performance gains.
Figure 8. **Analysis of how FDS works.** **(a) & (b)** Feature statistics similarity for anchor age 0, using models trained without and with FDS. **(c)** $L_1$ distance between the running statistics $\{\mu_b, \Sigma_b\}$ and the smoothed statistics $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$ during training.
Understanding FDS. We investigate how FDS influences the feature statistics. In Figs. 8(a) and 8(b) we plot the similarity of the feature statistics for anchor age 0, using models trained without and with FDS. As the figure indicates, since age 0 lies in the few-shot region, the feature statistics can have a large bias, i.e., age 0 shares large similarity with the region 40 ∼ 80, as in Fig. 8(a). In contrast, when FDS is added, the statistics are better calibrated, resulting in high similarity only within its neighborhood, and a gradually decreasing similarity score as the target value becomes larger. We further visualize the $L_1$ distance between the running statistics $\{\mu_b, \Sigma_b\}$ and the smoothed statistics $\{\tilde{\mu}_b, \tilde{\Sigma}_b\}$ during training in Fig. 8(c). Interestingly, the average $L_1$ distance becomes smaller and gradually diminishes as training evolves, indicating that the model learns to generate features that are more accurate even without smoothing, so that the smoothing module can be removed during inference. We provide more results for different anchor ages in Appendix E.7, where similar effects can be observed.
Ablation: Kernel type for LDS & FDS (Appendix E.1). We study the effects of different kernel types for LDS and FDS when applying distribution smoothing. We select three different kernel types, i.e., the Gaussian, Laplacian, and Triangular kernels, and evaluate their influence on both LDS and FDS. In general, all kernel types lead to notable gains (e.g., 3.7% ∼ 6.2% relative MSE gains on STS-B-DIR), with the Gaussian kernel often delivering the best results.
Ablation: Different regression loss functions (Appendix E.2). We investigate the influence of different training loss functions on LDS and FDS. We select three common losses used for regression tasks, i.e., the $L_1$ loss, the MSE loss, and the Huber loss (also referred to as the smoothed $L_1$ loss). We find that similar results are obtained for all losses, indicating that both LDS and FDS are robust to different loss functions.
Ablation: Hyper-parameters for LDS & FDS (Appendix E.3). We investigate the effects of hyper-parameters on both LDS and FDS. As we mainly employ the Gaussian kernel for distribution smoothing, we extensively study different choices of the kernel size $l$ and standard deviation $\sigma$. Interestingly, we find LDS and FDS are surprisingly robust to different hyper-parameters within a given range, and obtain similar gains. For example, on STS-B-DIR, overall MSE gains range from 3.3% to 6.2% across the studied choices of $l$ and $\sigma$.
Ablation: Robustness to diverse skewed label densities (Appendix E.4). We curate different imbalanced distributions for IMDB-WIKI-DIR by combining different numbers of disjoint skewed Gaussian distributions over the target space, with potentially missing data in certain target regions, and evaluate the robustness of FDS and LDS to the distribution change. We verify that even under different imbalanced label distributions, LDS and FDS consistently boost the performance across all regions compared to the vanilla model, with relative MAE gains ranging from 8.8% to 12.4%.
Comparisons to imbalanced classification methods (Appendix E.6). Finally, to gain more insight into the intrinsic difference between imbalanced classification and imbalanced regression, we directly apply existing imbalanced classification schemes to several appropriate DIR datasets, and show empirical comparisons with imbalanced regression approaches. We demonstrate in Appendix E.6 that LDS and FDS outperform imbalanced classification schemes by a large margin, where the errors for few-shot regions can be reduced by up to 50% to 60%. Interestingly, the results also show that imbalanced classification schemes often perform even worse than the vanilla regression model, which confirms that regression requires different approaches to data imbalance than simply applying classification methods. We note that imbalanced classification methods can fail on regression problems for several reasons. First, they ignore the similarity between data samples that are close w.r.t. the continuous target. Moreover, classification cannot extrapolate or interpolate in the continuous label space, and is therefore unable to deal with missing data in certain target regions.
5. Conclusion
We introduce the DIR task, which learns from natural imbalanced data with continuous targets and generalizes to the entire target range. We propose two simple and effective algorithms for DIR that exploit the similarity between nearby targets in both label and feature spaces. Extensive results on five curated large-scale real-world DIR benchmarks confirm the superior performance of our methods. Our work fills the gap in benchmarks and techniques for practical DIR tasks.
Supplementary Material
A. Pseudo Code for LDS & FDS
B. Details of DIR Datasets
In this section, we provide the detailed information of the five curated DIR datasets we used in our experiments. Table 7 provides an overview of the five datasets.
C. Experimental Settings
D. Additional Results
We provide complete evaluation results on the five DIR datasets, where more baselines and evaluation metrics are included in addition to the reported results in the main paper.
E. Further Analysis and Ablation Studies
E.1. Kernel Type for LDS & FDS
E.2. Training Loss for LDS & FDS
E.3. Hyper-parameters for LDS & FDS
... (The remaining appendix sections are omitted here; please refer to the original paper.)