Transfer Learning: Energy-based Out-of-distribution Detection
Paper information
Title: Energy-based Out-of-distribution Detection
Authors: Weitang Liu, XiaoYun Wang, John D. Owens, Yixuan Li
Venue: NeurIPS 2020
Paper: download
Code: download
1 Preliminaries
An energy-based model (EBM) builds a function $E(\mathbf{x}): \mathbb{R}^{D} \rightarrow \mathbb{R}$ that maps each point $\mathbf{x}$ of the input space to a single scalar called the energy.
Through the Gibbs distribution, a collection of energy values can be turned into a conditional probability $p(y \mid \mathbf{x})$:
$p(y \mid \mathbf{x})=\frac{e^{-E(\mathbf{x}, y) / T}}{\int_{y^{\prime}} e^{-E\left(\mathbf{x}, y^{\prime}\right) / T}}=\frac{e^{-E(\mathbf{x}, y) / T}}{e^{-E(\mathbf{x}) / T}}\quad\quad\quad(1)$
The Helmholtz free energy $E(\mathbf{x})$ of a data point $\mathbf{x} \in \mathbb{R}^{D}$ is the negative log of the partition function:
$E(\mathbf{x})=-T \cdot \log \int_{y^{\prime}} e^{-E\left(\mathbf{x}, y^{\prime}\right) / T} \quad\quad\quad(2)$
Recall the categorical distribution produced by the $\text{softmax}$ function:
$p(y \mid \mathbf{x})=\frac{e^{f_{y}(\mathbf{x}) / T}}{\sum\limits _{i}^{K} e^{f_{i}(\mathbf{x}) / T}} \quad\quad\quad(3)$
Comparing $\text{Eq.1}$ and $\text{Eq.3}$, we can identify:
$E(\mathbf{x}, y)=-f_{y}(\mathbf{x})$
Likewise, the free energy $E(\mathbf{x} ; f)$ of $\mathbf{x} \in \mathbb{R}^{D}$ can be expressed in terms of the logits:
$E(\mathbf{x} ; f)=-T \cdot \log \sum\limits _{i}^{K} e^{f_{i}(\mathbf{x}) / T} \quad\quad\quad(4)$
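The free energy in Eq. 4 is just a negated, temperature-scaled logsumexp over the classifier's logits. A minimal sketch in PyTorch, with toy logit values chosen purely for illustration:

```python
import torch

def free_energy(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Free energy E(x; f) = -T * log sum_i exp(f_i(x) / T)  (Eq. 4)."""
    return -T * torch.logsumexp(logits / T, dim=1)

# toy logits for a batch of 2 samples, K = 3 classes
logits = torch.tensor([[5.0, 1.0, 0.5],
                       [0.1, 0.2, 0.0]])
E = free_energy(logits)
print(E)  # the confident first sample gets lower (more negative) energy
```

Lower energy corresponds to higher likelihood under the model, which is exactly the property the detector below exploits.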
2 Energy-based Out-of-distribution Detection
The density of $\mathbf{x}$ itself can likewise be written with the free energy:
$p(\mathbf{x})=\frac{e^{-E(\mathbf{x} ; f) / T}}{\int_{\mathbf{x}} e^{-E(\mathbf{x} ; f) / T}}\quad\quad\quad(5)$
Note: the higher the free energy, the lower the probability density;
Taking the logarithm of the expression above:
$\log p(\mathbf{x})=-E(\mathbf{x} ; f) / T-\underbrace{\log Z}_{\text {constant for all } \mathbf{x}} \quad\quad\quad(6)$
Note: low energy means high likelihood (ID), high energy means low likelihood (OOD);
OOD samples are detected by thresholding the free energy at $\tau$:
$g(\mathbf{x} ; \tau, f)=\left\{\begin{array}{ll}0 & \text { if }-E(\mathbf{x} ; f) \leq \tau \\1 & \text { if }-E(\mathbf{x} ; f)>\tau\end{array}\right. \quad\quad(7)$
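A minimal sketch of the detector in Eq. 7, scoring each sample by the negative free energy $-E(\mathbf{x}; f)$; the threshold and logit values here are arbitrary, chosen only for illustration:

```python
import torch

def ood_detector(logits: torch.Tensor, tau: float, T: float = 1.0) -> torch.Tensor:
    """Eq. 7: returns 1 (in-distribution) when the score -E(x; f) exceeds tau,
    else 0 (OOD)."""
    neg_energy = T * torch.logsumexp(logits / T, dim=1)  # -E(x; f)
    return (neg_energy > tau).long()

# toy logits: one confident sample, one near-uniform sample
logits = torch.tensor([[5.0, 1.0, 0.5],
                       [0.1, 0.2, 0.0]])
print(ood_detector(logits, tau=3.0))  # tensor([1, 0])
```

In practice $\tau$ is chosen on held-out in-distribution data, e.g. so that 95% of ID samples are classified as in-distribution.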
Recall the negative log-likelihood loss:
$\mathcal{L}_{\mathrm{nll}}=\mathbb{E}_{(\mathbf{x}, y) \sim P^{\text {in }}}\left(-\log \frac{e^{f_{y}(\mathbf{x}) / T}}{\sum_{j=1}^{K} e^{f_{j}(\mathbf{x}) / T}}\right)\quad\quad(8)$
Since $E(\mathbf{x}, y)=-f_{y}(\mathbf{x})$, the $\text{NLL}$ loss can be rewritten as:
$\mathcal{L}_{\text {nll }}=\mathbb{E}_{(\mathbf{x}, y) \sim P^{\text {in }}}\left(\frac{1}{T} \cdot E(\mathbf{x}, y)+\log \sum_{j=1}^{K} e^{-E(\mathbf{x}, j) / T}\right) \quad\quad(9)$
The first term pushes down the energy of the $\text{ground truth}$ label $y$, while the second term pulls up the energies of the other labels, as the gradient makes explicit:
$\begin{aligned}\frac{\partial \mathcal{L}_{\mathrm{nll}}(\mathbf{x}, y ; \theta)}{\partial \theta} & =\frac{1}{T} \frac{\partial E(\mathbf{x}, y)}{\partial \theta}-\frac{1}{T} \sum_{j=1}^{K} \frac{\partial E(\mathbf{x}, j)}{\partial \theta} \frac{e^{-E(\mathbf{x}, j) / T}}{\sum_{k=1}^{K} e^{-E(\mathbf{x}, k) / T}} \\& =\frac{1}{T}\Big(\underbrace{\frac{\partial E(\mathbf{x}, y)}{\partial \theta}\left(1-p(Y=y \mid \mathbf{x})\right)}_{\downarrow \text { push down energy for } y}-\underbrace{\sum_{j \neq y} \frac{\partial E(\mathbf{x}, j)}{\partial \theta} p(Y=j \mid \mathbf{x})}_{\uparrow \text { pull up energy for other labels }}\Big) .\end{aligned}\quad\quad(10)$
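The rewriting of Eq. 8 into Eq. 9 can be checked numerically; a small sanity check with random logits (not from the paper):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 1.0
logits = torch.randn(4, 10)                      # f(x) for a batch, K = 10
y = torch.randint(0, 10, (4,))

# Eq. 8: standard cross-entropy / NLL
nll = F.cross_entropy(logits, y)

# Eq. 9: the same loss written with energies E(x, j) = -f_j(x)
E_xy = -logits[torch.arange(4), y]               # E(x, y)
nll_energy = (E_xy / T + torch.logsumexp(logits / T, dim=1)).mean()

print(torch.allclose(nll, nll_energy, atol=1e-6))  # True
```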
3 Energy Score vs. Softmax Score
$\begin{aligned} \underset{y}{\text{max}} p(y \mid \mathbf{x}) & =\max _{y} \frac{e^{f_{y}(\mathbf{x})}}{\sum_{i} e^{f_{i}(\mathbf{x})}}=\frac{e^{f^{\max }(\mathbf{x})}}{\sum_{i} e^{f_{i}(\mathbf{x})}} \\& =\frac{1}{\sum_{i} e^{f_{i}(\mathbf{x})-f^{\max }(\mathbf{x})}}\end{aligned} \quad\quad(11)$
Substituting $\text{Eq.6}$ and setting $T=1$:
$\log \max _{y} p(y \mid \mathbf{x})=-\log p(\mathbf{x})+\underbrace{f^{\max }(\mathbf{x})-\log Z}_{\text {not constant; larger for in-distribution } \mathbf{x}}\quad\quad\quad(13)$
The last two terms $f^{\max }(\mathbf{x})-\log Z$ are not constant in $\mathbf{x}$. On the contrary, for in-distribution samples the expected negative log-likelihood is smaller, while the classification confidence $f^{\max }(\mathbf{x})$ is trained to be as large as possible, so the two terms shift in opposite directions. This partly explains the weakness of methods based on softmax confidence.
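A small numerical illustration of this mismatch: shifting all logits by a constant leaves the softmax score unchanged but moves the energy score, so the softmax score discards exactly the overall-magnitude information that relates to $\log p(\mathbf{x})$ (toy values):

```python
import torch

# two inputs whose logits differ only by a constant shift
f_a = torch.tensor([[10.0, 1.0, 1.0]])
f_b = f_a - 5.0                       # same softmax distribution, lower logits

softmax_a = f_a.softmax(dim=1).max().item()
softmax_b = f_b.softmax(dim=1).max().item()
energy_a = -torch.logsumexp(f_a, dim=1).item()
energy_b = -torch.logsumexp(f_b, dim=1).item()

print(softmax_a, softmax_b)   # identical: the shift cancels out
print(energy_a, energy_b)     # differ by 5: energy keeps the magnitude
```

The softmax score would treat both inputs as equally confident, while the energy score ranks the second one as far more OOD-like.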
4 Energy-Bounded Learning
With the same model, the energy score already outperforms the softmax score, and targeted fine-tuning improves it further. The authors therefore propose an energy-bounded objective for fine-tuning the network:
$\min _{\theta} \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_{\text {in }}^{\text {train }}}\left[-\log F_{y}(\mathbf{x})\right]+\lambda \cdot L_{\text {energy }}\quad\quad\quad(14)$
The first term is the standard cross-entropy loss on the in-distribution training data; the second is an energy-based regularizer:
$\begin{aligned}L_{\text {energy }}=\ &\mathbb{E}_{\left(\mathbf{x}_{\text {in }}, y\right) \sim \mathcal{D}_{\text {in }}^{\text {train }}}\left(\max \left(0, E\left(\mathbf{x}_{\text {in }}\right)-m_{\text {in }}\right)\right)^{2} \\&+\mathbb{E}_{\mathbf{x}_{\text {out }} \sim \mathcal{D}_{\text {out }}^{\text {train }}}\left(\max \left(0, m_{\text {out }}-E\left(\mathbf{x}_{\text {out }}\right)\right)\right)^{2}\end{aligned}\quad\quad\quad(15)$
It penalizes in-distribution samples whose energy exceeds $m_{\text {in }}$ and OOD samples whose energy falls below $m_{\text {out }}$, widening the energy gap between normal and anomalous data.
Overall framework:
Code:

```python
def train():
    net.train()  # enter train mode
    loss_avg = 0.0

    # start at a random point of the outlier dataset; this induces more
    # randomness without obliterating locality
    train_loader_out.dataset.offset = np.random.randint(len(train_loader_out.dataset))
    for in_set, out_set in zip(train_loader_in, train_loader_out):
        data = torch.cat((in_set[0], out_set[0]), 0)
        target = in_set[1]

        data, target = data.cuda(), target.cuda()

        # forward
        x = net(data)

        # backward
        scheduler.step()
        optimizer.zero_grad()

        loss = F.cross_entropy(x[:len(in_set[0])], target)
        # cross-entropy from softmax distribution to uniform distribution
        if args.score == 'energy':
            Ec_in = -torch.logsumexp(x[:len(in_set[0])], dim=1)
            Ec_out = -torch.logsumexp(x[len(in_set[0]):], dim=1)
            # '--m_in',  type=float, default=-25., help='margin for in-distribution; above this value will be penalized'
            # '--m_out', type=float, default=-7.,  help='margin for out-distribution; below this value will be penalized'
            loss += 0.1 * (torch.pow(F.relu(Ec_in - args.m_in), 2).mean()
                           + torch.pow(F.relu(args.m_out - Ec_out), 2).mean())
        elif args.score == 'OE':
            loss += 0.5 * -(x[len(in_set[0]):].mean(1) - torch.logsumexp(x[len(in_set[0]):], dim=1)).mean()

        loss.backward()
        optimizer.step()

        # exponential moving average
        loss_avg = loss_avg * 0.8 + float(loss) * 0.2

    state['train_loss'] = loss_avg


# test function
def test():
    net.eval()
    loss_avg = 0.0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.cuda(), target.cuda()

            # forward
            output = net(data)
            loss = F.cross_entropy(output, target)

            # accuracy
            pred = output.data.max(1)[1]
            correct += pred.eq(target.data).sum().item()

            # test loss average
            loss_avg += float(loss.data)

    state['test_loss'] = loss_avg / len(test_loader)
    state['test_accuracy'] = correct / len(test_loader.dataset)
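The test function above reports only accuracy; at inference time the same energy score is collected per sample and thresholded as in Eq. 7. A minimal sketch, assuming `net` and a loader yielding `(data, target)` pairs as in the code above:

```python
import torch

@torch.no_grad()
def energy_scores(net, loader, T: float = 1.0, device: str = "cuda"):
    """Collect the OOD score -E(x; f) for every sample in `loader`
    (a sketch; `net` and `loader` are assumed to be defined elsewhere)."""
    net.eval()
    scores = []
    for data, _ in loader:
        logits = net(data.to(device))
        scores.append(T * torch.logsumexp(logits / T, dim=1).cpu())
    return torch.cat(scores)  # compare against tau (Eq. 7) to flag OOD
```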
Work on the causes; the results will follow. Author: 图神经网络. Please cite the original link when reposting: https://www.cnblogs.com/BlairGrowing/p/17231667.html