[AAAI2024]AnomalyGPT Detecting Industrial Anomalies Using Large Vision-Language Models

本篇论文将大语言模型应用在工业异常检测（Industrial Anomaly Detection，IAD）任务。

引言

IAD任务旨在检测和定位工业产品图像中的异常。由于现实世界样本的稀有性和不可预测性，要求模型仅在正常样本上进行训练，并实现对测试时异常样本的检测。

如图1，现有的IAD方法给出异常样本的概率，但需要手动设置阈值。大视觉语言模型（Large Vision-Language Models，LVLMs）能够过给出（有时不能给出）图片的描述，但不能实现异常检测。因此本篇文章将大语言模型应用在IAD任务中，提出AnomalyGPT模型（但实际上本文于GPT模型无关）。

方法

图像特征处理

作者考虑两种IAD任务设置：无监督和few-shot。中间公共的部分：输入样本 $x \in R^{W \times H \times C}$ ，被Image Encoder提取的 $C_{1}$ 维度的特征 $F_{i m g} \in R^{C_{1}}$ ，经过线性层投影后的特征为 $E_{i m g} \in R^{C_{e m b}}$

无监督设置下，输入图片没有标签，但会提供两条文本，例如：“A photo of a normal bottle. A photo of an abnormal capsule.” 两条文本经过预训练的Text Encoder后得到特征 $F_{t e x t} \in R^{C_{t e x t}}$ 。

输入文本变化起的格式包含 state-level，对于normal/anomaly的文本分别为文本的格式如下：

normal	anomaly
c:="[o]"	c:="damaged [o]"
c:="flawless [o]"	c:="broken [o]"
...	...

Token [o]可被替换为物体名，若物体名不可用，则使用“object”代替。完整 state-level的prompt template后填入 Template-level的文本，template-level的文本格式包括：

"a cropped photo of the [c]."
"a cropped photo of a [c]."
"a close-up photo of a [c]."
...

经过文本编码器后的文本特征为 $F_{t e x t} \in R^{2 \times C_{t e x t}}$ 。 $x$ 输入图像Encoder后，对于中间4层的特征表示为 $F_{p a t c h}^{i} \in R^{H_{i} \times W_{i} \times C_{i}}$ ，再经过线性层投影得到 ${\tilde{F}}_{p a t c h}^{i} \in R^{H_{i} \times W_{i} \times C_{t e x t}}$ 。

然后将文本特征与图像特征做矩阵乘法并上采样得到Mask： $M = U p s a m p l e (\sum_{i = 1}^{4} s o f t m a x ({\tilde{F}}_{p a t c h}^{i} F_{t e x t}^{T}))$ 。

Few-shot设置下，每个类存在几个带标注的样本，图像Encoder中间4层，分别对应一个容量为 $N$ 的memory bank： $B^{i} \in R^{N \times C_{i}}$ ，然后把整个bank的特征拿来计算，求最大，并把4个结果相加上采样得到掩码mask： $M = U p s a m p l e (\sum_{i = 1}^{4} (1 - m a x (F_{p a t c h}^{i} \cdot {B^{i}}^{T})))$ 。

两种设置下，掩码大小与输入图像一致 $M \in R^{H \times W}$ 。

可学习的prompt embedding层

为了利用图像中的细粒度语义并保持 LLM 和解码器输出之间的语义一致性，引入了一个提示学习器，将定位结果 $M$ 转换为prompt embedding。prompt embedding层输出 $n_{1}$ 的向量： $E_{b a s e} \in R^{n_{1} \times C_{e m b}}$ 。

掩码经过卷积、投影变为长度为 $n_{2}$ 的向量： $M \in R^{H \times W} \to E_{d e c} \in R^{n_{2} \times C_{e m b}}$ 。再concat向量得到： $E_{p r o m p t} \in R^{(n_{1} + n_{2}) \times C_{e m b}}$ 。

用户输入图片的文本描述

为了帮助大模型更好的理解图片内容，可以输入图片描述完善以下内容：
## Human: <img> $E_{i m g}$ </img> $E_{p r o m p t}$ [Image Description ] ls there any anomaly in the image? # Assistant:

每个类含若干个描述，例如：
Bottle：This is a photo of a bottle for anomaly detection, which should be round and without any damage, flaw, defect, scratch, hole or broken part.

然而，在实际生产中，文本描述也可以不提供。作者表示，仅提供 $E_{i m g}$ ，模型也能有较好的表现。

最后，对于大模型的输出大概是“Yes, there is an anomaly in the image, at the bottom left of the image. or No, there are no anomalies in the image.”

为了让模型更好的理解位置，图片被分为9格

损失函数

用交叉熵计算模型生成的文本序列与目标文本序列之间的损失（其中 $n$ 是token的数量）：

L_{c e} = - \sum_{i = 1}^{n} y_{i} l o g (p_{i})

$y_{i}$ 是第i个token真实标签， $p_{i}$ 为第i个token的概率预测。

在IAD任务中，异常图像中的大多数区域仍然是正常的，采用Focal损失可以缓解类别不平衡的问题（其中n=H×W表示像素总数）：

L_{f o c a l} = - \frac{1}{n} \sum_{i = 1}^{n} (1 - p_{i})^{γ} l o g (p_{i}),

并额外使用了Dice损失：

L_{d i c e} = - \frac{\sum_{i = 1}^{n} y_{i} {\hat{y}}_{i}}{\sum_{i = 1}^{n} y_{i}^{2} + \sum_{i = 1}^{n} {\hat{y}}_{i}^{2}},

$y_{i}$ 是图像decoder的输出， ${\hat{y}}_{i}$ 的ground-truth。最后的损失函数表示为：

L = α L_{c e} + β L_{f o c a l} + δ L_{d i c e} .

实验

数据集

MVTec-AD 包含 15 个不同类别的 3629 张训练图像和 1725 张测试图像。
VisA 包含 12 个类别的 9621 张正常图像和 1200 张异常图像。

与之前的IAD方法一致，仅使用这些数据集中的正常数据进行训练。为了模拟异常样本，作者使用poisson图像编辑，相较于裁剪-粘贴，对于边缘处理更平滑。

参考文献

Gu, Zhaopeng, et al. "Anomalygpt: Detecting industrial anomalies using large vision-language models." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 3. 2024.