Paper Notes on RRU-Net: The Ringed Residual U-Net for Image Splicing Forgery Detection
Abstract
- The proposed RRU-Net is an end-to-end image essence attribute segmentation network that is independent of the human visual system; it can accomplish forgery detection without any pre-processing or post-processing. The core idea of RRU-Net is to strengthen the learning of the CNN, inspired by the recall and consolidation mechanisms of the human brain and implemented through the propagation and feedback of residuals in the CNN. Residual propagation recalls the input feature information to solve the gradient degradation problem in deeper networks; residual feedback consolidates the input feature information to make the differences in image attributes between the un-tampered and tampered regions more obvious.
1. Introduction
- To improve the localization of tampered regions, the detection methods in [1, 27] use non-overlapping image patches as the input to CNNs. However, when an image patch comes entirely from a tampered region, it will be judged as un-tampered, since it contains no splicing boundary. In [15], the authors use larger image patches to reveal the image attributes of the tampered regions, but this may fail when the forged region is small. Because the existing CNN-based detection methods use image patches as the network input, contextual spatial information is lost, which easily causes incorrect predictions. Moreover, when the network architecture gets deeper, the gradient degradation problem appears and the discrimination of features weakens, which makes splicing forgery detection more difficult or even causes it to fail.
- To overcome the drawbacks of traditional feature extraction-based methods and to further solve the problems of current CNN-based detection methods, a ringed residual U-Net (RRU-Net) is proposed in this paper. RRU-Net is an end-to-end image essence attribute segmentation network that is independent of the human visual system; it can directly locate forged regions without any pre-processing or post-processing. Furthermore, RRU-Net can effectively reduce incorrect predictions since it makes better use of the contextual spatial information in an image. Most of all, the ringed residual structure in RRU-Net can strengthen the learning of the CNN while preventing the gradient degradation problem of deeper networks, which ensures that the discrimination of image essence attribute features remains obvious as the features are extracted across the layers of the network.
3. The Ringed Residual U-Net (RRU-Net)
3.1. Residual Propagation
According to the discussion above, the differences in image essence attributes are the essential basis for detecting image splicing forgery; however, the gradient degradation problem destroys this basis when the network architecture gets deeper. To solve the gradient degradation problem, we add residual propagation to each stack of layers. A building block is shown in Fig. 2, which consists of two convolutional (dilated convolution [31], dconv) layers and residual propagation. The output of the building block is defined as:

\[ y_{f} = F\left(x, \left\{W_{i}\right\}\right) + W_{s} * x \tag{2} \]
where \(x\) and \(y_{f}\) are the input and output of the building block, \(W_{i}\) represents the weights of layer \(i\), and the function \(F\left(x,\left\{W_{i}\right\}\right)\) represents the residual mapping to be learned. For the example in Fig. 2, which has two convolutional layers, \(F = W_{2} * \sigma\left(W_{1} * x\right)\), in which \(\sigma\) denotes ReLU [19] and the biases are omitted to simplify notation. The linear projection \(W_{s}\) is used to change the dimension of \(x\) to match the dimension of \(F\left(x,\left\{W_{i}\right\}\right)\). The operation \(F + W_{s} * x\) is performed by a shortcut connection and element-wise addition.
Residual propagation resembles the recall mechanism of the human brain: we may forget previous knowledge as we learn new knowledge, so we need the recall mechanism to help us revive those fading memories.
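As a minimal PyTorch sketch (layer sizes are illustrative, not the paper's exact configuration), Eq.(2) can be implemented as below; it mirrors the RU_first_down block in the implementation at the end of this post, where res_conv plays the role of the 1x1 projection \(W_s\):

import torch
import torch.nn as nn

class ResidualPropagation(nn.Module):
    """Residual propagation, Eq.(2): y_f = F(x, {W_i}) + W_s * x."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # F(x, {W_i}): the residual mapping, two stacked conv layers
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        # W_s: 1x1 linear projection to match the shortcut's channel dimension
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        # shortcut connection and element-wise addition, then ReLU
        return torch.relu(self.body(x) + self.shortcut(x))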
3.2. Residual Feedback
It is obvious that, in splicing forgery detection, if the differences in image essence attributes between the un-tampered and tampered regions can be further strengthened, the detection performance can be further improved. In [36], the proposed method superposes an additional difference in the noise attribute by passing the forged image through an SRM filter layer to enhance the detection results. The SRM filter layer has a certain effect; however, it is a manually chosen filter and works only for RGB image forgery detection. Moreover, when the un-tampered and tampered regions come from cameras of the same brand and model, the effectiveness of the SRM filter layer drops sharply, since the two regions share the same noise attribute. To further strengthen the differences in image essence attributes, residual feedback is proposed, which is an automatic learning method and does not focus on only one or several specific image attributes. Furthermore, we design a simple and effective attention mechanism, which takes advantage of the ideas of Hu et al. [9], and add it to the residual feedback to pay more attention to the discriminative features of the input information. In this attention mechanism, we employ a simple gating mechanism with a sigmoid activation function to learn a nonlinear interaction between discriminative feature channels and avoid the diffusion of feature information, and then we superpose the response values obtained by the sigmoid activation on the input information to amplify the differences in image essence attributes between the un-tampered and tampered regions. The residual feedback in a building block is shown in Fig. 3 and is defined as Eq.(3):

\[ y_{b} = x + s\left(G\left(y_{f}\right)\right) * x \tag{3} \]
where \(x\) is the input, \(y_{f}\) is the output of the residual propagation defined in Eq.(2), and \(y_{b}\) is the enhanced input. The function \(G\) is a linear projection used to change the dimensions of \(y_{f}\), and \(s\) is the sigmoid activation function. In contrast to the recall mechanism imitated by the residual propagation, the residual feedback acts like the consolidation mechanism of the human brain: we need to consolidate the knowledge we have already learned to obtain a new comprehension of the features. The residual feedback can amplify the differences in image essence attributes between the un-tampered and tampered regions in the input; as shown in Fig. 1(c), the tampered region 'eagle' is amplified to the globally maximal response values by the residual feedback. Furthermore, it also has two far-reaching effects:
(1) the strengthening of the discriminative features can simultaneously be viewed as the repression of the negative label features;
(2) the network converges faster during training.
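The residual feedback of Eq.(3) is equally small. A minimal sketch (again mirroring the res_conv_back projection used in the full implementation below, with \(G\) assumed to be a 1x1 convolution):

import torch
import torch.nn as nn

class ResidualFeedback(nn.Module):
    """Residual feedback, Eq.(3): y_b = x + s(G(y_f)) * x."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # G: 1x1 linear projection mapping y_f back to the input's channels
        self.proj_back = nn.Conv2d(out_ch, in_ch, 1, bias=False)

    def forward(self, x, y_f):
        # sigmoid gate over the propagated features, superposed on the input
        return (1 + torch.sigmoid(self.proj_back(y_f))) * x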
3.3. Ringed Residual Structure and Network Architectures
- The proposed ringed residual structure, which combines the residual propagation and the residual feedback, is shown in Fig. 4.
- To sum up, the ringed residual structure guarantees that the discrimination of image essence attribute features remains obvious as the features are extracted across the layers of the network, which achieves better and more stable detection performance than both traditional feature extraction-based detection methods and existing CNN-based detection methods. The network architecture of RRU-Net is shown in Fig. 5; it is an end-to-end image essence attribute segmentation network that can detect splicing forgery directly, without any pre-processing or post-processing.
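One ring of the structure then runs propagation, feedback, and propagation again over the enhanced input. A minimal sketch reusing the ResidualPropagation and ResidualFeedback classes from the sketches above (note that the two propagation passes share weights, as in RRU_down.forward in the full code below):

import torch.nn as nn

class RingedResidualBlock(nn.Module):
    """One ringed residual unit: propagation -> feedback -> propagation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.propagate = ResidualPropagation(in_ch, out_ch)  # Eq.(2)
        self.feedback = ResidualFeedback(in_ch, out_ch)      # Eq.(3)

    def forward(self, x):
        y_f = self.propagate(x)    # recall: residual propagation
        x = self.feedback(x, y_f)  # consolidate: enhance the input
        return self.propagate(x)   # propagate again, weights shared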
4.1. Detection at Pixel Level
4.2. Detection at Image Level
5. Conclusion
- In this paper, we propose a ringed residual U-Net (RRU-Net) for image splicing forgery detection, which is an end-to-end image essence attribute segmentation network that can achieve forgery detection without any pre-processing or post-processing. Inspired by the recall and consolidation mechanisms of the human brain, the proposed RRU-Net strengthens the learning of the CNN through the propagation and feedback of residuals. We also prove the validity of the ringed residual structure in RRU-Net through theoretical analysis and experimental comparison. In future work, we will further explore and visualize the latent discriminative features between tampered and un-tampered regions to explain the key issues of image splicing forgery detection.
PyTorch Implementation of the Model Architecture
import torch
import torch.nn as nn
import torch.nn.functional as F


# ~~~~~~~~~~ U-Net ~~~~~~~~~~
class U_double_conv(nn.Module):
    """(conv 3x3 -> BN -> ReLU) * 2, the standard U-Net double convolution."""
    def __init__(self, in_ch, out_ch):
        super(U_double_conv, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        x = self.conv(x)
        return x
class inconv(nn.Module):
    """Input block: a plain double convolution."""
    def __init__(self, in_ch, out_ch):
        super(inconv, self).__init__()
        self.conv = U_double_conv(in_ch, out_ch)

    def forward(self, x):
        x = self.conv(x)
        return x


class U_down(nn.Module):
    """Downsampling block: 2x2 max pooling followed by a double convolution."""
    def __init__(self, in_ch, out_ch):
        super(U_down, self).__init__()
        self.mpconv = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            U_double_conv(in_ch, out_ch)
        )

    def forward(self, x):
        x = self.mpconv(x)
        return x
class U_up(nn.Module):
    """Upsampling block: upsample, pad to match the skip connection, concatenate, double conv."""
    def __init__(self, in_ch, out_ch, bilinear=True):
        super(U_up, self).__init__()
        # Upsampling could also be learned (nn.ConvTranspose2d), but the
        # extra weights may not fit in memory on a small machine.
        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        else:
            self.up = nn.ConvTranspose2d(in_ch // 2, in_ch // 2, 2, stride=2)
        self.conv = U_double_conv(in_ch, out_ch)

    def forward(self, x1, x2):
        x1 = self.up(x1)
        # Pad x1 so its spatial size matches the skip connection x2.
        # Note: diffX is the height difference (dim 2) and diffY the width
        # difference (dim 3); F.pad takes (left, right, top, bottom).
        diffX = x2.size()[2] - x1.size()[2]
        diffY = x2.size()[3] - x1.size()[3]
        x1 = F.pad(x1, (diffY, 0, diffX, 0))
        x = torch.cat([x2, x1], dim=1)
        x = self.conv(x)
        return x
# ~~~~~~~~~~ RU-Net ~~~~~~~~~~
class RU_double_conv(nn.Module):
    """Double convolution without the final ReLU; the ReLU is applied after
    the shortcut is added (see Eq. 2)."""
    def __init__(self, in_ch, out_ch):
        super(RU_double_conv, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        x = self.conv(x)
        return x


class RU_first_down(nn.Module):
    """First block: residual propagation without pooling."""
    def __init__(self, in_ch, out_ch):
        super(RU_first_down, self).__init__()
        self.conv = RU_double_conv(in_ch, out_ch)
        self.relu = nn.ReLU(inplace=True)
        # W_s: 1x1 shortcut projection matching the channel dimension
        self.res_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        # residual propagation: y_f = F(x) + W_s * x
        ft1 = self.conv(x)
        r1 = self.relu(ft1 + self.res_conv(x))
        return r1
class RU_down(nn.Module):
    """Downsampling block with residual propagation."""
    def __init__(self, in_ch, out_ch):
        super(RU_down, self).__init__()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.conv = RU_double_conv(in_ch, out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.res_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))

    def forward(self, x):
        x = self.maxpool(x)
        # residual propagation: y_f = F(x) + W_s * x
        ft1 = self.conv(x)
        r1 = self.relu(ft1 + self.res_conv(x))
        return r1
class RU_up(nn.Module):
    """Upsampling block with residual propagation."""
    def __init__(self, in_ch, out_ch, bilinear=False):
        super(RU_up, self).__init__()
        # nn.Upsample has no weights to learn; nn.ConvTranspose2d does.
        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        else:
            self.up = nn.ConvTranspose2d(in_ch // 2, in_ch // 2, 2, stride=2)
        self.conv = RU_double_conv(in_ch, out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.res_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.GroupNorm(32, out_ch))

    def forward(self, x1, x2):
        x1 = self.up(x1)
        # Pad x1 to match the skip connection x2 (see U_up for the convention).
        diffX = x2.size()[2] - x1.size()[2]
        diffY = x2.size()[3] - x1.size()[3]
        x1 = F.pad(x1, (diffY, 0, diffX, 0))
        x = torch.cat([x2, x1], dim=1)
        # residual propagation: y_f = F(x) + W_s * x
        ft1 = self.conv(x)
        r1 = self.relu(self.res_conv(x) + ft1)
        return r1
# ~~~~~~~~~~ RRU-Net ~~~~~~~~~~
class RRU_double_conv(nn.Module):
    """Double dilated convolution (dconv) with GroupNorm; the final ReLU is
    applied after the shortcut is added."""
    def __init__(self, in_ch, out_ch):
        super(RRU_double_conv, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2),
            nn.GroupNorm(32, out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=2, dilation=2),
            nn.GroupNorm(32, out_ch)
        )

    def forward(self, x):
        x = self.conv(x)
        return x


class RRU_first_down(nn.Module):
    """First ringed residual block: propagation -> feedback -> propagation."""
    def __init__(self, in_ch, out_ch):
        super(RRU_first_down, self).__init__()
        self.conv = RRU_double_conv(in_ch, out_ch)
        self.relu = nn.ReLU(inplace=True)
        # W_s: 1x1 shortcut projection for the residual propagation (Eq. 2)
        self.res_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.GroupNorm(32, out_ch)
        )
        # G: 1x1 projection back to the input channels for the feedback (Eq. 3)
        self.res_conv_back = nn.Sequential(
            nn.Conv2d(out_ch, in_ch, kernel_size=1, bias=False)
        )

    def forward(self, x):
        # residual propagation: y_f = F(x) + W_s * x
        ft1 = self.conv(x)
        r1 = self.relu(ft1 + self.res_conv(x))
        # residual feedback: y_b = (1 + sigmoid(G(y_f))) * x
        ft2 = self.res_conv_back(r1)
        x = torch.mul(1 + torch.sigmoid(ft2), x)
        # second propagation over the enhanced input (weights shared with the first)
        ft3 = self.conv(x)
        r3 = self.relu(ft3 + self.res_conv(x))
        return r3
class RRU_down(nn.Module):
    """Downsampling ringed residual block."""
    def __init__(self, in_ch, out_ch):
        super(RRU_down, self).__init__()
        self.conv = RRU_double_conv(in_ch, out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.res_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.GroupNorm(32, out_ch))
        self.res_conv_back = nn.Sequential(
            nn.Conv2d(out_ch, in_ch, kernel_size=1, bias=False))

    def forward(self, x):
        x = self.pool(x)
        # residual propagation
        ft1 = self.conv(x)
        r1 = self.relu(ft1 + self.res_conv(x))
        # residual feedback
        ft2 = self.res_conv_back(r1)
        x = torch.mul(1 + torch.sigmoid(ft2), x)
        # second propagation over the enhanced input
        ft3 = self.conv(x)
        r3 = self.relu(ft3 + self.res_conv(x))
        return r3
class RRU_up(nn.Module):
    """Upsampling ringed residual block."""
    def __init__(self, in_ch, out_ch, bilinear=False):
        super(RRU_up, self).__init__()
        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        else:
            self.up = nn.Sequential(
                nn.ConvTranspose2d(in_ch // 2, in_ch // 2, 2, stride=2),
                nn.GroupNorm(32, in_ch // 2))
        self.conv = RRU_double_conv(in_ch, out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.res_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.GroupNorm(32, out_ch))
        self.res_conv_back = nn.Sequential(
            nn.Conv2d(out_ch, in_ch, kernel_size=1, bias=False))

    def forward(self, x1, x2):
        x1 = self.up(x1)
        # Pad x1 to match the skip connection x2 (see U_up for the convention).
        diffX = x2.size()[2] - x1.size()[2]
        diffY = x2.size()[3] - x1.size()[3]
        x1 = F.pad(x1, (diffY, 0, diffX, 0))
        x = self.relu(torch.cat([x2, x1], dim=1))
        # residual propagation
        ft1 = self.conv(x)
        r1 = self.relu(self.res_conv(x) + ft1)
        # residual feedback
        ft2 = self.res_conv_back(r1)
        x = torch.mul(1 + torch.sigmoid(ft2), x)
        # second propagation over the enhanced input
        ft3 = self.conv(x)
        r3 = self.relu(ft3 + self.res_conv(x))
        return r3
# !!!!!!!!!!!! Universal functions !!!!!!!!!!!!
class outconv(nn.Module):
    """Final 1x1 convolution mapping features to class scores."""
    def __init__(self, in_ch, out_ch):
        super(outconv, self).__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        x = self.conv(x)
        return x
# The network definitions below live in a separate file and import the
# building blocks above via models.rrunet_parts.
from models.rrunet_parts import *
import torch.nn as nn


class Unet(nn.Module):
    def __init__(self, n_channels, n_classes):
        super(Unet, self).__init__()
        self.inc = inconv(n_channels, 64)
        self.down1 = U_down(64, 128)
        self.down2 = U_down(128, 256)
        self.down3 = U_down(256, 512)
        self.down4 = U_down(512, 512)
        self.up1 = U_up(1024, 256)
        self.up2 = U_up(512, 128)
        self.up3 = U_up(256, 64)
        self.up4 = U_up(128, 64)
        self.out = outconv(64, n_classes)

    def forward(self, x):
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)
        x = self.up1(x5, x4)
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)
        x = self.out(x)
        return x
class Res_Unet(nn.Module):
    def __init__(self, n_channels, n_classes):
        super(Res_Unet, self).__init__()
        self.down = RU_first_down(n_channels, 32)
        self.down1 = RU_down(32, 64)
        self.down2 = RU_down(64, 128)
        self.down3 = RU_down(128, 256)
        self.down4 = RU_down(256, 256)
        self.up1 = RU_up(512, 128)
        self.up2 = RU_up(256, 64)
        self.up3 = RU_up(128, 32)
        self.up4 = RU_up(64, 32)
        self.out = outconv(32, n_classes)

    def forward(self, x):
        x1 = self.down(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)
        x = self.up1(x5, x4)
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)
        x = self.out(x)
        return x


class Ringed_Res_Unet(nn.Module):
    def __init__(self, n_channels=3, n_classes=1):
        super(Ringed_Res_Unet, self).__init__()
        self.down = RRU_first_down(n_channels, 32)
        self.down1 = RRU_down(32, 64)
        self.down2 = RRU_down(64, 128)
        self.down3 = RRU_down(128, 256)
        self.down4 = RRU_down(256, 256)
        self.up1 = RRU_up(512, 128)
        self.up2 = RRU_up(256, 64)
        self.up3 = RRU_up(128, 32)
        self.up4 = RRU_up(64, 32)
        self.out = outconv(32, n_classes)

    def forward(self, x):
        x1 = self.down(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)
        x = self.up1(x5, x4)
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)
        x = self.out(x)
        return x
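A quick smoke test of the full model (a usage sketch; the 256x256 input size and 0.5 threshold are illustrative choices, not settings from the paper):

import torch

# 3 input channels (RGB), 1 output channel (tamper-mask logits)
net = Ringed_Res_Unet(n_channels=3, n_classes=1)

x = torch.randn(1, 3, 256, 256)   # dummy RGB batch
logits = net(x)                   # shape: (1, 1, 256, 256)

# per-pixel tamper probability, thresholded into a binary mask
mask = (torch.sigmoid(logits) > 0.5).float()
print(logits.shape, mask.shape)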