Cross-Entropy Loss (CrossEntropyLoss)

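Cross entropy is the loss function we use most often across deep learning frameworks. Entropy describes how disordered a system is, and cross entropy measures how close the predicted distribution is to the true one: the smaller the cross entropy, the closer the prediction is to the true distribution of the samples.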

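1 Loss computation for classification tasks

1.1 Single-label classification

Binary classification

In a single-label task, as the name suggests, each sample has exactly one label. Examples are ImageNet image classification and the MNIST handwritten-digit dataset, where every image has one fixed label. Binary classification is a special case of multi-class classification: there are only positive and negative samples and their probabilities sum to 1, so the network does not need to predict a vector and can output a single probability. The usual loss passes the output through a sigmoid activation and then applies the cross-entropy loss.

Take a cat-vs-dog binary classifier as an example. The output of the last layer should be read as the probability, according to the network, that the image contains an object of this class. The true label of each class has only two possible values, "not this class" and "this class", so it follows a Bernoulli (two-point) distribution with values 0 or 1, and the network's prediction can be read as the probability that the label is 1. When the network outputs logits = 2, the sigmoid gives a probability of about 0.88 that the image is a dog, and the cross-entropy loss is loss = -1×log(0.88) - 0×log(0.12) ≈ 0.13 (natural logarithm).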
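The arithmetic of this example can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the helper names `sigmoid` and `binary_cross_entropy` are illustrative and do not come from any framework. Note that sigmoid(2) is about 0.88 (0.9 to one decimal place).

```python
import math

def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(p, y):
    """p: predicted probability that the label is 1; y: true label (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

logit = 2.0                           # raw network output for "dog"
p_dog = sigmoid(logit)                # about 0.8808
loss = binary_cross_entropy(p_dog, 1) # true label is 1 ("dog"), about 0.1269

print(round(p_dog, 4), round(loss, 4))
```

With a true label of 0 the same prediction would cost -log(1 - 0.88), which is much larger, because the network would be confidently wrong.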

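Multi-class classification

In a multi-class task, the softmax function maps the outputs of several neurons (one per class) to their share of the total output (each in the range 0 to 1, so the shares can be read as probabilities). The class with the largest probability is taken as the prediction.

Consider a three-class task: the logits vector has one entry per class, and after softmax it becomes three probabilities that sum to 1, [0.9, 0.1, 0]. The true distribution of the sample "scissors" is [1, 0, 0], so the loss is loss = -1×log(0.9) - 0×log(0.1) - 0×log(0) ≈ 0.1, where 0×log(0) is taken to be 0 and log is the natural logarithm. If the network instead outputs the probabilities [0.1, 0.9, 0], the cross-entropy loss is loss = -1×log(0.1) - 0×log(0.9) - 0×log(0) ≈ 2.3. Comparing the two cases, the loss of the first distribution is clearly lower than that of the second, which says that the first is closer to the true distribution, as is indeed the case.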
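The two losses can be reproduced directly. The sketch below is plain Python with an illustrative helper `cross_entropy` (not a framework function); it skips classes whose true probability is 0, which is the usual convention for the 0×log(0) terms.

```python
import math

def cross_entropy(probs, target):
    """Cross entropy between a predicted distribution and a one-hot target.

    Terms with target probability 0 are skipped, i.e. 0 * log(0) is taken as 0.
    """
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

target = [1, 0, 0]                                  # true class: "scissors"
loss_good = cross_entropy([0.9, 0.1, 0.0], target)  # about 0.105
loss_bad = cross_entropy([0.1, 0.9, 0.0], target)   # about 2.303

print(round(loss_good, 3), round(loss_bad, 3))
```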

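1.2 Multi-label classification

In a multi-label task a sample can carry several labels. For example, an image that contains both a cat and a dog has both the "cat" and the "dog" label. In this case we use the sigmoid function as the output of the last layer and treat each neuron of that layer as one class of the task. For image recognition, the output of the last layer should be read as the probability, according to the network, that the image contains an object of this class. The true label of each class has only two possible values, "the image contains this class" and "the image does not contain this class", so it follows a Bernoulli (two-point) distribution. In summary, if each class of a multi-label task is analysed separately, the true distribution is Bernoulli with values 0 or 1, and the network's prediction can be read as the probability that the label is 1. Moreover, because the classes in a multi-label task are independent of each other, the probabilities output by the last layer do not sum to 1.

The multi-label example here has three labels: dog, cat, and pig. The input image contains no pig, so the true distribution is [1, 1, 0].

Suppose the network outputs the probabilities [0.95, 0.73, 0.05]. We can compute the cross-entropy loss for each of dog, cat, and pig separately and add them up to obtain the cross-entropy loss of this image.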

loss_dog = -1×log(0.95) - (1-1)×log(1-0.95) ≈ 0.05

loss_cat = -1×log(0.73) - (1-1)×log(1-0.73) ≈ 0.31

loss_pig = -0×log(0.05) - (1-0)×log(1-0.05) ≈ 0.05

loss_total = loss_dog + loss_cat + loss_pig = 0.05 + 0.31 + 0.05 = 0.41

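Now suppose the network outputs the probabilities [0.3, 0.5, 0.7]. The cross-entropy loss is then

loss_dog = -1×log(0.3) - (1-1)×log(1-0.3) ≈ 1.2

loss_cat = -1×log(0.5) - (1-1)×log(1-0.5) ≈ 0.7

loss_pig = -0×log(0.7) - (1-0)×log(1-0.7) ≈ 1.2

loss_total = loss_dog + loss_cat + loss_pig = 1.2 + 0.7 + 1.2 = 3.1

The two cases show that the closer the predicted distribution is to the true distribution, the smaller the cross-entropy loss, and the further away it is, the larger the loss.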
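Both totals can be recomputed with a short plain-Python sketch. The helper names `bce` and `multilabel_loss` are illustrative only. Without intermediate rounding the first total comes out at about 0.42, slightly above the 0.41 obtained by adding the rounded per-class values.

```python
import math

def bce(p, y):
    """Binary cross entropy of one class: p is the predicted probability, y is 0 or 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def multilabel_loss(probs, targets):
    """Sum of the independent per-class binary cross entropies."""
    return sum(bce(p, y) for p, y in zip(probs, targets))

targets = [1, 1, 0]                                       # dog, cat, pig
loss_close = multilabel_loss([0.95, 0.73, 0.05], targets) # about 0.417
loss_far = multilabel_loss([0.3, 0.5, 0.7], targets)      # about 3.101

print(round(loss_close, 3), round(loss_far, 3))
```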

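2 PyTorch implementations of the loss functions

The PyTorch loss functions are described in the official documentation, on the torch.nn page of the PyTorch 1.10 documentation.

2.1 nn.BCELoss

BCELoss computes the binary classification loss when the labels are only 1 or 0, with labels and predictions matched one to one. Note that before computing the loss with nn.BCELoss, the predictions must be passed through a sigmoid, which maps them into the range 0 to 1. If adding the sigmoid by hand is inconvenient, call nn.BCEWithLogitsLoss directly.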

# class
torch.nn.BCELoss(weight=None, size_average=None, reduce=None, reduction='mean')

# function
torch.nn.functional.binary_cross_entropy(input, target, weight=None, size_average=None, reduce=None, reduction='mean')

input (Tensor) – a tensor of any shape
target (Tensor) – same shape as input, with values between 0 and 1
weight (Tensor, optional) – manually specified weights
size_average (bool, optional) – deprecated
reduce (bool, optional) – deprecated
reduction (str, optional) – 'none': return the loss of every element in the minibatch without reducing; 'mean': average the losses over the minibatch; 'sum': sum the losses over the minibatch.

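When reduction = 'none', the loss is

$$\ell(x, y) = L = \{l_1, \dots, l_N\}^\top, \quad l_n = -w_n \left[ y_n \log x_n + (1 - y_n) \log(1 - x_n) \right]$$

where N is the batch size. When reduction is not 'none',

$$\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean'} \\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'} \end{cases}$$

Example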

import numpy as np
import torch
import torch.nn.functional as F

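# Predicted probabilities (already in the range 0-1) and binary targets
input = torch.Tensor([[0.6, 0.1], [0.3, 0.8]])
target = torch.Tensor([[0, 1], [1, 0]])

# Built-in binary cross entropy on probabilities
loss = F.binary_cross_entropy(input, target)
# loss : tensor(1.5081)

# Same value from the formula, averaged over the 4 elements
loss = torch.sum(-(target * torch.log(input) + (1 - target) * torch.log(1 - input))) / 4
# loss : tensor(1.5081)

# Same value by hand with NumPy
loss = -(np.log(0.4) + np.log(0.1) + np.log(0.3) + np.log(0.2)) / 4
# 1.5080716354070594

# Treating the same numbers as logits: apply the sigmoid first ...
loss = F.binary_cross_entropy(torch.sigmoid(input), target)
# loss : tensor(0.8518)

# ... or let binary_cross_entropy_with_logits apply it internally
loss = F.binary_cross_entropy_with_logits(input, target)
# loss : tensor(0.8518)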
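The three reduction modes can be spelled out without PyTorch. The following is a rough plain-Python sketch of the element-wise weighted loss and its reduction; `bce_elementwise` and `reduce_loss` are illustrative names, not PyTorch functions, and the data are the four values from the example flattened into a list.

```python
import math

def bce_elementwise(inputs, targets, weights=None):
    """Per-element loss l_n = -w_n * [y_n*log(x_n) + (1-y_n)*log(1-x_n)]."""
    if weights is None:
        weights = [1.0] * len(inputs)   # weight=None means w_n = 1
    return [-w * (y * math.log(x) + (1 - y) * math.log(1 - x))
            for x, y, w in zip(inputs, targets, weights)]

def reduce_loss(losses, reduction="mean"):
    """Apply the 'none' / 'mean' / 'sum' reduction to a list of losses."""
    if reduction == "none":
        return losses
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    raise ValueError(f"unknown reduction: {reduction}")

inputs = [0.6, 0.1, 0.3, 0.8]    # probabilities from the example, flattened
targets = [0, 1, 1, 0]
per_elem = bce_elementwise(inputs, targets)
loss_sum = reduce_loss(per_elem, "sum")     # about 6.0323
loss_mean = reduce_loss(per_elem, "mean")   # about 1.5081

print(round(loss_sum, 4), round(loss_mean, 4))
```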

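2.2 nn.CrossEntropyLoss

When using a neural network for n-class classification (n > 2), set the number of output units to n and use a softmax (which turns the whole output layer into a probability distribution with values between 0 and 1) followed by the multi-class cross-entropy loss. The labels are treated as one-hot vectors: the label of each sample is an n-dimensional vector with 1 at the position of its class and 0 elsewhere.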

# class
torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0)

# function
torch.nn.functional.cross_entropy(input, target, weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean', label_smoothing=0.0)

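input (Tensor) – shape (N, C) for plain classification and (N, C, H, W) in the 2D (image) case; in general (N, C, d1, d2, ..., dK) with K ≥ 1.
target (Tensor) – class indices with 0 ≤ target[i] ≤ C-1; for K ≥ 1 the shape is (N, d1, d2, ..., dK).
weight (Tensor, optional) – a manual rescaling weight for each class. If given, it must be a tensor of size C.
size_average (bool, optional) – deprecated. Default: True
ignore_index (int, optional) – a target value that is ignored and does not contribute to the input gradient. When size_average is True, the loss is averaged over the non-ignored targets. Default: -100
reduce (bool, optional) – deprecated. Default: True
reduction (string, optional) – the reduction applied to the output: 'none' | 'mean' | 'sum'. 'none': no reduction is applied; 'mean': the sum of the output is divided by the number of elements in the output (by the summed weights of the target classes when weight is given); 'sum': the output is summed. Note: size_average and reduce are being deprecated, and in the meantime specifying either of them overrides reduction. Default: 'mean'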

PyTorch implementation of cross_entropy

def cross_entropy(input, target, weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean'):
    if size_average is not None or reduce is not None:
        reduction = _Reduction.legacy_get_string(size_average, reduce)
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)

This shows that softmax + log + NLLLoss = CrossEntropyLoss.

softmax

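In a multi-class problem with c classes, the output layer produces c outputs x, and softmax turns them into probabilities that sum to 1. It generalises the binary sigmoid function to multiple classes, so that the result is expressed as probabilities. It is defined as

$$\operatorname{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{c} e^{x_j}}$$

Probabilities are non-negative and sum to 1. Softmax therefore first exponentiates the model's outputs, which guarantees non-negativity, and then normalises the results so that the probabilities sum to 1. For example, for the three-class output [3, 1, -3], exponentiating gives [20.09, 2.72, 0.05] and normalising gives [0.88, 0.12, 0].

When the input x contains a very large x_i, e^{x_i} overflows; when every element of x is a very negative number, the denominator underflows to 0 and the division fails. The usual remedy is to subtract the maximum before exponentiating:

$$\operatorname{softmax}(x_i) = \frac{e^{x_i - x_{\max}}}{\sum_{j=1}^{c} e^{x_j - x_{\max}}}$$

where x_max is the largest element of the input x. For any x_i, the exponent is at most 0 after subtracting x_max, so e^{x_i - x_max} cannot overflow, and the denominator contains at least one term equal to 1, so it cannot be 0.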
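The max-subtraction trick is easy to try out. This is a minimal plain-Python sketch, not the PyTorch implementation; the function name `softmax` is reused here only for illustration. Without the shift, `math.exp(1000.0)` raises an OverflowError and `math.exp(-1000.0)` returns 0.0, which would make the denominator zero.

```python
import math

def softmax(x):
    """Numerically stable softmax: shift by the maximum before exponentiating."""
    x_max = max(x)
    exps = [math.exp(v - x_max) for v in x]   # every exponent is <= 0
    total = sum(exps)                         # contains at least one term equal to 1
    return [e / total for e in exps]

probs = softmax([3.0, 1.0, -3.0])
print([round(p, 2) for p in probs])           # [0.88, 0.12, 0.0]

# Inputs that would overflow or underflow without the shift
big = softmax([1000.0, 1001.0, 1002.0])
small = softmax([-1000.0, -1001.0, -1002.0])
print(round(sum(big), 6), round(sum(small), 6))
```

Shifting all inputs by the same constant does not change the result, because the factor e^{-x_max} cancels between numerator and denominator.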

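NLLLoss

NLLLoss takes a vector of log-probabilities and a target label. It picks out the value of the output above that corresponds to the label, negates it, and averages over the samples. The label does not need to be one-hot encoded, because nll_loss already does the equivalent: from the log(softmax(input)) matrix it takes, for each sample, the entry at the index given by target (the position that is 1 in the one-hot vector, all other positions being 0).

Example: NLLLoss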

import torch
import torch.nn.functional as F

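# 1D: input has shape (N, C) = (2, 3), target has shape (N,)
# Raw numbers stand in for log-probabilities to keep the arithmetic visible
input = torch.Tensor([[2, 3, 1], [3, 7, 9]])
target = torch.tensor([1, 2])
loss = F.nll_loss(input, target)
# loss: tensor(-6.)

# 2D: input has shape (N, C, d1) = (2, 2, 2), target has shape (N, d1)
input = torch.Tensor([[[2, 3],
                       [1, 5]],
                      [[3, 7],
                       [1, 9]]])
target = torch.tensor([[1, 1],
                       [0, 0]])
loss = F.nll_loss(input, target)
# loss: tensor(-4.)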

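In the one-dimensional case, nll_loss takes from each row of input the entry at the index given in target and negates it. The first target is 1, so it picks the entry at index 1 of [2, 3, 1], which is 3; the second target is 2, so it picks the entry at index 2 of [3, 7, 9], which is 9. The two values are averaged (with reduction='mean') and negated, giving -6. In the two-dimensional case (the input is an image), loss = [(-3) + (-7) + (-1) + (-5)] / 4 = -4.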
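The pick, negate, and average rule, and the identity cross_entropy = nll_loss(log_softmax(...)), can be sketched in plain Python. The functions below reuse the PyTorch names for readability only; they are simplified stand-ins for the (N, C) case with reduction='mean', not the library code.

```python
import math

def log_softmax(row):
    """Stable log-softmax of one row of logits."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum_exp for v in row]

def nll_loss(log_probs, targets):
    """Pick the entry at the target index of each row, negate, and average."""
    picked = [row[t] for row, t in zip(log_probs, targets)]
    return -sum(picked) / len(picked)

def cross_entropy(logits, targets):
    """cross_entropy = nll_loss applied to log_softmax of the logits."""
    return nll_loss([log_softmax(row) for row in logits], targets)

rows = [[2.0, 3.0, 1.0], [3.0, 7.0, 9.0]]
targets = [1, 2]

raw_nll = nll_loss(rows, targets)   # -(3 + 9) / 2 = -6.0, as in the example
ce = cross_entropy(rows, targets)   # a proper, non-negative cross-entropy value

print(raw_nll, round(ce, 4))
```

The value -6 appears only because raw numbers were fed to nll_loss in place of log-probabilities; once log_softmax is applied first, the loss is non-negative.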

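3 The weight parameter of the loss functions

The weight parameter compensates for large differences in the number of samples per class. In semantic segmentation, for example, background pixels far outnumber defect pixels, and simply adding the losses of the two classes makes the model overfit the background. In classification, OK samples may be plentiful while NG samples are scarce; once the ratio exceeds 10, class imbalance should be taken into account. Suppose there are two classes with labels 0 and 1, and 1000 and 10 samples respectively. If the network predicts label 0 for all 1010 samples, its accuracy is 1000/1010 ≈ 0.99, close to 100%. Training therefore pushes the model toward fitting label 0, so it overfits class 0, underfits class 1, and generalises poorly.

How can this be fixed?

Give more weight to classes with few training images, so that the network is penalised more when it mislabels them.
Give less weight to classes with many images.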
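One common way to turn these two rules into numbers is inverse-frequency weighting. The sketch below is an assumption for illustration, not a scheme prescribed by this article or by PyTorch: `balanced_class_weights` is a hypothetical helper that uses the heuristic total / (num_classes * count).

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * count) for each class."""
    total = sum(counts)
    num_classes = len(counts)
    return [total / (num_classes * c) for c in counts]

counts = [1000, 10]                        # samples of label 0 and label 1
weights = balanced_class_weights(counts)   # rare class gets the larger weight
always_zero_accuracy = counts[0] / sum(counts)

print(weights)                             # [0.505, 50.5]
print(round(always_zero_accuracy, 4))      # 0.9901
```

The resulting list could then be converted to a tensor and passed as the weight argument of the loss; the exact scale matters less than the ratio between the classes.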

3.1 The weight parameter of cross_entropy

The weight parameter of cross_entropy assigns a different weight to each class in a classification problem.

Example: the weight parameter of cross_entropy

import torch
import torch.nn.functional as F
input = torch.Tensor([[[1, 2], [3, 5]], [[4, 7], [4, 6]]])  # torch.Size([2, 2, 2])
target = torch.tensor([[0, 0], [1, 1]])  # torch.Size([2, 2])
weight = torch.tensor([1.0, 9.0])
loss = F.cross_entropy(input, target)  # tensor(1.7955)
loss = F.cross_entropy(input, target, weight) # tensor(1.1617)
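Both numbers can be reproduced by hand, which also shows what 'mean' divides by once weights are involved. The sketch below is plain Python; `pixel_loss` is an illustrative helper, and the four (logits, target) pairs are read off the (N, C, d1) tensor in the example, one pair per position.

```python
import math

def pixel_loss(logits, target):
    """Cross entropy of one position: log-sum-exp of the logits minus the target logit."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

# ([class-0 logit, class-1 logit], target) for each of the 4 positions
pixels = [([1.0, 3.0], 0), ([2.0, 5.0], 0), ([4.0, 4.0], 1), ([7.0, 6.0], 1)]
class_weight = [1.0, 9.0]

losses = [pixel_loss(logits, t) for logits, t in pixels]
plain_mean = sum(losses) / len(losses)                        # about 1.7955

# With weights, the weighted sum is divided by the sum of the weights used
weighted_sum = sum(class_weight[t] * l for (_, t), l in zip(pixels, losses))
weight_total = sum(class_weight[t] for _, t in pixels)        # 1 + 1 + 9 + 9 = 20
weighted_mean = weighted_sum / weight_total                   # about 1.1617

print(round(plain_mean, 4), round(weighted_mean, 4))
```

The weighted loss is lower here because the two heavily weighted class-1 positions happen to have the smaller individual losses.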

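3.2 The weight parameter of binary_cross_entropy

The official PyTorch description of weight is "if provided it's repeated to match input tensor shape", that is, once weight is given, its shape is matched to the shape of input. By default (weight=None), w_n = 1 in the BCELoss formula of section 2.1; when weight is not None, we need to assign a weight w_n to every sample.

Example: the weight parameter of binary_cross_entropy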

import torch
import torch.nn.functional as F

input = torch.rand(3, 3)  
target = torch.rand(3, 3).random_(2)
w = [0.1, 0.9]  # weights for label 0 and label 1
weight = torch.zeros(target.shape)  # per-element weight matrix
for i in range(target.shape[0]):
    for j in range(target.shape[1]):
        weight[i][j] = w[int(target[i][j])]
loss = F.binary_cross_entropy(input, target, weight=weight)

"""
# input
tensor([[0.1531, 0.3302, 0.7537],
        [0.2200, 0.6875, 0.2268],
        [0.5109, 0.5873, 0.9275]])
# target
tensor([[1., 0., 0.],
        [0., 0., 1.],
        [0., 1., 0.]])
# weight
tensor([[0.9000, 0.1000, 0.1000],
        [0.1000, 0.1000, 0.9000],
        [0.1000, 0.9000, 0.1000]])
# loss
tensor(0.4621)
"""

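4 Binary tasks: one output channel with sigmoid, or two output channels with softmax?

When a semantic segmentation task is binary, there are two options: (1) the last convolution layer outputs a 1-channel feature map, which goes through a sigmoid and then binary_cross_entropy; (2) the last convolution layer outputs a 2-channel feature map, softmax is taken along the channel dimension, and the loss is computed with cross_entropy (which applies the softmax internally). Which of the two is better?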

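4.1 Theory

Zhihu link: https://www.zhihu.com/question/295247085/answer/1778398778

First we show in theory that the two are essentially the same. For binary classification, take the probability assigned to the first class, with input x1, as the example.

Sigmoid function:

$$\sigma(x_1) = \frac{1}{1 + e^{-x_1}}$$

Softmax function:

$$\operatorname{softmax}(x_1) = \frac{e^{x_1}}{e^{x_1} + e^{x_2}} = \frac{1}{1 + e^{-(x_1 - x_2)}}$$

The softmax over two inputs is therefore the sigmoid of their difference x1 - x2, so a 2-channel softmax output carries the same information as a 1-channel sigmoid output.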
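The relationship between the two functions can be checked numerically: dividing the numerator and denominator of the two-class softmax by e^{x1} gives 1 / (1 + e^{-(x1 - x2)}), which is the sigmoid of x1 - x2. The sketch below is plain Python with illustrative helper names and arbitrary test values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax_first(x1, x2):
    """Probability of the first class under a two-class softmax (stable form)."""
    m = max(x1, x2)
    e1, e2 = math.exp(x1 - m), math.exp(x2 - m)
    return e1 / (e1 + e2)

pairs = [(2.0, 0.5), (-1.0, 3.0), (0.0, 0.0)]   # arbitrary (x1, x2) logit pairs
max_gap = max(abs(softmax_first(a, b) - sigmoid(a - b)) for a, b in pairs)

print(max_gap < 1e-12)   # True: softmax over two logits equals sigmoid of their difference
```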

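4.2 Experiment

Code: WZMIAOMIAO/deep-learning-for-image-processing/pytorch_segmentation/unet/

DRIVE dataset: https://pan.baidu.com/s/1Tjkrx2B9FgoJk0KviA-rDw password: 8no8

Video walkthrough: https://www.bilibili.com/video/BV1Vq4y127fB

The test uses the UNet code by the Bilibili uploader 霹雳吧啦Wz, whose original version outputs 2 channels followed by softmax. The network was modified as follows to output 1 channel instead, keeping the same evaluation metric.

Loss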

import torch
from torch import nn

# Loss for the two-channel output
def criterion(inputs, target, loss_weight=None, num_classes: int = 2, dice: bool = False, ignore_index: int = -100):
    losses = {}
    for name, x in inputs.items():
        # Ignore pixels whose target value is 255 (object borders or padding)
        loss = nn.functional.cross_entropy(x, target, ignore_index=ignore_index, weight=loss_weight)
        losses[name] = loss

    if len(losses) == 1:
        return losses['out']
    return losses['out'] + 0.5 * losses['aux']

# Loss for the one-channel output
def criterion(inputs, target, loss_weight=None, num_classes: int = 2, dice: bool = False, ignore_index: int = -100):
    losses = {}
    # Set the don't-care region (255) to 0, on a copy so the caller's target is untouched
    target = target.clone()
    target[torch.eq(target, ignore_index)] = 0
    # Flatten so that target and x have the same shape
    target = target.reshape(-1).float()
    for name, x in inputs.items():
        x = x.reshape(-1).float()
        loss = nn.functional.binary_cross_entropy_with_logits(x, target)
        losses[name] = loss

    if len(losses) == 1:
        return losses['out']
    return losses['out'] + 0.5 * losses['aux']

Evaluation metric

# Two-channel
def evaluate(model, data_loader, device, num_classes):
    model.eval()
    metric_logger = utils.MetricLogger(delimiter="  ")
    header = 'Test:'
    with torch.no_grad():
        for image, target in metric_logger.log_every(data_loader, 100, header):
            image, target = image.to(device), target.to(device)
            output0 = model(image)['out']
            output1 = torch.softmax(output0, dim=1)
            output2 = output1.argmax(dim=1).float()  # compute dice_coeff for the foreground only
            target = target.float()
            dice = dice_coeff(output2, target, ignore_index=255)

    # Note: this is the Dice coefficient of the last batch only
    return dice.item()

# One-channel
def evaluate(model, data_loader, device, num_classes):
    model.eval()
    metric_logger = utils.MetricLogger(delimiter="  ")
    header = 'Test:'
    with torch.no_grad():
        for image, target in metric_logger.log_every(data_loader, 100, header):
            image, target = image.to(device), target.to(device)
            logits = model(image)['out']   # raw logits (the loss uses binary_cross_entropy_with_logits)
            prob = torch.sigmoid(logits)   # probabilities in the range 0-1
            output = (prob > 0.5).float()  # binarise: 1 = foreground, 0 = background
            target = target.float()
            dice = dice_coeff(output, target, ignore_index=255)

    # Note: this is the Dice coefficient of the last batch only
    return dice.item()
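When binarising a one-channel output it matters whether the threshold is applied to the probability or to the raw logit. A threshold of 0.5 on the sigmoid probability is the same rule as a threshold of 0 on the logit, while a threshold of 0.5 on the raw logit is a different, stricter rule. The plain-Python sketch below illustrates this with made-up logit values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [-2.0, -0.1, 0.3, 0.7, 4.0]   # hypothetical raw outputs for five pixels

by_prob = [1 if sigmoid(z) > 0.5 else 0 for z in logits]   # threshold the probability
by_logit = [1 if z > 0 else 0 for z in logits]             # equivalent threshold on the logit
raw_at_half = [1 if z > 0.5 else 0 for z in logits]        # thresholding the logit at 0.5

print(by_prob)       # [0, 0, 1, 1, 1]
print(by_logit)      # [0, 0, 1, 1, 1]
print(raw_at_half)   # [0, 0, 0, 1, 1]
```

The pixel with logit 0.3 has probability above 0.5 and counts as foreground under the first two rules, but is dropped when 0.5 is applied to the raw logit.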

Prediction

# Two-channel
...
output = model(img.to(device))
prediction = output['out'].argmax(1).squeeze(0)
prediction = prediction.to("cpu").numpy().astype(np.uint8)
# Set foreground pixels to 255 (white)
prediction[prediction == 1] = 255
# Set pixels outside the region of interest to 0 (black)
prediction[roi_img == 0] = 0
mask = Image.fromarray(prediction)

# One-channel
...
output = model(img.to(device))
prediction = torch.sigmoid(output['out']).squeeze(0).squeeze(0)
prediction = prediction.to("cpu").numpy()
# Binarise at 0.5: foreground becomes 1, background 0 (scaled to 255 when the mask is built)
prediction[prediction > 0.5] = 1
prediction[prediction < 1] = 0
# Set pixels outside the region of interest to 0 (black)
prediction[roi_img == 0] = 0
mask = Image.fromarray((prediction*255).astype(np.uint8))

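With batch_size 16 and no dice_loss, the two variants perform about the same after 150 epochs of training, and their inference times are the same as well.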