2. Dynamic routing algorithm
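- Initialization: create a temporary variable $b$, an $i \times j$ matrix of zeros, with one entry $b_{ij}$ for each pair of input capsule $i$ and output capsule $j$.
- Compute the coupling coefficients for this step: $c_i = \mathrm{softmax}(b_i)$. Passing $b$ through softmax guarantees that the components of $c_i$ sum to 1.
- Compute the weighted sum for this step: $s_j = \sum_i c_{ij} u_{j|i}$, using the coupling coefficients from the previous step.
- Apply the nonlinearity: $v_j = \mathrm{squash}(s_j)$, which gives the capsule output for this step.
- Update the temporary variable: $b_{ij} = b_{ij} + u_{j|i} \cdot v_j$. If the output $v_j$ points in a direction similar to the prediction $u_{j|i}$, the dot product is positive, so $b_{ij}$ grows and the coupling weight increases; if they point in opposite directions, $b_{ij}$ shrinks and the coupling weight decreases.
- If the specified number of iterations has been reached, output $v_j$; otherwise go back to step 2.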
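- For an output capsule with $T_c = 1$, the loss is 0 when the length of the output vector is greater than $m^+$, and nonzero otherwise.
- For an output capsule with $T_c = 0$, the loss is 0 when the length of the output vector is less than $m^-$, and nonzero otherwise.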
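- The first layer is an ordinary convolutional layer with 9×9 kernels and 256 output channels; its output has size 20×20×256.
- The second layer is a convolutional layer made of 32 parallel convolutions, each of which produces one vector of the vector-valued output. Each convolution has shape 9×9×256×8 (256 input channels, 8 output channels). The output is therefore 6×6×32×8: a 6×6 spatial window with 32 output channels, where each entry is a vector with 8 components.
- The third layer is the capsule layer, which behaves like a fully connected layer. Its input is 6×6×32 = 1152 vectors with 8 components each, and its output is 10 vectors with 16 components each. This requires 1152×10 weights, each of which is an 8×16 matrix.
- The final output is 10 vectors with 16 components each; the predicted class is the output vector with the largest length.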
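The sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch: the 28×28 input size and the strides (1 for the first layer, 2 for the second) are assumptions that the text does not state, and `conv_out` is a helper introduced here for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed: 28x28 input, stride 1 in the first layer, stride 2 in the second.
conv1 = conv_out(28, kernel=9, stride=1)         # spatial size after layer 1
primary = conv_out(conv1, kernel=9, stride=2)    # spatial size after layer 2

num_primary_caps = primary * primary * 32        # 8-component input capsules
num_transforms = num_primary_caps * 10           # one 8x16 matrix per (input, output) pair
num_weights = num_transforms * 8 * 16            # scalar parameters in the capsule layer

print(conv1, primary, num_primary_caps, num_transforms, num_weights)
# 20 6 1152 11520 1474560
```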
class PrimaryCaps(nn.Module):
def forward(self, x):
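The convolution here corresponds to the second convolutional layer described above. Note that the output of this part is reshaped directly to [batch size, 1152, 8].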
class DigitCaps(nn.Module):
batch_size = x.size(0)
for iteration in range(num_iterations):
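- Line 1 computes the softmax, applied to the temporary variable b
- Line 5 computes the weighted sum
- Line 6 computes the output for the current iteration
- Lines 9 and 10 update the temporary variable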
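For reference, the routing loop can be sketched as a standalone function. This is a minimal illustration, not the code the bullets above refer to, so its line numbers do not match. The function name `dynamic_routing`, the tensor layout [batch, num_in, num_out, dim_out], and the example sizes are all assumptions made here.

```python
import torch
import torch.nn.functional as F


def squash(s, dim=-1, eps=1e-8):
    """Shrink short vectors toward 0 and long vectors toward unit length."""
    norm = torch.norm(s, p=2, dim=dim, keepdim=True)
    return norm**2 / (1 + norm**2) * s / (norm + eps)


def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: [batch, num_in, num_out, dim_out], the prediction vectors u_{j|i}."""
    batch, num_in, num_out, _ = u_hat.shape
    b = torch.zeros(batch, num_in, num_out, device=u_hat.device)  # logits start at 0
    for _ in range(num_iterations):
        c = F.softmax(b, dim=2)                       # for each i, sums to 1 over j
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)      # weighted sum: [batch, num_out, dim_out]
        v = squash(s)                                 # capsule outputs for this iteration
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)  # agreement u_{j|i} . v_j
    return v


torch.manual_seed(0)
u_hat = torch.randn(2, 1152, 10, 16)   # assumed sizes: 1152 inputs, 10 outputs of length 16
v = dynamic_routing(u_hat)
print(v.shape)   # torch.Size([2, 10, 16])
```

Because of `squash`, every output vector has length below 1, which is why the length can be read as a class score.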
def margin_loss(self, x, labels, size_average=True):
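A standalone sketch of a margin loss consistent with the two rules on $T_c$ is given below. It is not the method body from the original code. The thresholds `m_plus=0.9` and `m_minus=0.1` and the down-weighting factor `lam=0.5` are the values commonly used for capsule networks and are assumptions here, as are the parameter names.

```python
import torch


def margin_loss(lengths, labels, m_plus=0.9, m_minus=0.1, lam=0.5):
    """lengths: [batch, classes] capsule lengths; labels: [batch, classes] one-hot T_c."""
    # T_c = 1: penalize only when the length falls below m_plus.
    present = labels * torch.clamp(m_plus - lengths, min=0.0) ** 2
    # T_c = 0: penalize only when the length rises above m_minus.
    absent = lam * (1.0 - labels) * torch.clamp(lengths - m_minus, min=0.0) ** 2
    return (present + absent).sum(dim=1).mean()


labels = torch.tensor([[1.0, 0.0, 0.0]])
confident = torch.tensor([[0.95, 0.05, 0.02]])   # correct capsule long, others short
wrong = torch.tensor([[0.20, 0.80, 0.02]])       # correct capsule short, another one long

print(margin_loss(confident, labels).item())   # 0.0
print(margin_loss(wrong, labels).item())       # about 0.735
```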
The text draws on Max Pechyonkin's blog, as translated by weakish:
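- Ordinary convolutional layer Conv1: a basic convolutional layer with a fairly large 9x9 receptive field
- Primary capsule layer PrimaryCaps: prepares the input for the capsule layer. The operation is a convolution, and the final output is 3D data of shape [batch, caps_num, caps_length]:
- batch is the batch size
- caps_num is the number of capsules
- caps_length is the length of each capsule (each capsule is a vector with caps_length components)
- Capsule layer DigitCaps: the capsule layer, intended to replace the final fully connected layer; it outputs 10 capsules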
def squash(inputs, axis=-1):
norm = torch.norm(inputs, p=2, dim=axis, keepdim=True)
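: keepdim=True keeps the reduced dimension, so the norm retains the original shape and broadcasts against the input.
scale = norm**2 / (1 + norm**2) / (norm + 1e-8)
: computes the scaling factor, i.e. $\frac{\|s\|^2}{1 + \|s\|^2} \cdot \frac{1}{\|s\|}$; the 1e-8 term guards against division by zero.
return scale * inputs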
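To see what the scaling factor does, the function can be applied to a few vectors of known length. The input values below are made up for illustration.

```python
import torch


def squash(inputs, axis=-1):
    norm = torch.norm(inputs, p=2, dim=axis, keepdim=True)
    scale = norm**2 / (1 + norm**2) / (norm + 1e-8)
    return scale * inputs


vectors = torch.tensor([[0.1, 0.0],      # length 0.1
                        [3.0, 4.0],      # length 5
                        [30.0, 40.0]])   # length 50
squashed = squash(vectors)
lengths = squashed.norm(dim=-1)
print(lengths)
```

The output length is $\|s\|^2 / (1 + \|s\|^2)$, so the three lengths come out at roughly 0.0099, 0.9615 and 0.9996: short vectors collapse toward 0, long vectors approach 1, and the direction is unchanged.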
class PrimaryCapsule(nn.Module):
outputs = self.conv2d(x)
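: applies the convolution to the input; at this point the output has shape [batch, out_channels, p_w, p_h].
outputs = outputs.view(x.size(0), -1, self.dim_caps)
: reshapes the 4D convolution output into the 3D capsule form [batch, caps_num, dim_caps], where caps_num is the number of capsules and is inferred automatically, and dim_caps is the capsule length and must be specified in advance.
return squash(outputs)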
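The reshape can be traced on a dummy tensor. The sizes used here (a [batch, 256, 20, 20] feature map, a 9x9 kernel with stride 2, and 8-component capsules) are assumptions chosen to match the 1152-capsule figure quoted earlier.

```python
import torch
import torch.nn as nn

# Assumed sizes: 256 input and output channels, 9x9 kernel, stride 2.
conv2d = nn.Conv2d(256, 256, kernel_size=9, stride=2, padding=0)
x = torch.rand(4, 256, 20, 20)

outputs = conv2d(x)
print(outputs.shape)    # torch.Size([4, 256, 6, 6])

dim_caps = 8
capsules = outputs.view(x.size(0), -1, dim_caps)   # -1 lets PyTorch infer caps_num
print(capsules.shape)   # torch.Size([4, 1152, 8])
```

Here caps_num is 256 * 6 * 6 / 8 = 1152. Note that `view` only regroups values in memory order; which values end up in the same capsule depends on the tensor layout, which this sketch does not examine.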
def __init__(self, in_num_caps, in_dim_caps, out_num_caps, out_dim_caps, routings=3):
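- in_num_caps: number of input capsules
- in_dim_caps: length (dimension) of each input capsule
- out_num_caps: number of output capsules
- out_dim_caps: length (dimension) of each output capsule
- routings: number of dynamic routing iterations
In addition, a weight tensor is defined with shape [out_num_caps, in_num_caps, out_dim_caps, in_dim_caps], so every input capsule is connected to every output capsule.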
def forward(self, x):
x_hat = torch.squeeze(torch.matmul(self.weight, x[:, None, :, :, None]), dim=-1)
x[:, None, :, :, None]
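: expands the data from [batch, in_num_caps, in_dim_caps] to [batch, 1, in_num_caps, in_dim_caps, 1].
torch.matmul()
: multiplies weight by the expanded input. weight has shape [out_num_caps, in_num_caps, out_dim_caps, in_dim_caps], so the product has shape [batch, out_num_caps, in_num_caps, out_dim_caps, 1].
torch.squeeze()
: removes the trailing dimension of size 1, leaving x_hat with shape [batch, out_num_caps, in_num_caps, out_dim_caps].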
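The broadcasting in this step is easy to get wrong, so here is a small shape check. The sizes (1152 input capsules of length 8, 10 output capsules of length 16, batch of 2) are assumptions for illustration.

```python
import torch

batch, in_num_caps, in_dim_caps = 2, 1152, 8
out_num_caps, out_dim_caps = 10, 16

torch.manual_seed(0)
weight = 0.01 * torch.randn(out_num_caps, in_num_caps, out_dim_caps, in_dim_caps)
x = torch.rand(batch, in_num_caps, in_dim_caps)

x_expanded = x[:, None, :, :, None]          # [2, 1, 1152, 8, 1]
product = torch.matmul(weight, x_expanded)   # [2, 10, 1152, 16, 1]
x_hat = torch.squeeze(product, dim=-1)       # [2, 10, 1152, 16]
print(x_expanded.shape, product.shape, x_hat.shape)

# One (output capsule, input capsule) pair computed by hand for comparison.
manual = weight[3, 7] @ x[0, 7]
print(torch.allclose(x_hat[0, 3, 7], manual, atol=1e-6))   # True
```

The last two dimensions are treated as matrices ([16, 8] times [8, 1]), while the leading dimensions are broadcast, so each input capsule gets its own transformation matrix for each output capsule.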
x_hat_detached = x_hat.detach()
b = Variable(torch.zeros(x.size(0), self.out_num_caps, self.in_num_caps)).cuda()
- The first part is the softmax, implemented with
c = F.softmax(b, dim=1)
This step does not change the shape of b.
- The second part computes the routing result:
outputs = squash(torch.sum(c[:, :, :, None] * x_hat, dim=-2, keepdim=True))
c[:, :, :, None]
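: expands the dimensions of c so that it broadcasts in the element-wise product.
torch.sum(c[:, :, :, None] * x_hat, dim=-2, keepdim=True)
: multiplies each prediction vector by its coupling coefficient and sums over the second-to-last dimension, which gives $s_j$ from the algorithm. The result has shape [batch, out_num_caps, 1, out_dim_caps] and is then passed through the squash activation.
- The third part updates the routing weights b: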
b = b + torch.sum(outputs * x_hat_detached, dim=-1)
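The two tensors multiplied element-wise, outputs and x_hat_detached, have shapes [batch, out_num_caps, 1, out_dim_caps] and [batch, out_num_caps, in_num_caps, out_dim_caps]. The second-to-last dimension is broadcast, and summing over the last dimension gives a result of shape [batch, out_num_caps, in_num_caps].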
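With this tensor layout, `dim=1` in the softmax is the output-capsule axis, so each input capsule distributes a total weight of 1 across the output capsules. A quick check, with random values and assumed sizes, and with the squash step left out for brevity:

```python
import torch
import torch.nn.functional as F

batch, out_num_caps, in_num_caps, out_dim_caps = 2, 10, 1152, 16
torch.manual_seed(0)
b = torch.randn(batch, out_num_caps, in_num_caps)
x_hat = torch.randn(batch, out_num_caps, in_num_caps, out_dim_caps)

c = F.softmax(b, dim=1)   # normalizes over output capsules
sums_to_one = torch.allclose(c.sum(dim=1), torch.ones(batch, in_num_caps))
print(sums_to_one)        # True

# Weighted sum before squash, then the agreement term used to update b.
s = torch.sum(c[:, :, :, None] * x_hat, dim=-2, keepdim=True)   # [2, 10, 1, 16]
agreement = torch.sum(s * x_hat, dim=-1)                        # [2, 10, 1152]
print(s.shape, agreement.shape)
```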
class CapsuleNet(nn.Module):
x = self.relu(self.conv1(x))
def caps_loss(y_true, y_pred, x, x_recon, lam_recon):
1. Assume a batch of text with batch_size=32, 1 channel, 40 characters per sample, and embedding_dim=200 for each character.
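Before reading the full listing, it may help to trace how this input shrinks to capsules. The sketch below uses the kernel sizes that appear in the listing ((4, 200) for the first convolution, (37, 1) with stride 2 for the primary capsule convolution); the variable names are chosen here for illustration.

```python
import torch
import torch.nn as nn

x = torch.rand(32, 1, 40, 200)   # [batch, channels, characters, embedding_dim]
conv1 = nn.Conv2d(1, 256, kernel_size=(4, 200), stride=1)
primary_conv = nn.Conv2d(256, 256, kernel_size=(37, 1), stride=2)

h = torch.relu(conv1(x))
print(h.shape)      # torch.Size([32, 256, 37, 1])

p = primary_conv(h)
print(p.shape)      # torch.Size([32, 256, 1, 1])

caps = p.view(x.size(0), -1, 8)   # 256 values regrouped into capsules of length 8
print(caps.shape)   # torch.Size([32, 32, 8])
```

This is why the capsule layer in the listing is built with in_num_caps=32 and in_dim_caps=8.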
import torch
import torch.nn as nn
import torch.nn.functional as F


def squash(inputs, axis=-1):
    """
    The non-linear activation used in Capsule. It drives the length of a large vector to near 1
    and small vector to 0.
    :param inputs: vectors to be squashed
    :param axis: the axis to squash
    :return: a Tensor with same size as inputs
    """
    norm = torch.norm(inputs, p=2, dim=axis, keepdim=True)
    scale = norm**2 / (1 + norm**2) / (norm + 1e-8)
    return scale * inputs


class DenseCapsule(nn.Module):
    """
    The dense capsule layer. It is similar to Dense (FC) layer. Dense layer has `in_num` inputs,
    each is a scalar, the output of the neuron from the former layer, and it has `out_num` output
    neurons. DenseCapsule just expands the output of the neuron from scalar to vector. So its
    input size = [None, in_num_caps, in_dim_caps] and output size = [None, out_num_caps, out_dim_caps].
    For Dense Layer, in_dim_caps = out_dim_caps = 1.

    :param in_num_caps: number of capsules inputted to this layer
    :param in_dim_caps: dimension of input capsules
    :param out_num_caps: number of capsules outputted from this layer
    :param out_dim_caps: dimension of output capsules
    :param routings: number of iterations for the routing algorithm
    """
    def __init__(self, in_num_caps, in_dim_caps, out_num_caps, out_dim_caps, routings=3):
        super(DenseCapsule, self).__init__()
        self.in_num_caps = in_num_caps
        self.in_dim_caps = in_dim_caps
        self.out_num_caps = out_num_caps
        self.out_dim_caps = out_dim_caps
        self.routings = routings
        self.weight = nn.Parameter(0.01 * torch.randn(out_num_caps, in_num_caps, out_dim_caps, in_dim_caps))

    def forward(self, x):
        # x.size = [batch, in_num_caps, in_dim_caps], here [batch, 32, 8]
        # expanded to [batch, 1, in_num_caps, in_dim_caps, 1]
        # weight.size = [out_num_caps, in_num_caps, out_dim_caps, in_dim_caps], here [203, 32, 16, 8]
        # torch.matmul: [out_dim_caps, in_dim_caps] x [in_dim_caps, 1] -> [out_dim_caps, 1]
        # => x_hat.size = [batch, out_num_caps, in_num_caps, out_dim_caps]
        x_hat = torch.squeeze(torch.matmul(self.weight, x[:, None, :, :, None]), dim=-1)

        # In forward pass, `x_hat_detached` = `x_hat`;
        # In backward, no gradient can flow from `x_hat_detached` back to `x_hat`.
        x_hat_detached = x_hat.detach()

        # The prior for coupling coefficient, initialized as zeros on the same device as x.
        # b.size = [batch, out_num_caps, in_num_caps]
        b = torch.zeros(x.size(0), self.out_num_caps, self.in_num_caps, device=x.device)

        assert self.routings > 0, 'The \'routings\' should be > 0.'
        for i in range(self.routings):
            # c.size = [batch, out_num_caps, in_num_caps]
            c = F.softmax(b, dim=1)

            # At last iteration, use `x_hat` to compute `outputs` in order to backpropagate gradient
            if i == self.routings - 1:
                # c.size expanded to [batch, out_num_caps, in_num_caps, 1]
                # x_hat.size = [batch, out_num_caps, in_num_caps, out_dim_caps]
                # => outputs.size = [batch, out_num_caps, 1, out_dim_caps]
                outputs = squash(torch.sum(c[:, :, :, None] * x_hat, dim=-2, keepdim=True))
                # outputs = squash(torch.matmul(c[:, :, None, :], x_hat))  # alternative way
            else:
                # Otherwise, use `x_hat_detached` to update `b`. No gradients flow on this path.
                outputs = squash(torch.sum(c[:, :, :, None] * x_hat_detached, dim=-2, keepdim=True))
                # outputs = squash(torch.matmul(c[:, :, None, :], x_hat_detached))  # alternative way

                # outputs.size = [batch, out_num_caps, 1, out_dim_caps]
                # x_hat_detached.size = [batch, out_num_caps, in_num_caps, out_dim_caps]
                # => b.size = [batch, out_num_caps, in_num_caps]
                b = b + torch.sum(outputs * x_hat_detached, dim=-1)

        return torch.squeeze(outputs, dim=-2)


class PrimaryCapsule(nn.Module):
    """
    Apply Conv2D with `out_channels` and then reshape to get capsules
    :param in_channels: input channels
    :param out_channels: output channels
    :param dim_caps: dimension of capsule
    :param kernel_size: kernel size
    :return: output tensor, size=[batch, num_caps, dim_caps]
    """
    def __init__(self, in_channels, out_channels, dim_caps, kernel_size, stride=1, padding=0):
        super(PrimaryCapsule, self).__init__()
        self.dim_caps = dim_caps
        self.conv2d = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding)

    def forward(self, x):
        # x.size = [batch, 256, 37, 1] for the text input described above
        outputs = self.conv2d(x)
        outputs = outputs.view(x.size(0), -1, self.dim_caps)
        return squash(outputs)


class CapsuleNet(nn.Module):
    """
    A Capsule Network. The docstring of the original listing describes it as a network on MNIST;
    the kernel sizes below match the text input described above.
    :param input_size: data size = [channels, width, height]
    :param classes: number of classes
    :param routings: number of routing iterations
    Shape:
        - Input: (batch, channels, width, height), optional (batch, classes).
        - Output: ((batch, classes), (batch, channels, width, height))
    """
    def __init__(self, input_size, classes, routings):
        super(CapsuleNet, self).__init__()
        self.input_size = input_size
        self.classes = classes
        self.routings = routings

        # Layer 1: Just a conventional Conv2D layer
        self.conv1 = nn.Conv2d(input_size[0], 256, kernel_size=(4, 200), stride=1, padding=0)

        # Layer 2: Conv2D layer with `squash` activation, then reshape to [None, num_caps, dim_caps]
        self.primarycaps = PrimaryCapsule(256, 256, 8, kernel_size=(37, 1), stride=2, padding=0)

        # Layer 3: Capsule layer. Routing algorithm works here.
        self.digitcaps = DenseCapsule(in_num_caps=32, in_dim_caps=8,
                                      out_num_caps=classes, out_dim_caps=16, routings=routings)

        # Decoder network.
        self.decoder = nn.Sequential(
            nn.Linear(16*classes, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, input_size[0] * input_size[1] * input_size[2]),
            nn.Sigmoid()
        )

        self.relu = nn.ReLU()

    def forward(self, x, y=None):
        x = self.relu(self.conv1(x))
        x = self.primarycaps(x)
        x = self.digitcaps(x)
        length = x.norm(dim=-1)
        if y is None:  # during testing, no label given. create one-hot coding using `length`
            index = length.max(dim=1)[1]
            y = torch.zeros_like(length).scatter_(1, index.view(-1, 1), 1.)
        reconstruction = self.decoder((x * y[:, :, None]).view(x.size(0), -1))
        return length, reconstruction.view(-1, *self.input_size)


if __name__ == '__main__':
    x = torch.rand([16, 1, 40, 200])
    m = CapsuleNet([1, 40, 200], 203, 3)
    y_pred, x_recon = m(x)
    print(y_pred.shape)  # torch.Size([16, 203])
