SiamRPN code details

1. Training

1.1 Data preprocessing

Given each video frame and its corresponding XML annotation file (similar to object detection: a box and name for every object), each object is cropped out of the frame and stored as a sub-image. The cropping procedure is as follows:

  • Centered on the object's center point, a square region of side length s_x is cropped and resized to (instance_size, instance_size);
  • instance_size defaults to 255; gluoncv also sets it to 511

The cropping code is as follows:

# image: a frame from the original video; bbox is the box of an object in that frame; x is the cropped sub-image, which is saved as an image file
def crop_like_SiamFC(image, bbox, context_amount=0.5, exemplar_size=127, instance_size=255, padding=(0, 0, 0)):
    """
    Dataset curation and avoid image resizing during training
    if the tight bounding box has size (w, h) and the context margin is p,
    then the scale factor s is chosen such that the area of the scaled rectangle is equal to a constant
    s(w+2p)×s(h+2p)=A.

    Parameters
    ----------
    image: np.array, image
    bbox: list or np.array, bbox
    context_amount: float, the amount of context to be half of the mean dimension
    exemplar_size: int, exemplar_size
    instance_size: int, instance_size

    Return:
        crop result exemplar z and instance x
    """
    target_pos = [(bbox[2]+bbox[0])/2., (bbox[3]+bbox[1])/2.]
    target_size = [bbox[2]-bbox[0], bbox[3]-bbox[1]]
    wc_z = target_size[1] + context_amount * sum(target_size)
    hc_z = target_size[0] + context_amount * sum(target_size)
    s_z = np.sqrt(wc_z * hc_z)
    scale_z = exemplar_size / s_z
    d_search = (instance_size - exemplar_size) / 2
    pad = d_search / scale_z
    s_x = s_z + 2 * pad

    z = crop_hwc(image, pos_s_2_bbox(target_pos, s_z), exemplar_size, padding)
    x = crop_hwc(image, pos_s_2_bbox(target_pos, s_x), instance_size, padding)
    return z, x
    
def pos_s_2_bbox(pos, s):
    """
    from center_x,center_y,s to get bbox

    Parameters
    ----------
    pos : list, bbox center position (cx, cy)
    s : float, side length of the square bbox

    Return:
        [x_min,y_min,x_max,y_max]
    """
    return [pos[0]-s/2, pos[1]-s/2, pos[0]+s/2, pos[1]+s/2]

def crop_hwc(image, bbox, out_sz, padding=(0, 0, 0)):
    """
    crop image

    Parameters
    ----------
    image: np.array, image
    bbox: np or list, bbox coordinate [xmin,ymin,xmax,ymax]
    out_sz: int , crop image size

    Return:
        crop result
    """
    a = (out_sz-1) / (bbox[2]-bbox[0])
    b = (out_sz-1) / (bbox[3]-bbox[1])
    c = -a * bbox[0]
    d = -b * bbox[1]
    mapping = np.array([[a, 0, c],
                        [0, b, d]]).astype(float)
    cv2 = try_import_cv2()
    crop = cv2.warpAffine(image, mapping, (out_sz, out_sz), borderMode=cv2.BORDER_CONSTANT, borderValue=padding)
    return crop
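
For intuition, here is a small worked example of the crop-size arithmetic above (a minimal sketch; the 100x80 box is made up, not from the original data):

# worked example of s_z / s_x with exemplar_size=127, instance_size=255 (illustrative box)
import numpy as np

w, h = 100., 80.                                   # tight bbox width and height
context_amount, exemplar_size, instance_size = 0.5, 127, 255
wc_z = h + context_amount * (w + h)                # 80 + 90 = 170
hc_z = w + context_amount * (w + h)                # 100 + 90 = 190
s_z = np.sqrt(wc_z * hc_z)                         # ~179.7, exemplar crop side in image pixels
scale_z = exemplar_size / s_z                      # ~0.707
d_search = (instance_size - exemplar_size) / 2     # 64
pad = d_search / scale_z                           # ~90.6
s_x = s_z + 2 * pad                                # ~360.9, instance crop side in image pixels
print(round(s_z, 1), round(s_x, 1))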

1.2 Data loading and augmentation

  • With the preprocessing above, if an object x appears in frames 1-100 of a video, roughly 100 sub-images of x are cropped: image_x_0.jpg, image_x_1.jpg, ......, image_x_100.jpg
  • For object x, suppose image_x_20.jpg (frame 20) is loaded as the template_image; a frame is then picked at random from within 10 frames before or after frame 20 as the search_image
  • The template_image is augmented, mainly with random shift, scale, color jitter, blur, and flip, and finally resized to (127, 127)
  • The search_image is augmented the same way, mainly with random shift, scale, color jitter, blur, and flip, and finally resized to (255, 255)
  • The gt_label is computed from the search_image bbox and consists of cls, delta, delta_weight: cls marks each anchor as a positive, negative, or ignored sample; delta has 4 dimensions, the offsets dx, dy of the bbox center relative to the anchor center plus the bbox width and height w, h; delta_weight is the weight of positive samples, with value 1/num_pos. Two points are worth noting:
    • The anchors mapped onto the search_image only cover its central 128x128 region, so the top-left anchor center is placed at (63, 63) and the bottom-right one at (63+128, 63+128)
    • To balance the ratio of positives to negatives, 64 samples are selected by the IoU between anchors and the gt_box for the RPN to learn from (the IoU of every anchor with the gt_box is computed; anchors with IoU > 0.6 are positives, those with IoU < 0.3 are negatives)

(The search_image is 255x255; the central region [63, 63, 63+128, 63+128], after 8x downsampling, corresponds to the 17x17 feature map. A small sketch of this anchor-grid arithmetic follows.)
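
A minimal sketch of the anchor-grid layout (assumed values: search size 255, stride 8, 17x17 output; not from the original post):

import numpy as np

search_size, stride, output_size = 255, 8, 17
ori = search_size // 2 - (output_size // 2) * stride   # 127 - 64 = 63, first anchor center
centers = ori + stride * np.arange(output_size)        # 63, 71, ..., 191
print(centers[0], centers[-1])                         # 63 and 63 + 128 = 191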

Code for the data augmentation part:

# data augmentation
self.template_aug = SiamRPNaugmentation(self.template_shift,
                                                self.template_scale,
                                                self.template_blur,
                                                self.template_flip,
                                                self.template_color)
self.search_aug = SiamRPNaugmentation(self.search_shift,
                                              self.search_scale,
                                              self.search_blur,
                                              self.search_flip,
                                             self.search_color)
# augmentation
template, _ = self.template_aug(template_image,
                                        template_box,
                                        self.train_exemplar_size,  # defaults to 127
                                        gray=gray)
search, bbox = self.search_aug(search_image,
                                       search_box,
                                       self.train_search_size, # defaults to 255
                                       gray=gray)
class SiamRPNaugmentation:
    """dataset Augmentation for SiamRPN tracking.

    Parameters
    ----------
    shift : int
        length of template augmentation shift
    scale : float
        template augmentation scale ratio
    blur : float
        template augmentation blur ratio
    flip : float
        template augmentation flip ratio
    color : float
        template augmentation color ratio
    """
    def __init__(self, shift, scale, blur, flip, color):
        self.shift = shift
        self.scale = scale
        self.blur = blur
        self.flip = flip
        self.color = color
        self.rgbVar = np.array([[-0.55919361, 0.98062831, - 0.41940627],
                                [1.72091413, 0.19879334, - 1.82968581],
                                [4.64467907, 4.73710203, 4.88324118]], dtype=np.float32)
        self.cv2 = try_import_cv2()

    @staticmethod
    def random():
        return np.random.random() * 2 - 1.0

    def _crop_roi(self, image, bbox, out_sz, padding=(0, 0, 0)):
        """crop image roi size.

        Parameters
        ----------
        image : np.array
            image
        bbox : list or np.array
            bbox coordinate,like (xmin,ymin,xmax,ymax)
        out_sz : int
            size after crop

        Return:
            image after crop
        """
        bbox = [float(x) for x in bbox]
        a = (out_sz-1) / (bbox[2]-bbox[0])
        b = (out_sz-1) / (bbox[3]-bbox[1])
        c = -a * bbox[0]
        d = -b * bbox[1]
        mapping = np.array([[a, 0, c],
                            [0, b, d]]).astype(float)
        crop = self.cv2.warpAffine(image, mapping, (out_sz, out_sz),
                                   borderMode=self.cv2.BORDER_CONSTANT,
                                   borderValue=padding)
        return crop

    def _blur_aug(self, image):
        """blur filter to smooth image

        Parameters
        ----------
        image : np.array
            image

        Return:
            image after blur
        """
        def rand_kernel():
            sizes = np.arange(5, 46, 2)
            size = np.random.choice(sizes)
            kernel = np.zeros((size, size))
            c = int(size/2)
            wx = np.random.random()
            kernel[:, c] += 1. / size * wx
            kernel[c, :] += 1. / size * (1-wx)
            return kernel
        kernel = rand_kernel()
        image = self.cv2.filter2D(image, -1, kernel)
        return image

    def _color_aug(self, image):
        """Random increase of image channel

        Parameters
        ----------
        image : np.array
            image

        Return:
            image after Random increase of channel
        """
        offset = np.dot(self.rgbVar, np.random.randn(3, 1))
        offset = offset[::-1]  # bgr 2 rgb
        offset = offset.reshape(3)
        image = image - offset
        return image

    def _gray_aug(self, image):
        """image Grayscale

        Parameters
        ----------
        image : np.array
            image

        Return:
            image after Grayscale
        """
        grayed = self.cv2.cvtColor(image, self.cv2.COLOR_BGR2GRAY)
        image = self.cv2.cvtColor(grayed, self.cv2.COLOR_GRAY2BGR)
        return image

    def _shift_scale_aug(self, image, bbox, crop_bbox, size):
        """shift scale augmentation

        Parameters
        ----------
        image : np.array
            image
        bbox : list or np.array
            bbox
        crop_bbox :
            crop size image from center

        Return
            image ,bbox after shift and scale
        """
        im_h, im_w = image.shape[:2]

        # adjust crop bounding box
        crop_bbox_center = corner2center(crop_bbox)

        if self.scale:
            scale_x = (1.0 + SiamRPNaugmentation.random() * self.scale)
            scale_y = (1.0 + SiamRPNaugmentation.random() * self.scale)
            h, w = crop_bbox_center.h, crop_bbox_center.w
            scale_x = min(scale_x, float(im_w) / w)
            scale_y = min(scale_y, float(im_h) / h)
            crop_bbox_center = Center(crop_bbox_center.x,
                                      crop_bbox_center.y,
                                      crop_bbox_center.w * scale_x,
                                      crop_bbox_center.h * scale_y)

        crop_bbox = center2corner(crop_bbox_center)
        if self.shift:
            sx = SiamRPNaugmentation.random() * self.shift
            sy = SiamRPNaugmentation.random() * self.shift

            x1, y1, x2, y2 = crop_bbox

            sx = max(-x1, min(im_w - 1 - x2, sx))
            sy = max(-y1, min(im_h - 1 - y2, sy))

            crop_bbox = Corner(x1 + sx, y1 + sy, x2 + sx, y2 + sy)

        # adjust target bounding box
        x1, y1 = crop_bbox.x1, crop_bbox.y1
        bbox = Corner(bbox.x1 - x1, bbox.y1 - y1,
                      bbox.x2 - x1, bbox.y2 - y1)

        if self.scale:
            bbox = Corner(bbox.x1 / scale_x, bbox.y1 / scale_y,
                          bbox.x2 / scale_x, bbox.y2 / scale_y)

        image = self._crop_roi(image, crop_bbox, size)
        return image, bbox

    def _flip_aug(self, image, bbox):
        """flip augmentation

        Parameters
        ----------
        image : np.array
            image
        bbox : list or np.array
            bbox

        Return
            image and bbox after flip
        """
        image = self.cv2.flip(image, 1)
        width = image.shape[1]
        bbox = Corner(width - 1 - bbox.x2, bbox.y1,
                      width - 1 - bbox.x1, bbox.y2)
        return image, bbox

    def __call__(self, image, bbox, size, gray=False):
        shape = image.shape
        crop_bbox = center2corner(Center(shape[0]//2, shape[1]//2,
                                         size-1, size-1))
        # gray augmentation
        if gray:
            image = self._gray_aug(image)

        # shift scale augmentation
        image, bbox = self._shift_scale_aug(image, bbox, crop_bbox, size)

        # color augmentation
        if self.color > np.random.random():
            image = self._color_aug(image)

        # blur augmentation
        if self.blur > np.random.random():
            image = self._blur_aug(image)

        # flip augmentation
        if self.flip and self.flip > np.random.random():
            image, bbox = self._flip_aug(image, bbox)
        return image, bbox

Code for the AnchorTarget part:

# create anchor target
self.anchor_target = AnchorTarget(anchor_stride=self.anchor_stride,
                                          anchor_ratios=self.anchor_ratios,
                                          train_search_size=self.train_search_size,
                                          train_output_size=self.train_output_size,
                                          train_thr_high=self.train_thr_high,
                                          train_thr_low=self.train_thr_low,
                                          train_pos_num=self.train_pos_num,
                                          train_neg_num=self.train_neg_num,
                                          train_total_num=self.train_total_num)
# get labels
cls, delta, delta_weight, _ = self.anchor_target(bbox, self.train_output_size, neg)
template = template.transpose((2, 0, 1)).astype(np.float32)
search = search.transpose((2, 0, 1)).astype(np.float32)

return template, search, cls, delta, delta_weight, np.array(bbox)
class AnchorTarget:
    def __init__(self, anchor_stride, anchor_ratios, train_search_size, train_output_size,
                 train_thr_high=0.6, train_thr_low=0.3, train_pos_num=16, train_neg_num=16,
                 train_total_num=64):
        """create anchor target

        Parameters
        ----------
        anchor_stride : int
            anchor stride
        anchor_ratios : tuple
            anchor ratios
        train_search_size : int
            train search size
        train_output_size : int
            train output size
        train_thr_high : float
            Positive anchor threshold
        train_thr_low : float
            Negative anchor threshold
        train_pos_num : int
            Number of Positive
        train_neg_num : int
            Number of Negative
        train_total_num : int
            total number
        """
        self.anchor_stride = anchor_stride
        self.anchor_ratios = anchor_ratios
        self.anchor_scales = [8]
        self.anchors = Anchors(self.anchor_stride,
                               self.anchor_ratios,
                               self.anchor_scales)
        self.train_search_size = train_search_size
        self.train_output_size = train_output_size

        self.anchors.generate_all_anchors(im_c=self.train_search_size//2,
                                          size=self.train_output_size)
        self.train_thr_high = train_thr_high
        self.train_thr_low = train_thr_low
        self.train_pos_num = train_pos_num
        self.train_neg_num = train_neg_num
        self.train_total_num = train_total_num

    def __call__(self, target, size, neg=False):
        anchor_num = len(self.anchor_ratios) * len(self.anchor_scales)

        # -1 ignore 0 negative 1 positive
        cls = -1 * np.ones((anchor_num, size, size), dtype=np.int64)
        delta = np.zeros((4, anchor_num, size, size), dtype=np.float32)
        delta_weight = np.zeros((anchor_num, size, size), dtype=np.float32)

        def select(position, keep_num=16):
            num = position[0].shape[0]
            if num <= keep_num:
                return position, num
            slt = np.arange(num)
            np.random.shuffle(slt)
            slt = slt[:keep_num]
            return tuple(p[slt] for p in position), keep_num

        tcx, tcy, tw, th = corner2center(target)

        if neg:

            cx = size // 2
            cy = size // 2
            cx += int(np.ceil((tcx - self.train_search_size // 2) /
                              self.anchor_stride + 0.5))
            cy += int(np.ceil((tcy - self.train_search_size // 2) /
                              self.anchor_stride + 0.5))
            l = max(0, cx - 3)
            r = min(size, cx + 4)
            u = max(0, cy - 3)
            d = min(size, cy + 4)
            cls[:, u:d, l:r] = 0

            neg, _ = select(np.where(cls == 0), self.train_neg_num)
            cls[:] = -1
            cls[neg] = 0

            overlap = np.zeros((anchor_num, size, size), dtype=np.float32)
            return cls, delta, delta_weight, overlap

        anchor_box = self.anchors.all_anchors[0]
        anchor_center = self.anchors.all_anchors[1]

        x1, y1, x2, y2 = anchor_box[0], anchor_box[1], \
            anchor_box[2], anchor_box[3]
        cx, cy, w, h = anchor_center[0], anchor_center[1], \
            anchor_center[2], anchor_center[3]

        delta[0] = (tcx - cx) / w
        delta[1] = (tcy - cy) / h
        delta[2] = np.log(tw / w)
        delta[3] = np.log(th / h)

        target = np.array([target[0], target[1], target[2], target[3]]).reshape(1, -1)
        bbox = np.array([x1, y1, x2, y2]).reshape(4, -1).T
        overlap = bbox_iou(bbox, target)
        overlap = overlap.reshape(-1, self.train_output_size, self.train_output_size)
        pos = np.where(overlap > self.train_thr_high)
        neg = np.where(overlap < self.train_thr_low)
        pos, pos_num = select(pos, self.train_pos_num)
        neg, _ = select(neg, self.train_total_num - self.train_pos_num)

        cls[pos] = 1
        delta_weight[pos] = 1. / (pos_num + 1e-6)

        cls[neg] = 0
        return cls, delta, delta_weight, overlap
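
A hedged usage sketch of the AnchorTarget class above (the ratios and box are illustrative assumptions; Anchors, corner2center and bbox_iou from gluoncv are expected to be in scope):

# illustrative call; shapes are for 5 anchors per cell on a 17x17 grid
anchor_target = AnchorTarget(anchor_stride=8,
                             anchor_ratios=(0.33, 0.5, 1, 2, 3),
                             train_search_size=255,
                             train_output_size=17)
gt_box = [100., 90., 160., 140.]                 # (xmin, ymin, xmax, ymax) on the search image
cls, delta, delta_weight, overlap = anchor_target(gt_box, 17)
# cls:          (5, 17, 17), values in {-1 ignore, 0 negative, 1 positive}
# delta:        (4, 5, 17, 17), encoded (dx, dy, log w-ratio, log h-ratio)
# delta_weight: (5, 17, 17), 1/num_pos at positive anchors, 0 elsewhere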

1.3 Training procedure

The training procedure is roughly as follows:

  • Load a pair of images template_img and search_img from the training data above, together with the cls_label, delta, and delta_weight that correspond to search_img

  • Feed template_img and search_img into the network, which outputs pred_cls and pred_loc; pred_cls is the confidence score of each anchor containing the target, while pred_loc is the target's center offset relative to the anchor plus its width and height

  • Compute the total loss from the ground truth (cls_label, delta, delta_weight) and the predictions (pred_cls, pred_loc): cross entropy for classification and a weighted L1 loss for localization, combined as shown in the formula below
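
In formula form, with the weights corresponding to opt.cls_weight and opt.loc_weight in the training code below:

\[L_{total} = \lambda_{cls}\,L_{cls} + \lambda_{loc}\,L_{loc}\]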

Training code:

        for i, data in enumerate(train_loader):
            template, search, label_cls, label_loc, label_loc_weight = train_batch_fn(data, opt)
            cls_losses = []
            loc_losses = []
            total_losses = []
            with autograd.record():
                for j in range(len(opt.ctx)):
                    cls, loc = net(template[j], search[j])
                    label_cls_temp = label_cls[j].reshape(-1).asnumpy()
                    pos_index = np.argwhere(label_cls_temp == 1).reshape(-1)
                    neg_index = np.argwhere(label_cls_temp == 0).reshape(-1)
                    if len(pos_index):
                        pos_index = nd.array(pos_index, ctx=opt.ctx[j])
                    else:
                        pos_index = nd.array(np.array([]), ctx=opt.ctx[j])
                    if len(neg_index):
                        neg_index = nd.array(neg_index, ctx=opt.ctx[j])
                    else:
                        neg_index = nd.array(np.array([]), ctx=opt.ctx[j])
                    cls_loss, loc_loss = criterion(cls, loc, label_cls[j], pos_index, neg_index,
                                                   label_loc[j], label_loc_weight[j])
                    total_loss = opt.cls_weight*cls_loss+opt.loc_weight*loc_loss
                    cls_losses.append(cls_loss)
                    loc_losses.append(loc_loss)
                    total_losses.append(total_loss)

                mx.nd.waitall()
                if opt.use_amp:
                    with amp.scale_loss(total_losses, trainer) as scaled_loss:
                        autograd.backward(scaled_loss)
                else:
                    autograd.backward(total_losses)
            trainer.step(batch_size)
            loss_total_val += sum([l.mean().asscalar() for l in total_losses]) / len(total_losses)
            loss_loc_val += sum([l.mean().asscalar() for l in loc_losses]) / len(loc_losses)
            loss_cls_val += sum([l.mean().asscalar() for l in cls_losses]) / len(cls_losses)
            if i%(opt.log_interval) == 0:
                logger.info('Epoch %d iteration %04d/%04d: loc loss %.3f, cls loss %.3f, \
                             training loss %.3f, batch time %.3f'% \
                            (epoch, i, len(train_loader), loss_loc_val/(i+1), loss_cls_val/(i+1),
                             loss_total_val/(i+1), time.time()-batch_time))
                batch_time = time.time()
            mx.nd.waitall()
        # save every epoch
        if opt.no_val:
            save_checkpoint(net, opt, epoch, False)

Loss function code:

class SiamRPNLoss(gluon.HybridBlock):
    r"""Weighted l1 loss and cross entropy loss for SiamRPN training

    Parameters
    ----------
    batch_size : int, default 128
        training batch size per device (CPU/GPU).

    """
    def __init__(self, batch_size=128, **kwargs):
        super(SiamRPNLoss, self).__init__(**kwargs)
        self.conf_loss = gluon.loss.SoftmaxCrossEntropyLoss()
        self.h = 17
        self.w = 17
        self.b = batch_size
        self.loc_c = 10
        self.cls_c = 5

    def weight_l1_loss(self, F, pred_loc, label_loc, loss_weight):
        """Compute weight_l1_loss"""
        pred_loc = pred_loc.reshape((self.b, 4, -1, self.h, self.w))
        diff = F.abs((pred_loc - label_loc))
        diff = F.sum(diff, axis=1).reshape((self.b, -1, self.h, self.w))
        loss = diff * loss_weight
        return F.sum(loss)/self.b

    def get_cls_loss(self, F, pred, label, select):
        """Compute SoftmaxCrossEntropyLoss"""
        if len(select) == 0:
            return 0
        pred = F.gather_nd(pred, select.reshape(1, -1))
        label = F.gather_nd(label.reshape(-1, 1), select.reshape(1, -1)).reshape(-1)
        return self.conf_loss(pred, label).mean()

    def cross_entropy_loss(self, F, pred, label, pos_index, neg_index):
        """Compute cross_entropy_loss"""
        pred = pred.reshape(self.b, 2, self.loc_c//2, self.h, self.h)
        pred = F.transpose(pred, axes=((0, 2, 3, 4, 1)))
        pred = pred.reshape(-1, 2)
        label = label.reshape(-1)
        loss_pos = self.get_cls_loss(F, pred, label, pos_index)
        loss_neg = self.get_cls_loss(F, pred, label, neg_index)
        return loss_pos * 0.5 + loss_neg * 0.5

    def hybrid_forward(self, F, cls_pred, loc_pred, label_cls, pos_index, neg_index,
                       label_loc, label_loc_weight):
        """Compute loss"""
        loc_loss = self.weight_l1_loss(F, loc_pred, label_loc, label_loc_weight)
        cls_loss = self.cross_entropy_loss(F, cls_pred, label_cls, pos_index, neg_index)
        return cls_loss, loc_loss
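
Written out (a transcription of the code above, not from the original post; B is the batch size and the sums run over the 5x17x17 anchor grid):

\[L_{loc} = \frac{1}{B}\sum_{b=1}^{B}\sum_{a,i,j} w_{b,a,i,j}\sum_{k=1}^{4}\bigl|\hat{\delta}_{k,b,a,i,j}-\delta_{k,b,a,i,j}\bigr|, \qquad L_{cls} = \tfrac{1}{2}\,\mathrm{CE}_{pos} + \tfrac{1}{2}\,\mathrm{CE}_{neg}\]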

2. Tracking procedure at test time

The tracking procedure can be roughly divided into the following steps (a pseudo-code sketch follows the list):

1. From the first frame, crop a 127*127 region centered on the tracked target to use as the template
2. In each subsequent frame, crop a 255*255 region centered on the target's position in the previous frame to use as the search region
3. Feed template and search into the SiamRPN network to predict the target's box and score
4. Post-process the score twice: first with a scale penalty and a ratio penalty; second with a window penalty, i.e. a window function (Hanning, cosine, etc.) that suppresses scores in regions far from the center
5. Take the center of the highest-scoring box as the new center, and take a smoothed weighted average of the previous frame's width/height and the box's width/height as the new width/height
6. Use the new center and width/height as the box for the current frame
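
A rough pseudo-code sketch of this loop (the tracker/method names here are illustrative placeholders, not the exact gluoncv API):

# step 1: build the 127x127 template from the first frame and its ground-truth box
tracker.init(first_frame, first_bbox)
for frame in frames[1:]:
    # step 2: crop a 255x255 search region around the previous center
    x_crop = tracker.get_search_region(frame)
    # step 3: raw box regressions and scores from the SiamRPN network
    outputs = tracker.model.track(x_crop)
    # steps 4-6: scale/ratio penalty, window penalty, pick the best box, smooth and clip
    bbox, best_score = tracker.postprocess(outputs)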

A detailed explanation follows:

2.1 Cropping the template and search regions

The template size is set to (127, 127) and the search size to (255, 255); both can be adjusted. (In gluoncv the search size is set to (287, 287)?) The test-time crop-size arithmetic is sketched below.
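
A minimal sketch of the test-time crop sizes (assumed values; self.size is the previous frame's target size, and 255 may be 287 in gluoncv as noted above):

import numpy as np

EXEMPLAR_SIZE, INSTANCE_SIZE = 127, 255
w, h = 100., 80.                                    # previous-frame target size (illustrative)
w_z = w + 0.5 * (w + h)
h_z = h + 0.5 * (w + h)
s_z = round(np.sqrt(w_z * h_z))                     # template crop side in image pixels
scale_z = EXEMPLAR_SIZE / s_z                       # image-to-crop scale factor
s_x = round(s_z * (INSTANCE_SIZE / EXEMPLAR_SIZE))  # search crop side in image pixels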

2.2 Post-processing step 1: scale and aspect-ratio penalties

outputs = self.model.track(x_crop)
#get coordinate
pred_bbox = self._convert_bbox(outputs['loc'], self.anchors)  # predicted boxes: [x_center, y_center, w, h]
#get cls
score = self._convert_score(outputs['cls'])        # predicted confidence scores

def change(hw_r):  # make sure the ratio is >= 1
	return np.maximum(hw_r, 1. / hw_r)

def get_scale(bbox_w, bbox_h):
    pad = (bbox_w + bbox_h) * 0.5
    return np.sqrt((bbox_w + pad) * (bbox_h + pad))

# scale penalty,
# compare the overall scale of the proposal and last frame
s_c = change(get_scale(pred_bbox[2, :], pred_bbox[3, :]) /
	(get_scale(self.size[0]*scale_z, self.size[1]*scale_z)))   # scale change rate: self.size=[w, h]; ratio of the predicted box scale to the previous frame's box scale

# aspect ratio penalty. ratio is ratio of height and width.
# compare proposal’s ratio and last frame
r_c = change((self.size[0]/self.size[1]) /
	(pred_bbox[2, :]/pred_bbox[3, :]))     # aspect-ratio change rate
penalty = np.exp(-(r_c * s_c - 1) * self.PENALTY_K)   # default self.PENALTY_K=0.16; r_c>=1 and s_c>=1, so larger changes get larger penalties
pscore = penalty * score
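
Equivalently, the penalty applied above can be written as:

\[penalty = e^{-k\,(r_c\,s_c - 1)}, \qquad r_c=\max\Bigl(\frac{r}{r'},\frac{r'}{r}\Bigr), \quad s_c=\max\Bigl(\frac{s}{s'},\frac{s'}{s}\Bigr)\]

where r and s are the aspect ratio and overall scale of a proposal, r' and s' are those of the previous frame's box, and k is PENALTY_K.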

2.3 Post-processing step 2: the Hanning window

Definition of the Hanning window

The Hanning window is based on a cosine function and is defined as:

\[w(n) = \begin{cases} 0.5\left(1-\cos\dfrac{2\pi n}{M-1}\right), & 0\le n\le M-1 \\ 0, & \text{otherwise} \end{cases}\]

For example, to build a Hanning window of length 4, take M=4 and evaluate the function at n = 0, 1, 2, 3:

\[w = \left[0.5\left(1-\cos\frac{2\pi\cdot 0}{3}\right),\ 0.5\left(1-\cos\frac{2\pi\cdot 1}{3}\right),\ 0.5\left(1-\cos\frac{2\pi\cdot 2}{3}\right),\ 0.5\left(1-\cos\frac{2\pi\cdot 3}{3}\right)\right] = [0,\ 0.75,\ 0.75,\ 0]\]

To build a 4x4 two-dimensional Hanning window, the outer product of the 1-D Hanning window above with itself, \(ww^T\), can be used directly:

\[ww^T = \begin{bmatrix}0\\0.75\\0.75\\0\end{bmatrix}\begin{bmatrix}0 & 0.75 & 0.75 & 0\end{bmatrix}\]

The resulting 4x4 Hanning window:
[[0.     0.     0.     0.    ]
 [0.     0.5625 0.5625 0.    ]
 [0.     0.5625 0.5625 0.    ]
 [0.     0.     0.     0.    ]]

Code to build a Hanning window with numpy is as follows:

import numpy as np

# build a 4x4 Hanning window:
def hanning(size=4):
    hanning = np.hanning(size)     # shape(4, )
    print(hanning, hanning.shape)
    window = np.outer(hanning, hanning)  # shape(4, 4):  hanning*(hanning)^T
    print(window, window.shape)

if __name__ == "__main__":
    hanning(4)

Using the Hanning window:

hanning = np.hanning(self.score_size)  # shape (17,)
window = np.outer(hanning, hanning)    # shape (17, 17)
self.window = np.tile(window.flatten(), self.anchor_num)   # shape (17*17*5,)
self.WINDOW_INFLUENCE = 0.4

# window penalty
pscore = pscore * (1 - self.WINDOW_INFLUENCE) + \
	self.window * self.WINDOW_INFLUENCE  # pscore is the prediction score after the scale+ratio penalty; self.window is the Hanning-window score; the two are combined with a weighted average
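
A short sketch of how the best proposal is then picked (variable names follow the snippets above; scale_z is the image-to-crop scale factor):

best_idx = np.argmax(pscore)               # index of the highest penalized score
bbox = pred_bbox[:, best_idx] / scale_z    # back to original-image scale: [dx, dy, w, h]
best_score = score[best_idx]               # un-penalized confidence, reused for the size update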

2.4 Updating the box

1. Understanding the anchors

The loc and cls outputs of the model are decoded; two points deserve attention here:

  • cls is decoded first, then the two post-processing penalties above are applied, which changes the scores; only after that is the highest-scoring bbox selected
  • When decoding the bbox, the anchors used differ slightly from the training anchors: train_anchor and track_anchor have different center points but the same widths and heights

As shown in the figure (left: train_anchor, right: track_anchor), the two differ as follows:

  • train_anchor center coordinates are pixel coordinates on the search_image
  • track_anchor center coordinates are offsets relative to the search_image center point (127, 127); a small sketch of generating these offsets follows
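
A small sketch of how these track-time anchor centers can be generated (assumed values: stride 8, 17x17 score map; widths/heights are reused from training, so only the centers are shown):

import numpy as np

stride, score_size = 8, 17
ori = -(score_size // 2) * stride                   # -64: leftmost offset from the search center
offsets = ori + stride * np.arange(score_size)      # -64, -56, ..., 64
xx, yy = np.meshgrid(offsets, offsets)              # per-cell (x, y) offsets in [-64, 64]
print(xx.min(), xx.max())                           # -64, 64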

2. Box decoding

The box decoding logic is as follows:

    def _convert_bbox(self, delta, anchor):
        """from loc to predict postion

        Parameters
        ----------
            delta : ndarray or np.ndarray
                network output
            anchor : np.ndarray
                generate anchor location

        Returns
        -------
            predicted position adjusted by the anchors
        """
        delta = nd.transpose(delta, axes=(1, 2, 3, 0))
        delta = nd.reshape(delta, shape=(4, -1))
        delta = delta.asnumpy()
        delta[0, :] = delta[0, :] * anchor[:, 2] + anchor[:, 0]
        delta[1, :] = delta[1, :] * anchor[:, 3] + anchor[:, 1]
        delta[2, :] = np.exp(delta[2, :]) * anchor[:, 2]
        delta[3, :] = np.exp(delta[3, :]) * anchor[:, 3]
        return delta

The line above that needs to be understood is:

delta[0, :] = delta[0, :] * anchor[:, 2] + anchor[:, 0]  
  • delta[0, :] * anchor[:, 2] is the network-predicted distance of the target center from the anchor center
  • anchor[:, 0] is the offset of the anchor center from (127, 127), and (127, 127) is the center of the search_image, i.e. the location of the target center in the template_image
  • Putting these together, delta[0, :] * anchor[:, 2] + anchor[:, 0] is the predicted target center expressed as an offset from the target center in the template_image

So in the end, the box update expression is:

center_x = bbox[0] + self.center_pos[0]  # center_pos is the target center from the previous frame; bbox is the network-predicted offset relative to that center
center_y = bbox[1] + self.center_pos[1]

# clip boundary
center_x, center_y, width, height = self._bbox_clip(center_x, center_y, width, height, img.shape[:2])
self.center_pos = np.array([center_x, center_y])

2.5 Width/height update

The width/height update is simple: a weighted average of the previous box's width/height and the predicted width/height:

# smooth bbox
penalty_lr = penalty[best_idx] * score[best_idx] * self.LR  # penalty is the scale/ratio penalty score, score is the network's predicted score; default self.LR=0.3
width = self.size[0] * (1 - penalty_lr) + bbox[2] * penalty_lr
height = self.size[1] * (1 - penalty_lr) + bbox[3] * penalty_lr

# clip boundary
center_x, center_y, width, height = self._bbox_clip(center_x, center_y, width, height, img.shape[:2])
self.size = np.array([width, height])