
Q&A | A Question About Bounding-Box Regression in Object Detection

October 4, 2018, AI研习社

This is AI研习社. Our Q&A board is now officially open, and everyone is welcome to join the discussion.

https://club.leiphone.com/page/question


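Here is a featured Q&A recommended by our editor from the AI研习社 Q&A community. If you have a question of your own, you are welcome to ask it in the community.

Without further ado, here is the question.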

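Q: A question about bounding-box regression in object detection

In object detection, region proposals have to be generated and then regressed. In the methods I have seen, the generated regions are all fed into convolutional layers, that is, they are mapped into feature space, which means box regression is carried out in feature space. I have never understood how a fully connected layer performs this regression. Also, if I do not use fully connected layers myself, how should I generate region proposals and perform the regression?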

Answers from community members

▼▼▼

@玛希•加西亚：

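Fully connected layers have been gradually fading out of object detection. In recognition, a fully connected layer is still kept to produce the score for each class; its main role is to relate features across the whole input, not, as the asker may assume, to output just a single value. In object detection, fully connected layers are slow, so they have largely been dropped. The trend in detection is also to fuse the feature maps of several layers rather than use only the last one.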

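For regression, the loss is the smooth L1 of the difference between the predicted value and the ground-truth object location. Both the prediction and the ground-truth location are encoded and decoded relative to the anchor box: encoding converts a position relative to the image into a position relative to the anchor box center, and decoding does the reverse.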
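The encode/decode step and the smooth L1 loss can be sketched in a few lines. This is only an illustration: the answer does not say which encoding is used, so the sketch assumes the common Faster R-CNN-style parameterization (center offsets divided by the anchor size, log-ratios for width and height). The function names and the `beta` parameter are made up for this example.

```python
import math

# Boxes and anchors are (cx, cy, w, h) in image coordinates.
# Assumption: Faster R-CNN-style encoding; the answer does not name one.

def encode(box, anchor):
    """Convert a box in image coordinates to offsets relative to the anchor."""
    bx, by, bw, bh = box
    ax, ay, aw, ah = anchor
    return ((bx - ax) / aw, (by - ay) / ah,
            math.log(bw / aw), math.log(bh / ah))

def decode(code, anchor):
    """Inverse of encode: turn anchor-relative offsets back into a box."""
    tx, ty, tw, th = code
    ax, ay, aw, ah = anchor
    return (ax + tx * aw, ay + ty * ah,
            aw * math.exp(tw), ah * math.exp(th))

def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear once |x| reaches beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def box_regression_loss(pred_code, gt_box, anchor):
    """Sum of smooth L1 over the four encoded coordinates."""
    target = encode(gt_box, anchor)
    return sum(smooth_l1(p - t) for p, t in zip(pred_code, target))
```

During training the network outputs `pred_code` directly and the loss compares it with the encoded ground truth; `decode` is only needed at inference time to turn the predicted offsets back into image coordinates.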

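Suppose the final feature map is 5 × 5. The feature map actually used then has the shape 5 × 5 × anchor number × (class, Lx, Ly, w, h). Lx and Ly are the offsets of the prediction or the ground-truth location from the anchor box center, and w and h are the size relative to the anchor box. Go over it a few times and it stops being confusing: the final result divides the image into 5 × 5 regions, each region has anchor number predictions, and each prediction consists of (class, Lx, Ly, w, h).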
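To make that layout concrete, here is a small sketch that regroups a flat list of network outputs into grid cell, anchor, and (class, Lx, Ly, w, h). It follows the answer's simplification of treating "class" as a single field; a real detector would usually emit one score per class. The function name `split_predictions` and the row-major ordering are assumptions made for this illustration.

```python
def split_predictions(flat, grid_h, grid_w, num_anchors, fields=5):
    """Regroup a flat output into preds[row][col][anchor] = (class, Lx, Ly, w, h).

    Assumes row-major order: grid row, then grid column, then anchor, then field.
    """
    expected = grid_h * grid_w * num_anchors * fields
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")
    preds = []
    i = 0
    for _ in range(grid_h):
        row = []
        for _ in range(grid_w):
            cell = []
            for _ in range(num_anchors):
                cell.append(tuple(flat[i:i + fields]))
                i += fields
            row.append(cell)
        preds.append(row)
    return preds
```

With a 5 × 5 grid and, say, 3 anchors per cell, the network has to produce 5 × 5 × 3 × 5 = 375 values, and `split_predictions(values, 5, 5, 3)[r][c][a]` is the prediction of anchor `a` in grid cell `(r, c)`.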

@muglelei：

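You can refer to this part of the R-FCN code in the TensorFlow Object Detection API (from image_features to the final box_encodings; the comments are my own additions and may not be correct):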

    net = image_features
    with slim.arg_scope(self._conv_hyperparams):
      # depth: Target depth to reduce the input feature maps to.
      # 1x1 conv layer that reduces the number of feature maps
      net = slim.conv2d(net, self._depth, [1, 1], scope='reduce_depth')
      
      # Location predictions.

      # box_code_size: Size of encoding for each box. [default = 4]
      # depth = k^2 * num_classes * box_code_size (as computed below)
      location_feature_map_depth = (self._num_spatial_bins[0] *
                                    self._num_spatial_bins[1] *
                                    self.num_classes *
                                    self._box_code_size)
      
      # 1x1 conv layer: for each class, produces k^2 * box_code_size position-sensitive maps
      # (cf. the R-FCN paper's sibling 4k^2-d conv layer for bounding box regression)
      location_feature_map = slim.conv2d(net, location_feature_map_depth, [1, 1], 
                                         activation_fn=None,
                                         scope='refined_locations')
      
      # Position-sensitive RoI pooling layer
      box_encodings = ops.position_sensitive_crop_regions(
          location_feature_map,
          boxes=tf.reshape(proposal_boxes, [-1, self._box_code_size]),
          box_ind=get_box_indices(proposal_boxes),
          crop_size=self._crop_size,
          num_spatial_bins=self._num_spatial_bins,
          global_pool=True)
      
      # tf.squeeze removes the dimensions of size 1
      box_encodings = tf.squeeze(box_encodings, squeeze_dims=[1, 2])
      
      # Reshape the box encodings
      box_encodings = tf.reshape(box_encodings,
                                 [batch_size * num_boxes, 1, 
                                  self.num_classes, self._box_code_size])
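
The idea behind the position-sensitive pooling step can be sketched in plain Python for a single RoI. This is not the TensorFlow implementation: the channel layout (one contiguous group of channels per spatial bin), the integer end-exclusive RoI coordinates, and the names `location_depth` and `ps_pool` are all assumptions made for this illustration.

```python
import math

def location_depth(k, num_classes, box_code_size=4):
    """Channel count of the location feature map: k^2 * num_classes * box_code_size."""
    return k * k * num_classes * box_code_size

def ps_pool(feature_map, roi, k, out_dim):
    """Position-sensitive average pooling for one RoI, followed by global pooling.

    feature_map: nested list [H][W][C] with C == k * k * out_dim.
    roi: (y0, x0, y1, x1), integer pixel coordinates, end-exclusive.
    Assumed layout: bin (i, j) reads channels [(i*k + j) * out_dim, +out_dim).
    """
    y0, x0, y1, x1 = roi
    if len(feature_map[0][0]) != k * k * out_dim:
        raise ValueError("channel count must equal k * k * out_dim")
    bin_h = (y1 - y0) / k
    bin_w = (x1 - x0) / k
    votes = [0.0] * out_dim
    for i in range(k):
        ys = range(y0 + math.floor(i * bin_h), y0 + math.ceil((i + 1) * bin_h))
        for j in range(k):
            xs = range(x0 + math.floor(j * bin_w), x0 + math.ceil((j + 1) * bin_w))
            base = (i * k + j) * out_dim  # this bin's own channel group
            area = len(ys) * len(xs)
            for d in range(out_dim):
                total = sum(feature_map[y][x][base + d] for y in ys for x in xs)
                votes[d] += total / area
    # Average the k*k bin responses into one vector for the RoI
    return [v / (k * k) for v in votes]
```

Here `out_dim` plays the role of `num_classes * box_code_size`, so the pooled vector can be reshaped to `[num_classes, box_code_size]`, which mirrors the final reshape in the snippet above. No fully connected layer is involved: the regression comes from a 1×1 convolution followed by pooling.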