Optical flow, which expresses pixel displacement, is widely used in many computer vision tasks to provide pixel-level motion information. However, with the remarkable progress of convolutional neural networks, recent state-of-the-art approaches solve problems directly at the feature level. Since the displacement of a feature vector is not consistent with the pixel displacement, a common approach is to forward the optical flow to a neural network and fine-tune this network on the task dataset. With this method, the fine-tuned network is expected to produce tensors encoding feature-level motion information. In this paper, we rethink this de facto paradigm and analyze its drawbacks in the video object detection task. To mitigate these issues, we propose a novel network (IFF-Net) with an \textbf{I}n-network \textbf{F}eature \textbf{F}low estimation module (IFF module) for video object detection. Without resorting to pre-training on any additional dataset, our IFF module directly produces \textbf{feature flow}, which indicates the feature displacement. The IFF module is a shallow module that shares features with the detection branches. This compact design enables our IFF-Net to detect objects accurately while maintaining a fast inference speed. Furthermore, we propose a transformation residual loss (TRL) based on \textit{self-supervision}, which further improves the performance of our IFF-Net. Our IFF-Net outperforms existing methods and achieves state-of-the-art performance on ImageNet VID.
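To make the notion of "feature flow" concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation) of how a per-location displacement field can warp a previous frame's feature map via bilinear sampling. The function name `warp_features` and the NumPy formulation are illustrative assumptions.

```python
import numpy as np

def warp_features(feat, flow):
    """Bilinearly warp a feature map by a per-location flow field.

    feat: (C, H, W) feature map from a reference frame.
    flow: (2, H, W) feature-level displacement (dx, dy) per location.
    Returns the warped (C, H, W) feature map, zero-padded outside borders.
    (Illustrative sketch; not the IFF module itself.)
    """
    C, H, W = feat.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    sx = xs + flow[0]  # sampling x-coordinates
    sy = ys + flow[1]  # sampling y-coordinates
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = sx - x0, sy - y0  # bilinear interpolation weights

    def gather(y, x):
        # Fetch feat[:, y, x], returning zeros for out-of-bounds samples.
        valid = (y >= 0) & (y < H) & (x >= 0) & (x < W)
        yc = np.clip(y, 0, H - 1)
        xc = np.clip(x, 0, W - 1)
        return feat[:, yc, xc] * valid

    return (gather(y0, x0) * (1 - wx) * (1 - wy)
            + gather(y0, x1) * wx * (1 - wy)
            + gather(y1, x0) * (1 - wx) * wy
            + gather(y1, x1) * wx * wy)
```

Under this view, a flow-based paradigm aggregates detection features across frames by warping them into a common frame; the abstract's point is that such flow must describe *feature* displacement, not raw pixel displacement.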