Video object detection is challenging when object appearance deteriorates in certain video frames. It is therefore natural to aggregate temporal information from other frames of the same video into the current frame. However, RoI Align, one of the core procedures of video detectors, still extracts proposal features from a single-frame feature map, so the extracted RoI features lack temporal information from the video. In this work, observing that the features of the same object instance are highly similar across the frames of a video, we propose a novel Temporal RoI Align operator that uses feature similarity to extract features for current-frame proposals from the feature maps of other frames. The proposed operator can extract temporal information for proposals from the entire video. We integrate it into single-frame video detectors and other state-of-the-art video detectors, and conduct quantitative experiments to demonstrate that it consistently and significantly boosts performance. In addition, the proposed Temporal RoI Align can also be applied to video instance segmentation.
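To make the idea of similarity-based feature extraction concrete, the following is a minimal sketch and not the operator itself. The abstract only states that feature similarity is used; the cosine measure, the top-k selection, the softmax weighting, and the names `cosine` and `aggregate_from_support` are all assumptions made for illustration. Feature maps are flattened to plain lists of per-location feature vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def aggregate_from_support(roi_vectors, support_vectors, top_k=2):
    """For each RoI location of the current frame, gather features from another frame.

    roi_vectors:     feature vectors at the RoI locations (current frame)
    support_vectors: feature vectors at all locations of another frame's feature map
    top_k:           hypothetical number of most similar locations to keep
    """
    aggregated = []
    for query in roi_vectors:
        sims = [cosine(query, s) for s in support_vectors]
        # Indices of the top_k most similar support locations.
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]
        # Softmax over the selected similarities (assumed weighting scheme).
        peak = max(sims[i] for i in top)
        weights = [math.exp(sims[i] - peak) for i in top]
        total = sum(weights)
        # Similarity-weighted sum of the selected support features.
        aggregated.append([
            sum(weights[j] / total * support_vectors[i][d] for j, i in enumerate(top))
            for d in range(len(query))
        ])
    return aggregated


if __name__ == "__main__":
    support = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(aggregate_from_support([[2.0, 0.0]], support, top_k=1))
```

With `top_k=1` each RoI location simply copies the single most similar feature from the other frame. Larger values blend several candidates, which in this sketch trades sharpness for robustness to a single bad match.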