We address the problem of detecting objects in videos by exploiting temporal context. Our core idea is to link objects over both short and long temporal ranges to improve classification quality. First, for each short video segment, our approach proposes a set of candidate spatio-temporal cuboids, each of which serves as a container that associates an object across the frames of the segment. Second, it regresses the precise box location in each frame for each cuboid proposal, yielding a tubelet with a single classification score aggregated from the scores of the boxes in the tubelet. Third, we extend non-maximum suppression (NMS) to remove spatially overlapping tubelets within the short segment, which avoids the broken tubelets that frame-wise NMS would produce. Finally, we link tubelets across temporally overlapping short segments over the whole video and aggregate the scores of the linked tubelets to boost the classification scores of positive detections. Experiments on the ImageNet VID dataset show that our approach achieves state-of-the-art performance.
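The tubelet scoring and tubelet-level suppression steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how box scores are aggregated or how overlap between tubelets is measured, so the mean score, the mean per-frame IoU, the 0.5 threshold, and all function names below are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Tubelet = List[Box]                      # one box per frame of a short segment


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tubelet_score(box_scores: Sequence[float]) -> float:
    """Assumption: a tubelet's score is the mean of its per-box scores."""
    return sum(box_scores) / len(box_scores)


def tubelet_overlap(t1: Tubelet, t2: Tubelet) -> float:
    """Assumption: tubelet overlap is the mean per-frame IoU.

    Both tubelets are expected to cover the same frames in the same order.
    """
    return sum(box_iou(a, b) for a, b in zip(t1, t2)) / len(t1)


def tubelet_nms(
    tubelets: Sequence[Tubelet],
    scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical value
) -> List[int]:
    """Greedy NMS over whole tubelets; returns kept indices, best score first."""
    order = sorted(range(len(tubelets)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a tubelet only if it does not overlap any higher-scoring kept one.
        if all(tubelet_overlap(tubelets[i], tubelets[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    t_a = [(0, 0, 10, 10), (1, 0, 11, 10)]
    t_b = [(1, 1, 11, 11), (2, 1, 12, 11)]      # near-duplicate of t_a
    t_c = [(50, 50, 60, 60), (51, 50, 61, 60)]  # a separate object
    scores = [tubelet_score(s) for s in ([0.9, 0.8], [0.6, 0.7], [0.5, 0.5])]
    print(tubelet_nms([t_a, t_b, t_c], scores))
```

Because each tubelet is kept or suppressed as a unit, a track is never dropped in some frames and kept in others, which is the failure mode of frame-wise NMS that the abstract refers to.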