Unlike static images, videos contain additional temporal and spatial information that can benefit object detection. However, it is costly to obtain the large number of videos with bounding-box annotations that supervised deep learning requires. Although humans can easily learn to recognize new objects by watching only a few video clips, deep learning models typically overfit in this regime. This leads to an important question: how can we effectively learn a video object detector from only a few labeled video clips? In this paper, we study the new problem of few-shot learning for video object detection. We first define the few-shot setting and create a new benchmark dataset for few-shot video object detection derived from the widely used ImageNet VID dataset. We employ a transfer-learning framework to effectively train the video object detector on a large number of base-class objects and a few video clips of novel-class objects. By analyzing the results of two methods under this framework (Joint and Freeze) on our designed weak and strong base datasets, we reveal an insufficiency problem and an overfitting problem, respectively. We then develop a simple but effective method, called Thaw, that trades off the two problems and validates our analysis. Extensive experiments on our proposed benchmark datasets with different scenarios demonstrate the effectiveness of our analysis on this new few-shot video object detection problem.
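To make the freeze-then-thaw idea concrete, here is a minimal PyTorch sketch. The `Detector` architecture, the `set_requires_grad` helper, the class count, and the learning rate are all illustrative assumptions; the abstract does not specify the paper's actual model or hyperparameters.

```python
import torch
import torch.nn as nn

class Detector(nn.Module):
    """Stand-in for a video object detector: a feature backbone plus a
    detection head. The real architecture is not given in the abstract."""
    def __init__(self, num_classes: int = 21):  # hypothetical class count
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        return self.head(self.backbone(x))

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    """Toggle gradient updates for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = flag

detector = Detector()  # assume it was pretrained on the base classes

# "Freeze": keep base-class features fixed and adapt only the head on the
# few novel-class clips (risks insufficiency if base features transfer poorly).
set_requires_grad(detector.backbone, False)
# ... fine-tune detector.head on the few labeled novel-class clips here ...

# "Thaw": after the head has adapted, unfreeze the backbone and fine-tune
# end-to-end with a small learning rate, trading off insufficiency against
# overfitting. ("Joint" would instead fine-tune everything from the start.)
set_requires_grad(detector.backbone, True)
optimizer = torch.optim.SGD(detector.parameters(), lr=1e-4)  # illustrative lr
```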