In this paper, we investigate a weakly-supervised object detection framework. Most existing frameworks focus on using static images to learn object detectors. However, these detectors often fail to generalize to videos because of the domain shift between the two modalities. Therefore, we investigate learning these detectors directly from boring videos of daily activities. Instead of using bounding boxes, we explore the use of action descriptions as supervision since they are relatively easy to gather. A common issue, however, is that objects of interest that are not involved in human actions are often absent from global action descriptions, an issue known as the "missing label" problem. To tackle this problem, we propose a novel temporal dynamic graph Long Short-Term Memory network (TD-Graph LSTM). TD-Graph LSTM enables global temporal reasoning by constructing a dynamic graph that is based on temporal correlations of object proposals and spans the entire video. The missing-label issue for each individual frame can thus be significantly alleviated by transferring knowledge across correlated object proposals over the whole video. Extensive evaluations on a large-scale daily-life action dataset (i.e., Charades) demonstrate the superiority of our proposed method. We also release object bounding-box annotations for more than 5,000 frames in Charades. We believe this annotated data can benefit other research on video-based object recognition in the future.
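To make the graph-construction step concrete, the sketch below links each object proposal in frame t to its top-k most correlated proposals in frame t+1, yielding the per-frame-pair edges of a dynamic graph spanning the video. This is a minimal illustration only: representing proposals by appearance feature vectors, measuring correlation with cosine similarity, and the helper name build_temporal_edges are all assumptions for exposition, not the paper's exact formulation.

```python
# Illustrative sketch: build temporal edges between object proposals of
# adjacent frames by appearance similarity (an assumed stand-in for the
# proposal-correlation measure used to construct the dynamic graph).
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between two sets of feature vectors.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def build_temporal_edges(frames, k=3):
    """frames: list of (num_proposals, feat_dim) arrays, one per frame.
    Returns edges[t]: pairs (i, j) linking proposal i in frame t to its
    k most similar proposals j in frame t+1."""
    edges = []
    for t in range(len(frames) - 1):
        sim = cosine_sim(frames[t], frames[t + 1])
        topk = np.argsort(-sim, axis=1)[:, :k]
        edges.append([(i, int(j)) for i in range(sim.shape[0]) for j in topk[i]])
    return edges

# Toy usage: 3 frames, each with 5 random 128-d proposal features.
rng = np.random.default_rng(0)
frames = [rng.standard_normal((5, 128)) for _ in range(3)]
print(build_temporal_edges(frames, k=2)[0])
```

In a full model, an LSTM-style recurrent update would then pass messages along these edges, so that an action label observed in one frame can supervise correlated proposals in frames where the label is missing.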