We propose the task Future Object Detection, in which the goal is to predict the bounding boxes for all visible objects in a future video frame. While this task involves recognizing temporal and kinematic patterns in addition to the semantic and geometric ones, it only requires annotations in the standard form for a single (future) frame, in contrast to expensive full-sequence annotations. We propose to tackle this task with an end-to-end method, in which a detection transformer is trained to directly output the future objects. In order to make accurate predictions about the future, it is necessary to capture the dynamics in the scene, both object motion and the movement of the ego-camera. To this end, we extend existing detection transformers in two ways. First, we experiment with three different mechanisms that enable the network to spatiotemporally process multiple frames. Second, we provide ego-motion information to the model in a learnable manner. We show that both of these extensions improve future object detection performance substantially. Our final approach learns to capture the dynamics and makes predictions on par with an oracle for prediction horizons up to 100 ms, and outperforms all baselines for longer prediction horizons. By visualizing the attention maps, we observe that a form of tracking emerges within the network. Code is available at github.com/atonderski/future-object-detection.
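To make the two extensions concrete, the sketch below shows one way a DETR-style detector could be adapted to consume multiple past frames and a learnable ego-motion encoding before decoding object queries for the future frame. This is a minimal illustration in PyTorch, not the authors' implementation: the class and parameter names (`FutureDETR`, `ego_mlp`, the toy convolutional backbone, tensor shapes) are hypothetical assumptions for readability.

```python
# Hypothetical sketch of a DETR-style future object detector.
# Assumes PyTorch; all names and hyperparameters here are illustrative,
# not taken from the paper's released code.
import torch
import torch.nn as nn


class FutureDETR(nn.Module):
    """Attends over features from T past frames, conditioned on ego-motion,
    and decodes object queries for a future frame."""

    def __init__(self, d_model=256, num_queries=100, num_classes=10, ego_dim=6):
        super().__init__()
        # Stand-in for a CNN backbone producing a feature map per frame.
        self.backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Learnable encoding of the ego-motion signal (assumed 6-DoF here).
        self.ego_mlp = nn.Sequential(
            nn.Linear(ego_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Per-frame temporal embedding (supports up to 8 past frames).
        self.temporal_embed = nn.Parameter(torch.zeros(8, d_model))
        # Joint spatiotemporal encoder + query decoder.
        self.transformer = nn.Transformer(
            d_model=d_model, num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                  # (cx, cy, w, h), normalized

    def forward(self, frames, ego_motion):
        # frames: (B, T, 3, H, W) past frames; ego_motion: (B, T, ego_dim).
        B, T = frames.shape[:2]
        tokens = []
        for t in range(T):
            feat = self.backbone(frames[:, t])                 # (B, d, H/16, W/16)
            feat = feat.flatten(2).transpose(1, 2)             # (B, N, d)
            # Add frame index and ego-motion information to every spatial token.
            feat = feat + self.temporal_embed[t] + self.ego_mlp(ego_motion[:, t]).unsqueeze(1)
            tokens.append(feat)
        memory_in = torch.cat(tokens, dim=1)                   # one spatiotemporal token sequence
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        hs = self.transformer(memory_in, queries)              # decode future-frame object queries
        return self.class_head(hs), self.box_head(hs).sigmoid()


if __name__ == "__main__":
    model = FutureDETR()
    frames = torch.randn(2, 4, 3, 128, 128)   # 4 past frames
    ego = torch.randn(2, 4, 6)                 # per-frame ego-motion
    logits, boxes = model(frames, ego)
    print(logits.shape, boxes.shape)           # (2, 100, 11), (2, 100, 4)
```

The sketch uses the simplest of the three spatiotemporal mechanisms one could imagine (concatenating all frame tokens into a single encoder sequence); the paper compares several such mechanisms, and the ego-motion MLP stands in for its learnable ego-motion conditioning.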