We explore future object prediction -- a challenging problem in which all objects visible in a future video frame are to be predicted. We propose to tackle this problem end-to-end by training a detection transformer to directly output future objects. To make accurate predictions about the future, it is necessary to capture the dynamics in the scene, both of other objects and of the ego-camera. We extend existing detection transformers in two ways to capture the scene dynamics. First, we experiment with three different mechanisms that enable the model to spatiotemporally process multiple frames. Second, we feed ego-motion information to the model via cross-attention. We show that both of these cues substantially improve future object prediction performance. Our final approach learns to capture the dynamics, making predictions on par with an oracle for a 100 ms prediction horizon and outperforming baselines for longer prediction horizons.
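To make the second extension concrete, the sketch below illustrates one plausible way to inject ego-motion into a DETR-style decoder via an extra cross-attention step. This is a minimal PyTorch sketch under our own assumptions, not the paper's implementation: the ego-motion input is assumed to be a short history of pose/velocity vectors, and all names (e.g. EgoMotionCrossAttention, ego_dim) are hypothetical.

```python
# Minimal sketch: object queries cross-attend to projected ego-motion tokens.
# Assumes ego-motion arrives as a short sequence of 6-D pose/velocity vectors.
import torch
import torch.nn as nn


class EgoMotionCrossAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8, ego_dim: int = 6):
        super().__init__()
        # Project raw ego-motion vectors into the transformer's embedding space
        # so they can serve as keys/values for the object queries.
        self.ego_proj = nn.Linear(ego_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries: torch.Tensor, ego_motion: torch.Tensor) -> torch.Tensor:
        # queries:    (batch, num_queries, d_model) object queries from the decoder
        # ego_motion: (batch, num_steps, ego_dim)   ego-motion history
        ego_tokens = self.ego_proj(ego_motion)
        attn_out, _ = self.cross_attn(query=queries, key=ego_tokens, value=ego_tokens)
        # Residual connection + layer norm, as in standard transformer layers.
        return self.norm(queries + attn_out)


if __name__ == "__main__":
    layer = EgoMotionCrossAttention()
    q = torch.randn(2, 100, 256)   # 100 object queries per sample
    ego = torch.randn(2, 4, 6)     # 4 past time steps of 6-D ego-motion
    print(layer(q, ego).shape)     # torch.Size([2, 100, 256])
```

In such a design, the ego-motion tokens act as an additional context stream alongside the image features, letting each object query condition its future prediction on how the camera itself is moving.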