We study the task of object interaction anticipation in egocentric videos. Successful prediction of future actions and objects requires an understanding of the spatio-temporal context formed by past actions and object relationships. We propose TransFusion, a multimodal transformer-based architecture that effectively makes use of the representational power of language by summarizing past actions concisely. TransFusion leverages pre-trained image captioning models and summarizes the resulting captions, focusing on past actions and objects. This action context, together with a single input frame, is processed by a multimodal fusion module to forecast the next object interactions. Our model enables more efficient end-to-end learning by replacing dense video features with language representations, allowing us to benefit from the knowledge encoded in large pre-trained models. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model and the benefits of using language-based context summaries. Our method outperforms state-of-the-art approaches by 40.4% in overall mAP on the Ego4D test set. We show the generality of TransFusion via experiments on EPIC-KITCHENS-100. Video and code are available at: https://eth-ait.github.io/transfusion-proj/.
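To make the fusion idea described above concrete, the following is a minimal illustrative sketch of a transformer-based multimodal fusion module in PyTorch: tokenized language context (the summarized past actions and objects) and a single frame feature are concatenated into one token sequence and processed jointly to predict the next interaction. All names, dimensions, class counts, and the noun/verb output heads are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of language-context + frame fusion with a transformer.
# Hypothetical module; dimensions and class counts are placeholders.
import torch
import torch.nn as nn


class FusionSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=512, n_heads=8,
                 n_layers=4, n_nouns=128, n_verbs=64):
        super().__init__()
        # embeddings for the tokenized language summary of past actions/objects
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # project a precomputed frame feature (e.g. from a frozen image backbone)
        self.frame_proj = nn.Linear(2048, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        # separate heads for the anticipated object (noun) and action (verb)
        self.noun_head = nn.Linear(d_model, n_nouns)
        self.verb_head = nn.Linear(d_model, n_verbs)

    def forward(self, context_ids, frame_feat):
        # context_ids: (B, T) token ids of the language-based action context
        # frame_feat:  (B, 2048) feature of the single input frame
        B = context_ids.size(0)
        tokens = torch.cat([
            self.cls.expand(B, -1, -1),
            self.frame_proj(frame_feat).unsqueeze(1),
            self.text_embed(context_ids),
        ], dim=1)
        fused = self.fusion(tokens)[:, 0]  # read out the [CLS]-style token
        return self.noun_head(fused), self.verb_head(fused)


if __name__ == "__main__":
    model = FusionSketch()
    noun_logits, verb_logits = model(
        torch.randint(0, 30522, (2, 32)), torch.randn(2, 2048))
    print(noun_logits.shape, verb_logits.shape)
```

The sketch only illustrates the fusion interface: in practice the language context would come from summarized captions of past frames, and the prediction targets would follow the anticipation task definition of the respective benchmark.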