Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another). However, in wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks -- hand-object manipulations, navigation in the space, or human-human interactions -- that unfold continuously, driven by the person's goals. We argue that this calls for a much more unified approach. We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once. Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition. Demonstrating our model on a wide array of video tasks from Ego4D, we show its advantages over existing transfer paradigms and achieve top-ranked results on four of the Ego4D 2022 benchmark challenges.
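To make the "flipped design" concrete (separate, frozen task-specific backbones feeding one task translator shared across all tasks), here is a minimal PyTorch-style sketch. All names (`TaskTranslator`, `EgoT2Sketch`), feature dimensions, and the transformer-based fusion are illustrative assumptions for exposition, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the "flipped" design described above (names and shapes are
# illustrative assumptions, not the paper's code): each task keeps its own
# frozen, task-specific backbone, and a single task translator shared across
# tasks fuses their outputs into a prediction for the target task.


class TaskTranslator(nn.Module):
    """Shared translator: attends over stacked task-specific features."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, task_feats: torch.Tensor) -> torch.Tensor:
        # task_feats: (batch, num_tasks, feat_dim) -- one token per task model
        fused = self.encoder(task_feats)      # (batch, num_tasks, feat_dim)
        return self.head(fused.mean(dim=1))   # pool over tasks, then classify


class EgoT2Sketch(nn.Module):
    def __init__(self, backbones: list[nn.Module], feat_dim: int, num_classes: int):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)
        for p in self.backbones.parameters():  # task-specific models stay frozen
            p.requires_grad_(False)
        self.translator = TaskTranslator(feat_dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = torch.stack([b(video) for b in self.backbones], dim=1)
        return self.translator(feats)


if __name__ == "__main__":
    # Stand-in backbones: in practice these would be video models trained on
    # separate egocentric tasks, each emitting a feat_dim-dimensional output.
    feat_dim, num_classes = 64, 10
    backbones = [
        nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, feat_dim))
        for _ in range(3)
    ]
    model = EgoT2Sketch(backbones, feat_dim, num_classes)
    clip = torch.randn(2, 3, 8, 32, 32)        # (batch, C, T, H, W) dummy clips
    print(model(clip).shape)                   # torch.Size([2, 10])
```

In this sketch only the shared translator's parameters are trained, so the task-specific backbones never compete for capacity and adding another task amounts to plugging in one more frozen model, which is the intuition behind mitigating task competition while still capturing cross-task synergies.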