Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another). However, in wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks -- hand-object manipulations, navigation in the space, or human-human interactions -- that unfold continuously, driven by the person's goals. We argue that this calls for a much more unified approach. We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once. Unlike traditional transfer or multi-task learning, EgoT2's flipped design entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition. Demonstrating our model on a wide array of video tasks from Ego4D, we show its advantages over existing transfer paradigms and achieve top-ranked results on four of the Ego4D 2022 benchmark challenges.
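
To make the "flipped design" concrete, below is a minimal, hypothetical sketch of the idea described above: task-specific backbones are kept frozen, and a single translator module shared across tasks attends over their outputs to produce a prediction for a target task. All names (`TaskTranslator`, `feat_dims`, the transformer-encoder choice, the pooling and head) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class TaskTranslator(nn.Module):
    """Hypothetical sketch: fuse outputs of frozen task-specific backbones
    with a shared transformer encoder and decode a target-task prediction."""

    def __init__(self, backbones, feat_dims, d_model=256, num_classes=10):
        super().__init__()
        # Task-specific backbones stay frozen; unlike usual transfer learning,
        # the component shared across tasks is the translator, not a backbone.
        self.backbones = nn.ModuleList(backbones)
        for b in self.backbones:
            for p in b.parameters():
                p.requires_grad = False
        # Project each backbone's feature width to a common dimension.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in feat_dims])
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.translator = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, video):
        # Each frozen backbone yields a token sequence of shape (B, T_i, D_i);
        # this is an assumption about the backbone interface for illustration.
        with torch.no_grad():
            feats = [b(video) for b in self.backbones]
        tokens = [p(f) for f, p in zip(feats, self.proj)]
        # Concatenate tokens from all tasks so the shared translator can
        # attend across them and exploit cross-task synergies.
        fused = self.translator(torch.cat(tokens, dim=1))
        # Pool over tokens and predict the target task.
        return self.head(fused.mean(dim=1))
```

Only the projection layers, translator, and head are trained, which is one plausible way the design could mitigate task competition: the frozen backbones preserve each task's specialization while the shared translator learns how to combine them.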