Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack character make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving state-of-the-art results for NLQ, we also demonstrate unique properties of our approach, such as gains on long-tail object queries and the ability to perform zero-shot and few-shot NLQ.
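To make the core idea of Narrations-as-Queries concrete, the sketch below shows one plausible way to convert timestamped narrations into NLQ-style (query, temporal window) training samples. It is a minimal illustration, not the paper's exact pipeline: the field names (`video_uid`, `timestamp_sec`, etc.) and the fixed-width window heuristic around each narration timestamp are assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container types for illustration only; the field names are
# assumptions, not the actual Ego4D narration schema.
@dataclass
class Narration:
    video_uid: str
    text: str             # e.g., "C picks up the knife from the counter"
    timestamp_sec: float  # single time point at which the narration was written

@dataclass
class NLQSample:
    video_uid: str
    query: str
    start_sec: float
    end_sec: float

def narrations_to_queries(narrations: List[Narration],
                          window_sec: float = 4.0) -> List[NLQSample]:
    """Turn timestamped narrations into query-localization training samples.

    Assumption: each narration's temporal window is approximated by a fixed
    window centered on its timestamp; the actual NaQ method may derive the
    window differently (e.g., from surrounding narration density).
    """
    samples = []
    for n in narrations:
        start = max(0.0, n.timestamp_sec - window_sec / 2)
        end = n.timestamp_sec + window_sec / 2
        # Each narration becomes a free-form query paired with a target window,
        # matching the input/output structure of the NLQ task.
        samples.append(NLQSample(n.video_uid, n.text, start, end))
    return samples
```

Under this view, the large pool of narrations acts as cheap, automatically derived supervision that augments the comparatively scarce hand-annotated NLQ training pairs.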