Temporal action localization (TAL) is an important task that has been extensively explored and improved for third-person videos in recent years. Recent efforts have extended fine-grained temporal localization to first-person videos. However, current TAL methods rely on visual signals alone, neglecting the audio modality that is present in most videos and that carries meaningful action information in egocentric recordings. In this work, we take a deep look into the effectiveness of audio for detecting actions in egocentric videos and introduce a simple yet effective approach via Observing, Watching, and Listening (OWL) that leverages audio-visual information and context for egocentric TAL. To do so, we: 1) compare and study different strategies for where and how to fuse the two modalities; 2) propose a transformer-based model to incorporate temporal audio-visual context. Our experiments show that our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
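To make the two ingredients of the abstract concrete, the sketch below illustrates one possible realization of the ideas described above: per-snippet audio and visual features are fused and then passed through a transformer encoder that models temporal audio-visual context. This is a minimal, hypothetical example, not the OWL architecture itself; the feature dimensions, the concatenation-based fusion, and the classifier head are all assumptions made for illustration.

```python
# Minimal sketch (assumed, not the authors' OWL model) of feature-level audio-visual
# fusion followed by a transformer encoder over temporal context.
import torch
import torch.nn as nn

class AudioVisualContextModel(nn.Module):
    def __init__(self, visual_dim=2304, audio_dim=512, d_model=256,
                 num_layers=2, num_heads=4, num_classes=97):
        # visual_dim / audio_dim / num_classes are illustrative values, not from the paper.
        super().__init__()
        # Project each modality into a shared embedding space.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # One possible "where/how to fuse" choice: concatenate per snippet, then mix.
        self.fuse = nn.Linear(2 * d_model, d_model)
        # Transformer encoder incorporates temporal audio-visual context across snippets.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Per-snippet action scores.
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, time, visual_dim); audio_feats: (batch, time, audio_dim)
        v = self.visual_proj(visual_feats)
        a = self.audio_proj(audio_feats)
        fused = self.fuse(torch.cat([v, a], dim=-1))   # (batch, time, d_model)
        context = self.temporal_encoder(fused)         # contextualized audio-visual features
        return self.classifier(context)                # (batch, time, num_classes)
```

Other fusion points (e.g. fusing before feature extraction or only at the score level) could be swapped in at the `self.fuse` step, which is the kind of design choice the abstract's first contribution compares.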