Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos

Understanding human emotions is a crucial ability for intelligent robots to provide better human-robot interactions. Existing work is limited to video-level emotion classification on trimmed videos and cannot locate the temporal window corresponding to an emotion. In this paper, we introduce a new task, Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles. TEL presents three unique challenges compared to temporal action localization: 1) emotions have extremely varied temporal dynamics; 2) emotion cues are embedded in both appearance and complex plots; 3) fine-grained temporal annotation is complicated and labor-intensive. To address the first two challenges, we propose a novel dilated context integrated network with a coarse-fine two-stream architecture. The coarse stream captures varied temporal dynamics by modeling multi-granularity temporal contexts. The fine stream understands complex plots by reasoning about the dependencies among the multi-granularity temporal contexts from the coarse stream and adaptively integrating them into fine-grained video segment features. To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitles to achieve weakly-supervised learning. We contribute a new test set with 3,000 manually annotated temporal boundaries so that future research on the TEL problem can be quantitatively evaluated. Extensive experiments show the effectiveness of our approach on temporal emotion localization. The repository of this work is at https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos.
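To make the coarse-fine idea more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "multi-granularity temporal contexts" can be approximated by averaging each segment with neighbours at several dilation rates, and that "adaptive integration" can be approximated by a softmax over dot-product scores. The function names (`dilated_context`, `integrate_contexts`), the dilation rates, and the residual fusion are all illustrative assumptions; the abstract does not specify them.

```python
import math


def dilated_context(features, dilation):
    """Coarse-stream stand-in: average each segment with the neighbours
    `dilation` steps to its left and right (indices clamped at the edges)."""
    n = len(features)
    dim = len(features[0])
    out = []
    for t in range(n):
        left = features[max(t - dilation, 0)]
        right = features[min(t + dilation, n - 1)]
        out.append([(left[k] + features[t][k] + right[k]) / 3.0 for k in range(dim)])
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def integrate_contexts(features, dilations=(1, 2, 4)):
    """Fine-stream stand-in: weight each granularity by its similarity to the
    segment feature, then add the weighted contexts back to the segment."""
    contexts = [dilated_context(features, d) for d in dilations]
    fused = []
    for t, seg in enumerate(features):
        # One dot-product score per granularity (an assumed scoring rule).
        scores = [sum(a * b for a, b in zip(seg, ctx[t])) for ctx in contexts]
        weights = softmax(scores)
        fused.append([
            seg[k] + sum(w * ctx[t][k] for w, ctx in zip(weights, contexts))
            for k in range(len(seg))
        ])
    return fused


if __name__ == "__main__":
    # Four toy video segments with 2-dimensional features.
    segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    for row in integrate_contexts(segments):
        print(row)
```

A trained model would presumably replace the fixed averaging with dilated temporal convolutions and the dot-product scores with trained attention; the sketch only shows how contexts at several temporal scales can be computed once and then mixed per segment.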
