Face, Body, Voice: Video Person-Clustering with Multiple Modalities

The objective of this work is person-clustering in videos -- grouping characters according to their identity. Previous methods focus on the narrower task of face-clustering, and for the most part ignore other cues such as the person's voice, their overall appearance (hair, clothes, posture), and the editing structure of the videos. Similarly, most current datasets evaluate only the task of face-clustering, rather than person-clustering. This limits their applicability to downstream applications such as story understanding which require person-level, rather than only face-level, reasoning. In this paper we make contributions to address both these deficiencies: first, we introduce a Multi-Modal High-Precision Clustering algorithm for person-clustering in videos using cues from several modalities (face, body, and voice). Second, we introduce a Video Person-Clustering dataset, for evaluating multi-modal person-clustering. It contains body-tracks for each annotated character, face-tracks when visible, and voice-tracks when speaking, with their associated features. The dataset is by far the largest of its kind, and covers films and TV-shows representing a wide range of demographics. Finally, we show the effectiveness of using multiple modalities for person-clustering, explore the use of this new broad task for story understanding through character co-occurrences, and achieve a new state of the art on all available datasets for face and person-clustering.
