Current Active Speaker Detection (ASD) models achieve strong results on AVA-ActiveSpeaker (AVA) using only audio and facial features. Although this approach is applicable to movie setups (AVA), it is not suited to less constrained conditions. To demonstrate this limitation, we propose the Wilder Active Speaker Detection (WASD) dataset, which increases difficulty by targeting the two key components of current ASD: audio and face. Grouped into 5 categories, ranging from optimal conditions to surveillance settings, WASD contains incremental challenges for ASD via tactical impairment of audio and face data. We select state-of-the-art models and assess their performance on two groups of WASD: Easy (cooperative settings) and Hard (audio and/or face specifically degraded). The results show that: 1) AVA-trained models maintain state-of-the-art performance on the WASD Easy group while underperforming on the Hard one, 2) demonstrating the similarity between AVA and Easy data; and 3) training on WASD does not raise model performance to AVA levels, particularly under audio impairment and surveillance settings. This shows that AVA does not prepare models for wild ASD and that current approaches are subpar for such conditions. The proposed dataset also contains body data annotations to provide a new source of information for ASD, and is available at https://github.com/Tiago-Roxo/WASD.