Human beings have developed a remarkable ability to integrate information from various sensory sources by exploiting their inherent complementarity. Perceptual capabilities are thereby heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e., isolating and disambiguating speech amid a panoply of competing signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, neuroscience has identified the superior colliculus as the brain region responsible for this modality fusion, and a handful of biological models have been proposed to describe its underlying neurophysiological processes. Drawing inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability has a wide range of applications, from teleconferencing systems to social robotics. The detection approach first routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates the spatial cross-mapping of neurons over unimodal perceptual fields. The validation process employed two publicly available datasets, with the achieved results confirming and greatly surpassing initial expectations.
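The two-branch fusion described above can be illustrated with a minimal toy sketch. This is a hypothetical illustration, not the paper's implementation: the embedding sizes, the outer-product "cross-mapping", and the linear readout are all assumptions chosen to mimic, in miniature, how superior colliculus neurons pair overlapping auditory and visual receptive fields.

```python
import numpy as np

# Hypothetical sketch (not the paper's architecture): fuse unimodal
# embeddings via an outer-product "cross-mapping", loosely emulating the
# pairing of auditory and visual receptive fields in the superior colliculus.
rng = np.random.default_rng(0)

audio_emb = rng.standard_normal(8)   # stand-in for the audio branch output
visual_emb = rng.standard_normal(8)  # stand-in for the visual branch output

# Pairwise cross-map: entry (i, j) couples audio unit i with visual unit j.
cross_map = np.outer(audio_emb, visual_emb)         # shape (8, 8)

# Flatten into a fused embedding and score it with a (random) linear readout.
fused = cross_map.ravel()
w = rng.standard_normal(fused.size)
speaker_score = 1.0 / (1.0 + np.exp(-(w @ fused)))  # sigmoid: "is speaking?"
print(cross_map.shape, float(speaker_score))
```

In a trained system the random readout would be a learned classifier head, and the cross-map would carry a learnable topology; the sketch only shows how two unimodal embeddings can be combined multiplicatively into a single spatial fusion map.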