Humans fuse information from the auditory and visual modalities to help understand speech. This is demonstrated by the McGurk effect, in which a listener presented with incongruent auditory and visual speech fuses the two into the percept of illusory intermediate phonemes. Building on a recent framework that proposes how to address developmental 'why' questions using artificial neural networks (ANNs), we evaluated a set of recent ANNs trained on audiovisual speech by testing them with audiovisually incongruent words designed to elicit the McGurk effect. We show that networks trained entirely on congruent audiovisual speech nevertheless exhibit the McGurk percept. We further investigated 'why' by comparing networks trained on clean speech with those trained on noisy speech, and found that training with noisy speech led to a pronounced increase in both visual responses and McGurk responses across all models. Furthermore, systematically increasing the level of auditory noise during ANN training increased the amount of audiovisual integration up to a point, but at extreme noise levels this integration failed to develop. These results suggest that excessive noise exposure during critical periods of audiovisual learning may negatively influence the development of audiovisual speech integration. This work also demonstrates that the McGurk effect reliably emerges, without being explicitly trained, in the behaviour of both supervised and unsupervised networks, even networks trained only on congruent speech. This supports the notion that ANNs might be useful models for certain aspects of perception and cognition.