Natural and artificial audition can, in principle, acquire different solutions to a given problem. The constraints of the task, however, can nudge the cognitive science and engineering of audition to converge qualitatively, suggesting that closer mutual examination could enrich both artificial hearing systems and process models of the mind and brain. Speech recognition, an area ripe for such exploration, is inherently robust in humans to a number of transformations at various spectrotemporal granularities. To what extent are these robustness profiles accounted for by high-performing neural network systems? We bring together experiments in speech recognition under a single synthesis framework to evaluate state-of-the-art neural networks as stimulus-computable, optimized observers. In a series of experiments, we (1) clarify how influential speech manipulations in the literature relate to each other and to natural speech, (2) show the granularities at which machines exhibit out-of-distribution robustness, reproducing classical perceptual phenomena in humans, (3) identify the specific conditions where model predictions of human performance diverge, and (4) demonstrate a crucial failure of all artificial systems to perceptually recover where humans do, suggesting alternative directions for theory and model building. These findings encourage a tighter synergy between the cognitive science and engineering of audition.