Natural and artificial audition can, in principle, evolve different solutions to a given problem. The constraints of the task, however, can nudge the cognitive science and engineering of audition to converge qualitatively, suggesting that a closer mutual examination would improve artificial hearing systems and process models of the mind and brain. Speech recognition - an area ripe for such exploration - is inherently robust in humans to a number of transformations at various spectrotemporal granularities. To what extent are these robustness profiles accounted for by high-performing neural network systems? We bring together experiments in speech recognition under a single synthesis framework to evaluate state-of-the-art neural networks as stimulus-computable, optimized observers. In a series of experiments, we (1) clarify how influential speech manipulations in the literature relate to each other and to natural speech, (2) show the granularities at which machines exhibit out-of-distribution robustness, reproducing classical perceptual phenomena in humans, (3) identify the specific conditions where model predictions of human performance differ, and (4) demonstrate a crucial failure of all artificial systems to perceptually recover where humans do, suggesting a key specification for theory and model building. These findings encourage a tighter synergy between the cognitive science and engineering of audition.