Self-supervised language models are very effective at predicting high-level cortical responses during language comprehension. However, the best current models of lower-level auditory processing in the human brain rely on either hand-constructed acoustic filters or representations from supervised audio neural networks. In this work, we capitalize on the progress of self-supervised speech representation learning (SSL) to create new state-of-the-art models of the human auditory system. Compared against acoustic baselines, phonemic features, and supervised models, representations from the middle layers of self-supervised models (APC, wav2vec, wav2vec 2.0, and HuBERT) consistently yield the best prediction performance for fMRI recordings within the auditory cortex (AC). Brain areas involved in low-level auditory processing exhibit a preference for earlier SSL model layers, whereas higher-level semantic areas prefer later layers. We show that these trends are due to the models' ability to encode information at multiple linguistic levels (acoustic, phonetic, and lexical) along their representation depth. Overall, these results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
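Below is a minimal sketch of the layerwise encoding approach the abstract describes: extract hidden states from every layer of a self-supervised speech model, bin them to the fMRI sampling rate, and fit a regularized linear model per layer to predict voxel responses. This is not the authors' released code; the checkpoint name, the TR length, the simple within-TR averaging (standing in for the delayed/FIR features typical of fMRI encoding models), and the synthetic data are all illustrative assumptions.

```python
import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import Wav2Vec2Model

# Illustrative checkpoint; any of the SSL models named in the abstract
# (APC, wav2vec, wav2vec 2.0, HuBERT) could be substituted here.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def layer_features(waveform_16khz: np.ndarray, tr_frames: int):
    """Return one feature matrix per model layer, binned to the fMRI TR."""
    with torch.no_grad():
        out = model(torch.tensor(waveform_16khz)[None, :].float(),
                    output_hidden_states=True)
    feats = []
    for layer in out.hidden_states:           # each layer: (1, T, D)
        x = layer[0].numpy()
        n_trs = x.shape[0] // tr_frames
        # Average frames within each TR-length bin (a crude stand-in for
        # proper hemodynamic delay features).
        feats.append(x[: n_trs * tr_frames].reshape(n_trs, tr_frames, -1).mean(1))
    return feats

# Synthetic stand-ins: 60 s of audio and fake voxel responses.
wav = np.random.randn(16_000 * 60).astype(np.float32)
tr_frames = 100                               # ~2 s TR at wav2vec 2.0's ~50 Hz frame rate
feats = layer_features(wav, tr_frames)
voxels = np.random.randn(feats[0].shape[0], 500)   # (TRs, voxels)

# One encoding model per layer; comparing held-out scores across layers is
# what reveals the early-layer vs. late-layer preferences described above.
for i, X in enumerate(feats):
    r2 = Ridge(alpha=1.0).fit(X[:20], voxels[:20]).score(X[20:], voxels[20:])
    print(f"layer {i:2d}: held-out R^2 = {r2:.3f}")
```

With real stimuli and recordings, the per-layer held-out scores plotted against layer depth would trace out the acoustic-to-semantic gradient the abstract reports, with auditory-cortex voxels peaking at earlier layers than higher-level language areas.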