Toward a realistic model of speech processing in the brain with self-supervised learning

Several deep neural networks have recently been shown to generate activations similar to those of the brain in response to the same input. These algorithms, however, remain largely implausible as models of the brain: they require (1) extraordinarily large amounts of data, (2) unobtainable supervised labels, (3) textual rather than raw sensory input, and/or (4) implausibly large memory (e.g. thousands of contextual words). These limitations highlight the need to identify algorithms that, under the same constraints, would suffice to account for both behavioral and brain responses. Focusing on speech processing, we hypothesize that self-supervised algorithms trained on the raw waveform constitute a promising candidate. Specifically, we compare a recent self-supervised architecture, Wav2Vec 2.0, to the brain activity of 412 English, French, and Mandarin speakers recorded with functional Magnetic Resonance Imaging (fMRI) while they listened to approximately one hour of audiobooks. Our results are fourfold. First, we show that this algorithm learns brain-like representations with as little as 600 hours of unlabelled speech, a quantity comparable to what infants can be exposed to during language acquisition. Second, its functional hierarchy aligns with the cortical hierarchy of speech processing. Third, different training regimes reveal a functional specialization akin to that of the cortex: Wav2Vec 2.0 learns sound-generic, speech-specific, and language-specific representations similar to those of the prefrontal and temporal cortices. Fourth, we confirm the similarity of this specialization with the behavior of 386 additional participants. These results, drawn from the largest neuroimaging benchmark to date, show how self-supervised learning can account for a rich organization of speech processing in the brain, and thus delineate a path to identify the laws of language acquisition which shape the human brain.

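The abstract says that model activations are compared to fMRI recordings but does not describe how. One common approach for this kind of comparison is a linear encoding model: fit a ridge regression from model activations to each voxel's response, then score it by the Pearson correlation between predicted and measured responses on held-out data. The sketch below illustrates that idea on synthetic data only. It is an assumption about the general technique, not the authors' pipeline: the names `ridge_fit` and `brain_score`, the penalty `alpha`, and the simulated "activations" and "voxels" are all hypothetical, and real analyses would also need to handle the hemodynamic response and cross-validation.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w


def ridge_fit(X, y, alpha):
    """Ridge weights from the normal equations (X^T X + alpha I) w = X^T y."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(A, b)


def brain_score(X_train, y_train, X_test, y_test, alpha=1.0):
    """Correlation between held-out responses and ridge predictions."""
    w = ridge_fit(X_train, y_train, alpha)
    pred = [sum(wi * xi for wi, xi in zip(w, row)) for row in X_test]
    return pearson(pred, y_test)


# Synthetic stand-ins (hypothetical): 200 time samples, 5 activation features.
rng = random.Random(0)
n_samples, split = 200, 150
true_w = [1.0, -0.5, 0.8, 0.0, 0.3]
X = [[rng.gauss(0, 1) for _ in true_w] for _ in range(n_samples)]

# One simulated voxel driven by the features, one that is pure noise.
y_related = [sum(w * x for w, x in zip(true_w, row)) + rng.gauss(0, 0.5)
             for row in X]
y_unrelated = [rng.gauss(0, 1) for _ in range(n_samples)]

score_related = brain_score(X[:split], y_related[:split],
                            X[split:], y_related[split:])
score_unrelated = brain_score(X[:split], y_unrelated[:split],
                              X[split:], y_unrelated[split:])
print(f"related voxel: {score_related:.2f}, "
      f"unrelated voxel: {score_unrelated:.2f}")
```

In this toy setting the voxel that depends on the features should score well above the noise voxel. In a real study the feature matrix would come from a layer of the speech model and the targets from measured fMRI time series, which is why the score is computed on held-out samples rather than on the data used for fitting.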