Multimodal foundation models are better simulators of the human brain

Multimodal learning, especially large-scale multimodal pre-training, has developed rapidly over the past few years and has driven some of the greatest advances in artificial intelligence (AI). Despite its effectiveness, understanding the underlying mechanism of multimodal pre-trained models remains a grand challenge. Making such models explainable is likely to enable breakthroughs in novel learning paradigms in the AI field. To this end, given the multimodal nature of the human brain, we propose to explore the explainability of multimodal learning models with the aid of non-invasive brain imaging technologies such as functional magnetic resonance imaging (fMRI). Concretely, we first present a newly designed multimodal foundation model pre-trained on 15 million image-text pairs, which shows strong multimodal understanding and generalization abilities in a variety of cognitive downstream tasks. Further, from the perspective of neural encoding (based on our foundation model), we find that both visual and lingual encoders trained multimodally are more brain-like than unimodal ones. In particular, we identify a number of brain regions where multimodally trained encoders demonstrate better neural encoding performance. This is consistent with the findings of existing studies on multi-sensory integration in the brain. Therefore, we believe that multimodal foundation models are more suitable tools for neuroscientists to study the multimodal signal processing mechanisms in the human brain. Our findings also demonstrate the potential of multimodal foundation models as ideal computational simulators to promote both AI-for-brain and brain-for-AI research.
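The neural encoding comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the regression method, the evaluation metric, or the data layout, so everything below is an assumption: a closed-form ridge regression maps encoder features to voxel responses, and held-out Pearson correlation per voxel serves as the encoding score. The function names (`ridge_fit`, `voxelwise_pearson`, `encoding_score`) are hypothetical, and the data are synthetic, constructed so that one feature set predicts the responses by design. The sketch shows only the mechanics of comparing two encoders and does not reproduce any result from the paper.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

def encoding_score(features, responses, n_train, alpha=1.0):
    """Fit on the first n_train stimuli and score on the remaining ones."""
    W = ridge_fit(features[:n_train], responses[:n_train], alpha)
    pred = features[n_train:] @ W
    return voxelwise_pearson(responses[n_train:], pred)

# Synthetic stand-in data (assumption): rows are stimuli, columns are
# encoder feature dimensions or voxels. Real fMRI data would replace these.
rng = np.random.default_rng(0)
n_stimuli, n_features, n_voxels = 200, 16, 5
features_a = rng.standard_normal((n_stimuli, n_features))  # stands in for one encoder
features_b = rng.standard_normal((n_stimuli, n_features))  # stands in for another encoder

# Responses are generated from features_a, so encoder A wins by construction.
true_weights = rng.standard_normal((n_features, n_voxels))
noise = 0.1 * rng.standard_normal((n_stimuli, n_voxels))
responses = features_a @ true_weights + noise

scores_a = encoding_score(features_a, responses, n_train=150)
scores_b = encoding_score(features_b, responses, n_train=150)
print("mean held-out correlation, encoder A:", scores_a.mean())
print("mean held-out correlation, encoder B:", scores_b.mean())
```

In an actual study the two feature matrices would come from a multimodally trained encoder and a unimodal one evaluated on the same stimuli, and the per-voxel scores would be compared region by region; cross-validation and a tuned `alpha` would normally replace the single train/test split used here.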
