Enabling effective brain-computer interfaces requires understanding how the human brain encodes stimuli across modalities such as vision and language (text). Brain encoding aims to predict fMRI brain activity given an input stimulus. A plethora of neural encoding models has studied brain encoding for single-mode stimuli: visual (pretrained CNNs) or text (pretrained language models). A few recent papers have also obtained separate visual and text representation models and performed late fusion using simple heuristics. However, previous work has failed to explore (a) the effectiveness of image Transformer models for encoding visual stimuli, and (b) co-attentive multi-modal modeling for visual and textual reasoning. In this paper, we systematically explore the efficacy of image Transformers (ViT, DEiT, and BEiT) and multi-modal Transformers (VisualBERT, LXMERT, and CLIP) for brain encoding. Extensive experiments on two popular datasets, BOLD5000 and Pereira, provide the following insights. (1) To the best of our knowledge, we are the first to investigate the effectiveness of image and multi-modal Transformers for brain encoding. (2) We find that VisualBERT, a multi-modal Transformer, significantly outperforms previously proposed single-mode CNNs, image Transformers, and other multi-modal models, thereby establishing a new state-of-the-art. This superiority of visio-linguistic models raises the question of whether the responses elicited in visual brain regions are implicitly influenced by linguistic processing even when images are viewed passively. Future fMRI studies can verify this computational insight in an appropriate experimental setting.
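To make the evaluated pipeline concrete, below is a minimal sketch of a brain-encoding model of the kind described above: representations from a pretrained image Transformer (ViT here) are mapped to per-voxel fMRI responses with a linear ridge readout. The model checkpoint, mean-pooling over tokens, ridge penalty, and the synthetic stand-in data are all illustrative assumptions, not the paper's exact configuration.

```python
# A minimal brain-encoding sketch: pretrained ViT features -> ridge
# regression onto fMRI voxels. Checkpoint, pooling, and alpha are
# illustrative assumptions; the random data are stand-ins for real stimuli.
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
vit.eval()

def vit_features(images):
    """Return one mean-pooled last-layer ViT embedding per image."""
    rows = []
    with torch.no_grad():
        for img in images:
            inputs = processor(images=img, return_tensors="pt")
            hidden = vit(**inputs).last_hidden_state  # (1, tokens, 768)
            rows.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(rows)

# Stand-ins for the real data: stimulus images and their aligned
# (n_stimuli, n_voxels) fMRI responses. Replace with BOLD5000/Pereira data.
rng = np.random.default_rng(0)
images = [Image.fromarray(rng.integers(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(20)]
voxels = rng.standard_normal((20, 500))

X = vit_features(images)
X_tr, X_te, y_tr, y_te = train_test_split(X, voxels, test_size=0.25,
                                          random_state=0)
encoder = Ridge(alpha=1.0).fit(X_tr, y_tr)  # voxel-wise linear readout
print("held-out R^2:", encoder.score(X_te, y_te))
```

The same readout can be reused unchanged with features from a multi-modal Transformer such as VisualBERT, which is what makes encoder comparisons across stimulus representations straightforward.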