Decoding human visual neural representations is a challenging task of great scientific significance for revealing vision-processing mechanisms and developing brain-like intelligent machines. Most existing methods struggle to generalize to novel categories for which no corresponding neural data are available for training, for two main reasons: 1) the under-exploitation of the multimodal semantic knowledge underlying the neural data, and 2) the scarcity of paired (stimulus-response) training data. To overcome these limitations, this paper presents BraVL, a generic neural decoding method based on multimodal learning of brain-visual-linguistic features. We focus on modeling the relationships among brain, visual, and linguistic features via multimodal deep generative models. Specifically, we leverage the mixture-of-products-of-experts formulation to infer a latent code that enables coherent joint generation of all three modalities. To learn a more consistent joint representation and improve data efficiency when brain activity data are limited, we exploit both intra- and inter-modality mutual-information maximization regularization terms. Notably, our BraVL model can be trained under various semi-supervised scenarios to incorporate visual and textual features obtained from extra categories. Finally, we construct three trimodal matching datasets, and extensive experiments lead to several interesting conclusions and cognitive insights: 1) decoding novel visual categories from human brain activity is practically feasible with good accuracy; 2) decoding models that combine visual and linguistic features perform much better than those using either modality alone; and 3) visual perception may be accompanied by linguistic influences in representing the semantics of visual stimuli. Code and data: https://github.com/ChangdeDu/BraVL.
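As a rough illustration of the mixture-of-products-of-experts (MoPoE) fusion named above, the following is a minimal PyTorch-style sketch assuming Gaussian unimodal posteriors. The function names and the choice to include a standard-normal prior as an extra expert are illustrative assumptions, not the authors' exact implementation.

```python
import itertools
import torch

def product_of_experts(mus, logvars):
    """Fuse Gaussian experts q(z|x_i) = N(mu_i, sigma_i^2) by
    precision-weighted averaging; a standard-normal prior expert
    is included so single-modality subsets stay regularized.
    (Prior-as-expert is an assumption, not confirmed by the paper.)"""
    mus = torch.stack(list(mus) + [torch.zeros_like(mus[0])])
    logvars = torch.stack(list(logvars) + [torch.zeros_like(logvars[0])])
    precision = torch.exp(-logvars)                 # 1 / sigma_i^2
    joint_var = 1.0 / precision.sum(dim=0)
    joint_mu = joint_var * (precision * mus).sum(dim=0)
    return joint_mu, torch.log(joint_var)

def mopoe_posterior(mus, logvars):
    """Mixture-of-products-of-experts: one PoE Gaussian per non-empty
    subset of modalities, mixed with uniform weights. Returns the
    list of subset (mu, logvar) components."""
    m = len(mus)
    subsets = [s for r in range(1, m + 1)
               for s in itertools.combinations(range(m), r)]
    return [product_of_experts([mus[i] for i in s], [logvars[i] for i in s])
            for s in subsets]

def sample_mopoe(mus, logvars):
    """Draw z by picking a subset uniformly, then reparameterizing
    from its PoE Gaussian."""
    components = mopoe_posterior(mus, logvars)
    mu, logvar = components[torch.randint(len(components), (1,)).item()]
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```

With three modalities (brain, visual, linguistic), this yields seven subset posteriors; because training covers all subsets, the model can still infer the latent code from any subset at test time, e.g., from brain activity alone when decoding.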
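The intra- and inter-modality mutual-information maximization terms mentioned above can be instantiated in several ways; a common choice is the InfoNCE lower bound, sketched below under that assumption. The temperature and normalization details are illustrative, not necessarily the paper's exact estimator.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE lower bound on mutual information between paired
    representations. z_a, z_b: (batch, dim) features of the same
    stimuli from two views or modalities; minimizing this loss
    maximizes the MI bound. (Assumed estimator, for illustration.)"""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature   # pairwise cosine similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)
    # matched pairs lie on the diagonal and act as the positives
    return F.cross_entropy(logits, labels)
```

Under this reading, an inter-modality term would pair, for example, brain and visual features of the same stimulus, while an intra-modality term would pair two encoded views of the same modality.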