How the brain encodes external stimuli, and how these stimuli can be decoded from measured brain activity, are long-standing and challenging questions in neuroscience. In this paper, we focus on reconstructing complex image stimuli from fMRI (functional magnetic resonance imaging) signals. Unlike previous works that reconstruct images containing single objects or simple shapes, our work aims to reconstruct image stimuli that are rich in semantics, closer to everyday scenes, and can reveal more perspectives. However, the scarcity of fMRI data is the main obstacle to applying state-of-the-art deep learning models to this problem. We find that incorporating an additional text modality benefits reconstruction compared to translating brain signals directly into images. The modalities involved in our method are therefore: (i) voxel-level fMRI signals, (ii) the observed images that trigger the brain signals, and (iii) textual descriptions of the images. To further address data scarcity, we leverage an aligned vision-language latent space pre-trained on massive datasets. Instead of training models from scratch to find a latent space shared by the three modalities, we encode fMRI signals into this pre-aligned latent space. Then, conditioned on embeddings in this space, we reconstruct images with a generative model. The reconstructed images from our pipeline balance naturalness and fidelity: they are photo-realistic and capture the contents of the ground-truth images well.
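The encoding step described above, mapping voxel-level fMRI signals into a pre-aligned latent space, can be illustrated with a minimal sketch. This is not the method of the paper: it assumes a plain linear (ridge regression) encoder, uses synthetic data in place of real fMRI recordings and pre-trained vision-language embeddings, and omits the generative model entirely. All names (`fit_ridge`, `l2_normalize`, `true_map`) and all dimensions are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def l2_normalize(Z, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + eps)


rng = np.random.default_rng(0)
n_train, n_test, n_voxels, embed_dim = 200, 20, 50, 16

# Synthetic stand-ins (assumption): "voxels" play the role of fMRI signals and
# "embeddings" the role of vectors from a pre-aligned vision-language space.
true_map = rng.normal(size=(n_voxels, embed_dim))
voxels = rng.normal(size=(n_train + n_test, n_voxels))
noise = 0.1 * rng.normal(size=(n_train + n_test, embed_dim))
embeddings = voxels @ true_map + noise

# Fit the encoder on the training split only.
W = fit_ridge(voxels[:n_train], embeddings[:n_train], alpha=1.0)

# Project held-out voxels into the latent space and check, by cosine
# similarity, whether each prediction is closest to its own target.
pred = l2_normalize(voxels[n_train:] @ W)
target = l2_normalize(embeddings[n_train:])
similarity = pred @ target.T
top1 = similarity.argmax(axis=1)
accuracy = float((top1 == np.arange(n_test)).mean())
print(f"held-out top-1 retrieval accuracy: {accuracy:.2f}")
```

In a full pipeline the predicted embedding would condition an image generator rather than be used for retrieval; retrieval appears here only because it gives a simple check that the predicted vectors land near the right points in the latent space.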