Reconstructing visual information from brain activity via computer vision technology provides an intuitive understanding of visual neural mechanisms. Despite progress in decoding fMRI data with generative models, achieving accurate cross-subject reconstruction of visual stimuli remains challenging and computationally demanding. This difficulty arises from inter-subject variability in neural representations and the brain's abstract encoding of core semantic features in complex visual inputs. To address these challenges, we propose NeuroSwift, which integrates two complementary adapters within a diffusion pipeline: an AutoKL adapter for low-level visual features and a CLIP adapter for semantics. NeuroSwift's CLIP adapter is trained on Stable Diffusion-generated images paired with COCO captions to emulate encoding in the higher visual cortex. For cross-subject generalization, we pretrain on one subject and then fine-tune only 17% of the parameters (the fully connected layers) for each new subject, freezing all other components. This yields state-of-the-art performance, outperforming existing methods with only one hour of training per subject on lightweight GPUs (three RTX 4090s).
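To make the cross-subject adaptation strategy concrete, the sketch below illustrates one plausible way to freeze a pretrained model and expose only its fully connected layers for fine-tuning on a new subject. This is a minimal PyTorch sketch, not the authors' implementation; the `NeuroSwiftModel` name and the learning rate are hypothetical placeholders.

```python
# Minimal sketch (assumed, not the authors' code) of the fine-tuning strategy:
# keep a model pretrained on one subject, update only its fully connected
# layers for a new subject, and freeze every other component.
import torch
import torch.nn as nn


def prepare_for_new_subject(model: nn.Module) -> list:
    """Freeze all parameters except nn.Linear layers; return the trainable ones."""
    # Freeze everything first (diffusion backbone, adapters, etc.).
    for p in model.parameters():
        p.requires_grad = False

    trainable = []
    # Unfreeze only the fully connected layers (the subject-specific subset).
    for module in model.modules():
        if isinstance(module, nn.Linear):
            for p in module.parameters():
                p.requires_grad = True
                trainable.append(p)
    return trainable


# Usage sketch: optimize only the unfrozen subset on the new subject's fMRI data.
# model = NeuroSwiftModel(...)             # hypothetical; load pretrained weights here
# params = prepare_for_new_subject(model)
# optimizer = torch.optim.AdamW(params, lr=1e-4)   # lr is an illustrative choice
```

Restricting the trainable set to the fully connected layers is what keeps per-subject adaptation to roughly 17% of the parameters and about one hour of training in the reported setup.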