Decoding images from fMRI often involves mapping brain activity to CLIP's final semantic layer. To capture finer visual details, many approaches add a parameter-intensive VAE-based pipeline. However, these approaches overlook the rich object information in CLIP's intermediate layers and contradict the brain's functional hierarchy. We introduce BrainMCLIP, which pioneers a parameter-efficient, multi-layer fusion approach guided by the human visual system's functional hierarchy, eliminating the need for a separate VAE pathway. BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-/high-level) with the corresponding intermediate and final CLIP layers, respecting this functional hierarchy. We further introduce a Cross-Reconstruction strategy and a novel multi-granularity loss. Results show that BrainMCLIP achieves highly competitive performance, particularly excelling on high-level semantic metrics, where it matches or surpasses state-of-the-art (SOTA) methods, including those using VAE pipelines. Crucially, it does so with substantially fewer parameters: by avoiding the VAE pathway, it reduces parameter count by 71.7\% (Table~\ref{tab:compare_clip_vae}) compared to top VAE-based SOTA methods. By leveraging intermediate CLIP features, it effectively captures visual details often missed by CLIP-only approaches, striking a compelling balance between semantic accuracy and detail fidelity without requiring a separate VAE pipeline.
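To make the multi-layer alignment idea concrete, below is a minimal sketch (not the authors' implementation) of mapping fMRI signals from low- and high-level visual ROIs to an intermediate and a final CLIP image-encoder layer with separate heads, combined under a simple two-granularity alignment objective. Voxel counts, embedding sizes, and the loss weighting are hypothetical placeholders; the paper's Cross-Reconstruction strategy and multi-granularity loss are richer than this.

\begin{verbatim}
import torch
import torch.nn as nn

class MultiLayerFMRIMapper(nn.Module):
    # Hypothetical voxel counts and CLIP feature dimensions for illustration.
    def __init__(self, n_voxels_low=4000, n_voxels_high=6000,
                 d_intermediate=1024, d_final=768):
        super().__init__()
        # Low-level visual areas (e.g., early visual cortex)
        # -> an intermediate CLIP layer's features.
        self.low_head = nn.Linear(n_voxels_low, d_intermediate)
        # High-level visual areas (e.g., ventral stream)
        # -> CLIP's final semantic embedding.
        self.high_head = nn.Linear(n_voxels_high, d_final)

    def forward(self, fmri_low, fmri_high):
        return self.low_head(fmri_low), self.high_head(fmri_high)

def alignment_loss(pred_mid, target_mid, pred_final, target_final, w_mid=0.5):
    # Cosine alignment at two granularities (intermediate and final layers).
    cos = nn.CosineSimilarity(dim=-1)
    loss_mid = 1.0 - cos(pred_mid, target_mid).mean()
    loss_final = 1.0 - cos(pred_final, target_final).mean()
    return w_mid * loss_mid + (1.0 - w_mid) * loss_final
\end{verbatim}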