Contrastive language-image models such as CLIP have demonstrated remarkable generalization capabilities. However, how their internal visual representations evolve during training, and how this evolution relates to human perception, remains poorly understood. Most existing analyses characterize fully trained models, leaving the dynamics of representational biases and perceptual alignment largely unexplored. In this work, we present an epoch-by-epoch analysis of CLIP models throughout training, focusing on the evolution of texture-shape bias, alignment with human perceptual judgments, and sensitivity to image noise. Using multiple perceptual benchmarks spanning low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noise robustness, we identify a consistent, training-stage-dependent representational transition. Early training stages exhibit strong texture bias, elevated alignment with low-level human perceptual measures, and increased sensitivity to Gaussian noise perturbations. As training progresses, this texture bias gradually diminishes in favor of more shape-based representations, coinciding with improved robustness to noise and a decline in low-level perceptual alignment. Importantly, these dynamics are consistently observed across multiple CLIP model scales, indicating that the phenomenon is not specific to a particular architecture size. Our findings provide an empirical characterization of how perceptual alignment, feature bias, and robustness co-evolve during multimodal model training. This work reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human visual processing.