Early childhood developmental trajectories offer a natural target for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that substantially improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, the DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage of a longitudinal, infant-centric audiovisual corpus while minimizing curation, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. The DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks covering spatial reasoning, memory, and vocabulary understanding, aligned with early childhood capabilities. Experimental results show that a compact model pretrained from scratch achieves competitive performance on the DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research on developmentally plausible pretraining of vision foundation models.