Frozen Large Video Language Models (LVLMs) are increasingly employed in micro-video recommendation due to their strong multimodal understanding. However, their integration lacks systematic empirical evaluation: practitioners typically deploy LVLMs as fixed black-box feature extractors without rigorously comparing alternative representation strategies. To address this gap, we present the first systematic empirical study along two key design dimensions: (i) integration strategies with ID embeddings, specifically replacement versus fusion, and (ii) feature extraction paradigms, comparing LVLM-generated captions with intermediate decoder hidden states. Extensive experiments on representative LVLMs reveal three key principles: (1) intermediate hidden states consistently outperform caption-based representations, as natural-language summarization inevitably discards fine-grained visual semantics crucial for recommendation; (2) ID embeddings capture irreplaceable collaborative signals, rendering fusion strictly superior to replacement; and (3) the effectiveness of intermediate decoder features varies significantly across layers. Guided by these insights, we propose the Dual Feature Fusion (DFF) Framework, a lightweight and plug-and-play approach that adaptively fuses multi-layer representations from frozen LVLMs with item ID embeddings. DFF achieves state-of-the-art performance on two real-world micro-video recommendation benchmarks, consistently outperforming strong baselines and providing a principled approach to integrating off-the-shelf large vision-language models into micro-video recommender systems.
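The adaptive fusion described above can be illustrated with a minimal sketch. This is an assumed formulation, not the authors' exact implementation: layer features from the frozen LVLM are combined via learnable softmax weights (`layer_logits`), then mixed with the item ID embedding through a convex combination controlled by a hypothetical coefficient `alpha`.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dff_fuse(layer_feats, id_emb, layer_logits, alpha=0.5):
    """Illustrative sketch of Dual Feature Fusion (assumed form).

    layer_feats  : (L, d) array of frozen-LVLM hidden states, one row per layer
    id_emb       : (d,)   item ID embedding carrying collaborative signals
    layer_logits : (L,)   learnable per-layer weights (here, plain logits)
    alpha        : assumed mixing coefficient between LVLM and ID features
    """
    w = softmax(layer_logits)                    # adaptive layer weights, sum to 1
    lvlm_repr = np.tensordot(w, layer_feats, 1)  # weighted sum over layers -> (d,)
    return alpha * lvlm_repr + (1 - alpha) * id_emb

# Toy example: 3 decoder layers, embedding dimension 4.
layer_feats = np.ones((3, 4))
id_emb = np.zeros(4)
fused = dff_fuse(layer_feats, id_emb, layer_logits=np.zeros(3))
```

In a real system the layer logits (and possibly `alpha`) would be trained jointly with the recommender, while the LVLM itself stays frozen; only the fusion parameters and ID embeddings receive gradients.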