Generative Search Engines (GSEs) leverage Retrieval-Augmented Generation (RAG) techniques and Large Language Models (LLMs) to integrate multi-source information and provide users with accurate and comprehensive responses. Unlike traditional search engines that present results in ranked lists, GSEs shift users' attention from sequential browsing to content-driven subjective perception, driving a paradigm shift in information retrieval. In this context, enhancing the subjective visibility of content through Generative Search Engine Optimization (G-SEO) methods has emerged as a new research focus. With the rapid advancement of Multimodal Retrieval-Augmented Generation (MRAG) techniques, GSEs can now efficiently integrate text, images, audio, and video, producing richer responses that better satisfy complex information needs. Existing G-SEO methods, however, remain limited to text-based optimization and fail to fully exploit multimodal data. To address this gap, we propose Caption Injection, the first multimodal G-SEO approach, which extracts captions from images and injects them into textual content, integrating visual semantics to enhance the subjective visibility of content in generative search scenarios. We systematically evaluate Caption Injection on MRAMG, a benchmark for MRAG, under both unimodal and multimodal settings. Experimental results show that Caption Injection significantly outperforms text-only G-SEO baselines under the G-Eval metric, demonstrating the necessity and effectiveness of multimodal integration in G-SEO to improve user-perceived content visibility.
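As a rough illustration of the idea described in the abstract (extracting captions from images and injecting them into the accompanying text), the following is a minimal sketch and not the authors' implementation. The `Document` type, the `inject_captions` function, the `[Image: ...]` formatting, and the `stub_captioner` placeholder are all hypothetical names introduced here; a real pipeline would presumably call an image-captioning model in place of the stub and might refine or position the captions differently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    """A source document: its text plus paths to associated images (hypothetical type)."""
    text: str
    image_paths: List[str]


def inject_captions(doc: Document, captioner: Callable[[str], str]) -> str:
    """Return the document text with one caption line per image appended.

    `captioner` is any callable mapping an image path to a caption string;
    which captioning model to use is an assumption left to the caller.
    """
    captions = [captioner(path).strip() for path in doc.image_paths]
    captions = [c for c in captions if c]  # drop empty captions
    if not captions:
        return doc.text  # nothing to inject, leave the text unchanged
    caption_block = "\n".join(f"[Image: {c}]" for c in captions)
    return f"{doc.text.rstrip()}\n\n{caption_block}"


def stub_captioner(path: str) -> str:
    """Placeholder standing in for a real image-captioning model."""
    return f"placeholder caption for {path}"
```

For example, calling `inject_captions(Document("Some article text.", ["fig1.png"]), stub_captioner)` returns the original text followed by a blank line and a single `[Image: ...]` line. Appending captions at the end is only one possible design choice; interleaving them near the passages they illustrate would be an equally plausible variant.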