Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall; all statements are fully supported by citations) and accurately (high citation precision; every citation supports its associated statement). We conduct a human evaluation to audit four popular generative search engines (Bing Chat, NeevaAI, perplexity.ai, and YouChat) across a diverse set of queries from a variety of sources (e.g., historical Google user queries and dynamically collected open-ended questions on Reddit). We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, only 51.5% of generated sentences are fully supported by citations, and only 74.5% of citations support their associated sentence. We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. We hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems.
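To make the two verifiability metrics concrete, the following is a minimal sketch that follows only the abstract's wording: citation recall as the fraction of generated sentences that are fully supported by their citations, and citation precision as the fraction of citations that support their associated sentence. The `Sentence` class, its field names, and the example judgments are hypothetical and introduced purely for illustration; the full evaluation protocol may define and weight these quantities differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sentence:
    """One generated sentence with hypothetical human judgments attached."""
    text: str
    fully_supported: bool  # annotator: is the sentence fully supported by its citations?
    citation_supports: List[bool] = field(default_factory=list)  # one judgment per citation


def citation_recall(sentences: List[Sentence]) -> float:
    """Fraction of sentences judged fully supported by their citations."""
    if not sentences:
        return 0.0
    return sum(s.fully_supported for s in sentences) / len(sentences)


def citation_precision(sentences: List[Sentence]) -> float:
    """Fraction of individual citations judged to support their sentence."""
    judgments = [j for s in sentences for j in s.citation_supports]
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


# Made-up annotations for a four-sentence response (not data from the study).
response = [
    Sentence("Sentence A", fully_supported=True, citation_supports=[True]),
    Sentence("Sentence B", fully_supported=False, citation_supports=[False, True]),
    Sentence("Sentence C", fully_supported=False, citation_supports=[]),
    Sentence("Sentence D", fully_supported=True, citation_supports=[True]),
]

print(f"citation recall:    {citation_recall(response):.2f}")
print(f"citation precision: {citation_precision(response):.2f}")
```

In this toy response, two of four sentences are fully supported and three of four citations support their sentence, so the two metrics can diverge: an uncited sentence lowers recall without affecting precision.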