Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants' web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants: Perplexity achieves the highest source credibility, whereas GPT-4o cites non-credible sources at an elevated rate on sensitive topics. This work provides the first systematic comparison of the fact-checking behavior of commonly used chat assistants, offering a foundation for evaluating AI systems in high-stakes information environments.