Despite the growing integration of retrieval-enabled AI agents into society, their safety and ethical behavior remain inadequately understood. In particular, connecting LLMs and AI agents to external information sources and real-world environments raises critical questions about how these systems engage with, and are influenced by, the data sources and interactive contexts they access. This study investigates how expanding retrieval access -- from no external sources, to Wikipedia-based retrieval, to open web search -- affects model reliability, bias propagation, and harmful content generation. Extensive benchmarking of censored and uncensored LLMs and AI agents reveals a consistent erosion of refusal rates, bias sensitivity, and harmfulness safeguards as models gain broader access to external sources, a phenomenon we term safety degradation. Notably, retrieval-enabled agents built on aligned LLMs often behave less safely than uncensored models without retrieval. This effect persists even with high retrieval accuracy and prompt-based mitigation, suggesting that the mere presence of retrieved content reshapes model behavior in structurally unsafe ways. These findings underscore the need for robust mitigation strategies to ensure fairness and reliability in retrieval-enabled and increasingly autonomous AI systems.
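As a rough illustration of the evaluation setup summarized above, the sketch below contrasts the three retrieval conditions (no external sources, Wikipedia-based retrieval, open web search) and computes a refusal rate for each. It is a minimal, assumed harness rather than the study's actual code: `query_llm`, the retriever callables, and the keyword-based refusal check are hypothetical placeholders that an experimenter would replace with their own model API, retrievers, and refusal classifier.

```python
# Illustrative sketch only: the model call and retrievers are placeholders,
# not the evaluation code used in the study.

from typing import Callable, Optional

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i will not")


def is_refusal(answer: str) -> bool:
    """Crude keyword check for whether the model declined to answer."""
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)


def query_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call (assumed to be supplied by the user)."""
    raise NotImplementedError


def run_condition(prompts: list[str],
                  retrieve: Optional[Callable[[str], str]] = None) -> float:
    """Return the refusal rate under one retrieval condition:
    retrieve=None            -> no external sources
    retrieve=<wiki retriever>-> Wikipedia-based retrieval
    retrieve=<web search>    -> open web search
    """
    refusals = 0
    for prompt in prompts:
        context = retrieve(prompt) if retrieve else ""
        answer = query_llm(f"{context}\n\n{prompt}" if context else prompt)
        refusals += is_refusal(answer)
    return refusals / max(len(prompts), 1)
```

Running `run_condition` once per condition on the same prompt set and comparing the resulting refusal rates mirrors, in simplified form, the comparison across retrieval settings that the abstract describes.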