The way users acquire information is undergoing a paradigm shift with the advent of ChatGPT. Unlike conventional search engines, ChatGPT retrieves knowledge from the model itself and generates answers for users. ChatGPT's impressive question-answering (QA) capability has attracted more than 100 million users within a short period of time but has also raised concerns regarding its reliability. In this paper, we perform the first large-scale measurement of ChatGPT's reliability in the generic QA scenario with a carefully curated set of 5,695 questions across ten datasets and eight domains. We find that ChatGPT's reliability varies across different domains, especially underperforming in law and science questions. We also demonstrate that system roles, originally designed by OpenAI to allow users to steer ChatGPT's behavior, can impact ChatGPT's reliability. We further show that ChatGPT is vulnerable to adversarial examples, and even a single character change can negatively affect its reliability in certain cases. We believe that our study provides valuable insights into ChatGPT's reliability and underscores the need for strengthening the reliability and security of large language models (LLMs).