Large language models (LLMs) have shown incredible capabilities and transcended the natural language processing (NLP) community, with adoption throughout many services like healthcare, therapy, education, and customer service. Since users include people with critical information needs like students or patients engaging with chatbots, the safety of these systems is of prime importance. Therefore, a clear understanding of the capabilities and limitations of LLMs is necessary. To this end, we systematically evaluate toxicity in over half a million generations of ChatGPT, a popular dialogue-based LLM. We find that setting the system parameter of ChatGPT by assigning it a persona, say that of the boxer Muhammad Ali, significantly increases the toxicity of generations. Depending on the persona assigned to ChatGPT, its toxicity can increase up to 6x, with outputs engaging in incorrect stereotypes, harmful dialogue, and hurtful opinions. This may be potentially defamatory to the persona and harmful to an unsuspecting user. Furthermore, we find concerning patterns where specific entities (e.g., certain races) are targeted more than others (3x more) irrespective of the assigned persona, that reflect inherent discriminatory biases in the model. We hope that our findings inspire the broader AI community to rethink the efficacy of current safety guardrails and develop better techniques that lead to robust, safe, and trustworthy AI systems.
翻译:大型语言模型(LLMs)展现出惊人的能力并超越了自然语言处理(NLP)社区,被广泛应用于医疗、治疗、教育和客户服务等多个领域。由于使用这些系统的用户包括具有重要信息需求的人,如与聊天机器人交互的学生或患者,因此这些系统的安全性至关重要。因此,必须清楚了解LLMs的能力和局限性。为此,我们系统评估了ChatGPT(一种流行的基于对话的LLM)中超过500,000个生成的毒性。我们发现,通过为ChatGPT分配特定人设(如拳击手穆罕默德·阿里),可以显著增加生成的毒性。根据分配给ChatGPT的人设,其毒性可能增加多达6倍,其输出参与不正确的刻板印象、有害的对话和伤害性的观点。这可能会对人设造成潜在诽谤,并对不知情的用户造成伤害。此外,我们发现令人担忧的模式,即特定实体(例如某些种族)被针对的比其他实体多(高达3倍),独立于所分配的人设,反映了该模型内在的歧视偏见。我们希望我们的发现能激励更广泛的AI社区重新思考当前安全防护栏的效力,并开发更好的技术,从而实现强大、安全和值得信赖的AI系统。