Large language models are becoming increasingly pervasive in society through their deployment in sociotechnical systems. Yet these models, whether used for classification or generation, have been shown to be biased and to behave irresponsibly, causing harm to people at scale. It is therefore crucial to audit them rigorously. Existing auditing tools rely on humans, AI, or both to find failures. In this work, we draw upon literature in human-AI collaboration and sensemaking, and conduct interviews with research experts in safe and fair AI, to build upon the auditing tool AdaTest (Ribeiro and Lundberg, 2022), which is powered by a generative large language model (LLM). Through the design process we highlight the importance of sensemaking and human-AI communication for leveraging the complementary strengths of humans and generative models in collaborative auditing. To evaluate the effectiveness of the augmented tool, AdaTest++, we conduct user studies with participants auditing two commercial language models: OpenAI's GPT-3 and Azure's sentiment analysis model. Qualitative analysis shows that AdaTest++ effectively leverages human strengths such as schematization, hypothesis formation, and hypothesis testing. Further, with our tool, participants identified a variety of failure modes, covering 26 different topics across the two tasks, including failures previously documented in formal audits as well as ones that were under-reported.