AI systems can fail to learn important behaviors, leading to real-world issues like safety concerns and biases. Discovering these systematic failures often requires significant developer attention, from hypothesizing potential edge cases to collecting evidence and validating patterns. To scale and streamline this process, we introduce crowdsourced failure reports, which are end-user descriptions of how or why a model failed, and show how developers can use them to detect AI errors. We also design and implement Deblinder, a visual analytics system for synthesizing failure reports that developers can use to discover and validate systematic failures. In semi-structured interviews and think-aloud studies with 10 AI practitioners, we explore the affordances of the Deblinder system and the applicability of failure reports in real-world settings. Lastly, we show how collecting additional data from the groups identified by developers can improve model performance.