Machine learning (ML) models can fail in unexpected ways in the real world, but not all model failures are equal. With finite time and resources, ML practitioners are forced to prioritize their model debugging and improvement efforts. Through interviews with 13 ML practitioners at Apple, we found that practitioners construct small targeted test sets to estimate an error's nature, scope, and impact on users. We built on this insight in a case study with machine translation models, and developed Angler, an interactive visual analytics tool to help practitioners prioritize model improvements. In a user study with 7 machine translation experts, we used Angler to understand prioritization practices when the input space is infinite, and obtaining reliable signals of model quality is expensive. Our study revealed that participants could form more interesting and user-focused hypotheses for prioritization by analyzing quantitative summary statistics and qualitatively assessing data by reading sentences.
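To make the combined quantitative-and-qualitative workflow concrete, below is a minimal sketch in Python. The candidate sentences, topic labels, and `score` field are invented placeholders, not Angler's actual data model or pipeline; the abstract only tells us that practitioners summarize subsets numerically (summary statistics) and then read individual sentences to judge them.

```python
from statistics import mean

# Hypothetical candidate sentences flagged by some upstream quality signal
# (e.g., low round-trip translation similarity). All values are placeholders
# for illustration only.
candidates = [
    {"text": "Set a timer for 10 minutes.", "topic": "commands", "score": 0.91},
    {"text": "break a leg before the show", "topic": "idioms", "score": 0.42},
    {"text": "it's raining cats and dogs", "topic": "idioms", "score": 0.38},
    {"text": "Turn left at the next light.", "topic": "navigation", "score": 0.88},
]

# Quantitative pass: per-topic summary statistics to decide where to look first.
by_topic = {}
for c in candidates:
    by_topic.setdefault(c["topic"], []).append(c)

for topic, items in sorted(by_topic.items(),
                           key=lambda kv: mean(x["score"] for x in kv[1])):
    print(f"{topic}: n={len(items)}, mean score={mean(x['score'] for x in items):.2f}")

# Qualitative pass: read the lowest-scoring sentences in the riskiest topic,
# mirroring how participants assessed data by reading sentences.
worst_topic = min(by_topic, key=lambda t: mean(x["score"] for x in by_topic[t]))
for c in sorted(by_topic[worst_topic], key=lambda x: x["score"])[:5]:
    print(c["score"], c["text"])
```

In this sketch, the numeric summary points a reviewer toward the weakest subset, while reading its worst sentences supports the kind of user-focused hypothesis formation the study describes.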