Testing Machine Learning (ML) models and AI-Infused Applications (AIIAs), i.e., systems that contain ML models, is highly challenging. In addition to the challenges of testing classical software, it is acceptable and expected that statistical ML models sometimes output incorrect results. A major challenge is to determine when the level of incorrectness, e.g., the accuracy or F1 score of a classifier, is acceptable and when it is not. In addition to business requirements, which should provide a threshold, it is a best practice to require any proposed ML solution to outperform simple baseline models, such as a decision tree. We have developed complexity measures that quantify how difficult given observations are to assign to their true class label; these measures can then be used to automatically determine a baseline performance threshold. They are superior to the best-practice baseline in that, at a linear computation cost, they also quantify each observation's classification complexity in an explainable form, regardless of the classifier model used. Our experiments with both numeric synthetic data and real natural-language chatbot data demonstrate that the complexity measures effectively highlight data regions and observations that are likely to be misclassified.
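To make the idea of a per-observation complexity score concrete, here is a minimal sketch, not the paper's exact measure: it scores each observation by the fraction of its nearest neighbors that carry a different label, so high scores mark observations that any classifier is likely to misclassify. The function name `neighborhood_complexity`, the choice of k, and the 0.6 threshold are illustrative assumptions.

```python
# Illustrative sketch of an instance-level classification-complexity score
# based on nearest-neighbor label disagreement (classifier-independent).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_complexity(X, y, k=5):
    """Fraction of each observation's k nearest neighbors whose label differs
    from its own; values near 1 flag observations that are hard to classify."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)            # idx[:, 0] is the point itself
    neighbor_labels = y[idx[:, 1:]]      # labels of the k true neighbors
    return (neighbor_labels != y[:, None]).mean(axis=1)

# Toy usage: flag observations above a complexity threshold as a baseline
# estimate of where misclassifications are expected.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
scores = neighborhood_complexity(X, y, k=5)
print("mean complexity:", scores.mean(), "hard observations:", (scores > 0.6).sum())
```

The per-observation scores are directly explainable (they point to the disagreeing neighbors), and computing them scales linearly in the number of observations for a fixed neighborhood size, mirroring the properties claimed for the measures above.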