FreaAI: Automated extraction of data slices to test machine learning models

Machine learning (ML) solutions are prevalent. However, many challenges exist in making these solutions business-grade. One major challenge is to ensure that the ML solution provides its expected business value. To do that, one has to bridge the gap between the way ML model performance is measured and the solution requirements. In previous work (Barash et al., "Bridging the gap...") we demonstrated the effectiveness of utilizing feature models in bridging this gap. Whereas ML performance metrics, such as the accuracy or F1-score of a classifier, typically measure the average ML performance, feature models shed light on explainable data slices whose performance is too far from that average, and which therefore might indicate unsatisfied requirements. For example, the overall accuracy of a bank text terms classifier may be very high, say $98\% \pm 2\%$, yet it might perform poorly for terms that include short descriptions and originate from commercial accounts. A business requirement, which may be implicit in the training data, may be to perform well regardless of the type of account and the length of the description. Therefore, the under-performing data slice that includes short descriptions and commercial accounts suggests poorly met requirements. In this paper we show the feasibility of automatically extracting feature models that result in explainable data slices over which the ML solution under-performs. Our novel technique, IBM FreaAI (FreaAI for short), extracts such slices from structured ML test data or any other labeled data. We demonstrate that FreaAI can automatically produce explainable and statistically significant data slices over seven open datasets.
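
To make the notion of an under-performing data slice concrete, the following is a minimal sketch and not FreaAI's actual algorithm, which the abstract does not specify. It assumes a toy labeled test set with two hypothetical categorical features, `account_type` and `desc_length`, and a per-record `correct` flag. It enumerates slices defined by one or two feature values and flags those whose accuracy is below the overall accuracy according to an exact one-sided binomial test. The function names, the `alpha` and `min_support` parameters, and the data are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def binom_lower_tail(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def underperforming_slices(rows, features, alpha=0.05, min_support=5):
    """Flag 1- and 2-feature slices whose accuracy is significantly below average.

    rows: list of dicts holding feature values and a boolean 'correct' field.
    Returns the overall accuracy and a list of
    (slice description, support, slice accuracy, p-value), sorted by p-value.
    """
    overall = sum(r["correct"] for r in rows) / len(rows)
    findings = []
    for size in (1, 2):
        for feats in combinations(features, size):
            # Group prediction outcomes by the values of the chosen features.
            groups = defaultdict(list)
            for r in rows:
                groups[tuple(r[f] for f in feats)].append(r["correct"])
            for values, outcomes in groups.items():
                n, k = len(outcomes), sum(outcomes)
                if n < min_support:
                    continue  # too few records to say anything
                # Chance of seeing k or fewer correct if the slice were average.
                p_value = binom_lower_tail(k, n, overall)
                if k / n < overall and p_value < alpha:
                    findings.append((dict(zip(feats, values)), n, k / n, p_value))
    return overall, sorted(findings, key=lambda f: f[3])


def make_rows(account_type, desc_length, total, n_correct):
    """Build toy records: the first n_correct are right, the rest are wrong."""
    return [
        {"account_type": account_type, "desc_length": desc_length, "correct": i < n_correct}
        for i in range(total)
    ]


# Hypothetical test results: only short descriptions from commercial accounts do badly.
rows = (
    make_rows("commercial", "short", 10, 4)
    + make_rows("commercial", "long", 30, 30)
    + make_rows("personal", "short", 30, 30)
    + make_rows("personal", "long", 30, 30)
)

overall, findings = underperforming_slices(rows, ["account_type", "desc_length"])
print(f"overall accuracy: {overall:.2f}")
for description, support, accuracy, p_value in findings:
    print(description, support, f"{accuracy:.2f}", f"{p_value:.2g}")
```

In this toy data the two-feature slice (commercial account, short description) has the smallest p-value, mirroring the bank example above. The sketch applies no multiple-testing correction and handles only categorical features, so real use would need both a correction and a binning strategy for numeric features.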
