ML is being deployed in complex, real-world scenarios where errors have serious consequences. In these systems, thorough testing of the ML pipelines is critical. A key component in ML deployment pipelines is the curation of labeled training data. Common practice in the ML literature assumes that labels are the ground truth. However, in our experience in a large autonomous vehicle development center, we have found that vendors often provide erroneous labels, which can lead to downstream safety risks in trained models. To address these issues, we propose a new abstraction, learned observation assertions, and implement it in a system called Fixy. Fixy leverages existing organizational resources, such as existing (possibly noisy) labeled datasets or previously trained ML models, to learn a probabilistic model for finding errors in human- or model-generated labels. Given user-provided features and these existing resources, Fixy learns feature distributions that specify likely and unlikely values (e.g., that a speed of 30 mph is likely but 300 mph is unlikely). It then uses these feature distributions to score labels for potential errors. We show that Fixy can automatically rank potential errors in real datasets with up to 2$\times$ higher precision compared to recent work on model assertions and standard techniques such as uncertainty sampling.
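To make the core idea concrete, the sketch below illustrates one plausible instantiation (not Fixy's actual implementation): fit a density estimate over a user-provided feature from existing, possibly noisy organizational data, then rank labels by how unlikely their feature values are under that density. The feature (vehicle speed), the synthetic data, and the `error_score` helper are all hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical stand-in for existing organizational data: feature values
# (vehicle speeds in mph) extracted from previously labeled datasets.
rng = np.random.default_rng(0)
observed_speeds = rng.normal(loc=30.0, scale=10.0, size=1_000)

# Learn a feature distribution via kernel density estimation.
speed_density = gaussian_kde(observed_speeds)

def error_score(speed_mph: float) -> float:
    """Negative log-likelihood of the feature value: higher means the
    value is less likely under the learned distribution, i.e., the
    label is more suspicious."""
    return -speed_density.logpdf([speed_mph]).item()

# A 30 mph label scores as plausible; a 300 mph label ranks as a likely error.
for speed in (30.0, 300.0):
    print(f"speed={speed:6.1f} mph  error_score={error_score(speed):.2f}")
```

Sorting labels by such scores yields the ranked list of potential errors that an analyst would review first, which is the workflow the abstract describes.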