Machine learning (ML) models in production pipelines are frequently retrained on the latest partitions of large, continually growing datasets. Due to engineering bugs, partitions in such datasets almost always have some corrupted features; thus, it's critical to detect data issues and block retraining before downstream ML model accuracy decreases. However, it's difficult to identify when a partition is corrupted enough to block retraining: blocking too often yields stale model snapshots in production, while blocking too rarely yields broken model snapshots in production. In this paper, we present an automatic data validation system for ML pipelines implemented at Meta. We employ what we call a Partition Summarization (PS) approach to data validation: each timestamp-based partition of data is summarized with data quality metrics, and summaries are compared to detect corrupted partitions. We describe how PS can be adapted to several data validation methods and compare their pros and cons. Since none of these methods on their own met our requirements for high precision and recall in detecting corruptions, we devised GATE, a data validation method that achieves both. In a case study on Instagram's data, GATE gave a 2.1x average improvement in precision over the baseline. Finally, we discuss lessons learned from implementing data validation for Meta's production ML pipelines.
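To make the PS idea concrete, the sketch below shows one way a partition gate might work: each partition is reduced to per-feature quality metrics, and a new partition's summary is compared against the history of recent summaries before retraining proceeds. This is a minimal illustration of the general approach described in the abstract, not the paper's GATE method; the `summarize` and `is_corrupted` helpers, the choice of metrics (null fraction and mean), and the z-score comparison rule are all illustrative assumptions.

```python
# Sketch of Partition Summarization (PS): summarize each timestamp-based
# partition with per-feature data quality metrics, then compare the newest
# summary against recent history to decide whether to block retraining.
# Metric choices and the z-score rule are illustrative, not the GATE method.
from statistics import mean, stdev

def summarize(partition):
    """Compute per-feature quality metrics for one partition.

    partition: dict mapping feature name -> list of values (None = missing).
    Returns a dict mapping feature name -> {metric name: value}.
    """
    summary = {}
    for feature, values in partition.items():
        present = [v for v in values if v is not None]
        summary[feature] = {
            "null_frac": 1 - len(present) / len(values) if values else 1.0,
            "mean": mean(present) if present else 0.0,
        }
    return summary

def is_corrupted(history, candidate, z_threshold=4.0):
    """Flag the candidate partition if any of its metrics deviates from
    the history of summaries by more than z_threshold standard deviations."""
    for feature, metrics in candidate.items():
        for name, value in metrics.items():
            past = [h[feature][name] for h in history if feature in h]
            if len(past) < 2:
                continue  # not enough history to judge this metric
            mu, sigma = mean(past), stdev(past)
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                return True  # block retraining on this partition
    return False

# Example: gate retraining on the most recent daily partition.
# `daily_partitions` is a hypothetical list of partitions, oldest first.
# history = [summarize(p) for p in daily_partitions[:-1]]
# if is_corrupted(history, summarize(daily_partitions[-1])):
#     pass  # skip retraining; alert on-call instead
```

Note that a fixed per-metric threshold like the z-score rule above forces a trade-off between precision and recall, which is the gap the abstract says motivated GATE.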