The distribution gap between training datasets and the data encountered in production is widely acknowledged. Training datasets are often constructed over a fixed period of time, with the data to be labeled carefully curated. As a result, they may not contain all the variations of data that arise in real-world production environments. When we were tasked with building an entity resolution system (a model that identifies and consolidates data points that represent the same person), our first model exhibited a clear training-production performance gap. In this case study, we discuss our human-in-the-loop, data-centric solution for closing that gap. We conclude with takeaways that apply to data-centric learning at large.