Combining Data-driven Supervision with Human-in-the-loop Feedback for Entity Resolution

The distribution gap between training datasets and the data encountered in production is widely acknowledged. Training datasets are often constructed over a fixed period of time and by carefully curating the data to be labeled. As a result, they may not contain all the variations of data that can be encountered in real-world production environments. When we were tasked with building an entity resolution system (a model that identifies and consolidates data points that represent the same person), our first model exhibited a clear training-production performance gap. In this case study, we discuss our human-in-the-loop enabled, data-centric solution for closing that gap. We conclude with takeaways that apply to data-centric learning at large.

Entity Resolution

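Different data providers may describe the same thing, that is, the same entity, in different ways (description here covers data format, representation, and so on). Each description of an entity is called a reference to that entity. Entity resolution is the process of resolving a set of references and mapping them to entities in the real world. Entity resolution is also known as record linkage, object identification, individual identification, and duplicate detection.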
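To make the mapping from references to entities concrete, here is a minimal Python sketch. It is not the system described in the abstract: the normalization rule, the string-similarity measure (`difflib.SequenceMatcher`), the `0.85` threshold, and the function names `normalize` and `resolve` are all assumptions chosen for illustration. It compares every pair of references and merges those that look alike using a small union-find structure.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    # Illustrative rule (an assumption): lowercase, drop punctuation,
    # and collapse runs of whitespace.
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def resolve(references: list[str], threshold: float = 0.85) -> list[list[str]]:
    # Union-find over reference indices: each root stands for one entity.
    parent = list(range(len(references)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    keys = [normalize(r) for r in references]
    # Pairwise comparison is quadratic; acceptable only for a small sketch.
    for i, j in combinations(range(len(references)), 2):
        if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
            parent[find(j)] = find(i)

    # Group the original references by the root of their cluster.
    clusters: dict[int, list[str]] = {}
    for i, ref in enumerate(references):
        clusters.setdefault(find(i), []).append(ref)
    return list(clusters.values())


if __name__ == "__main__":
    refs = ["John Smith", "john smith", "JOHN  SMITH.", "Alice Wong"]
    for entity in resolve(refs):
        print(entity)
```

In this toy input the three spellings of the first name normalize to the same string and end up in one cluster, while the fourth reference stays on its own. A production system would typically replace the fixed threshold with a learned matcher, which is where labeled data and human feedback come in.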
