"Bad" data has a direct impact on 88% of companies, with the average company losing 12% of its revenue due to it. Duplicates - multiple but different representations of the same real-world entities - are among the main reasons for poor data quality, so finding and configuring the right deduplication solution is essential. Existing data matching benchmarks focus on the quality of matching results and neglect other important factors, such as business requirements. Additionally, they often do not support the exploration of data matching results. To address this gap between the mere counting of record pairs vs. a comprehensive means to evaluate data matching solutions, we present the Frost platform. It combines existing benchmarks, established quality metrics, cost and effort metrics, and exploration techniques, making it the first platform to allow systematic exploration to understand matching results. Frost is implemented and published in the open-source application Snowman, which includes the visual exploration of matching results.