Data cleaning is naturally framed as probabilistic inference in a generative model of ground-truth data and likely errors, but the diversity of real-world error patterns and the hardness of inference make Bayesian approaches difficult to automate. We present PClean, a probabilistic programming language (PPL) for leveraging dataset-specific knowledge to automate Bayesian cleaning. Compared to general-purpose PPLs, PClean tackles a restricted problem domain, enabling three modeling and inference innovations: (1) a non-parametric model of relational database instances, which users' programs customize; (2) a novel sequential Monte Carlo inference algorithm that exploits the structure of PClean's model class; and (3) a compiler that generates near-optimal SMC proposals and blocked-Gibbs rejuvenation kernels based on the user's model and data. We show empirically that short (< 50-line) PClean programs can: be faster and more accurate than generic PPL inference on data-cleaning benchmarks; match state-of-the-art data-cleaning systems in terms of accuracy and runtime (unlike generic PPL inference in the same runtime); and scale to real-world datasets with millions of records.
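To make the opening framing concrete, the following toy sketch treats cleaning a single field as Bayesian inference: a prior over clean values is combined with an error model to yield a posterior over what the dirty observation was meant to be. It is not PClean code and does not reflect PClean's syntax, model class, or inference algorithm. The function names, the prior over city names, and the error model (an unnormalized score of `typo_rate ** edit_distance`) are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def posterior_over_clean_values(observed: str, prior: dict, typo_rate: float = 0.1) -> dict:
    """Posterior over candidate clean values given one observed (possibly dirty) string.

    Assumption: the error model scores an observation as typo_rate ** edit_distance,
    a crude unnormalized stand-in for a real typo likelihood.
    """
    weights = {
        value: p * typo_rate ** edit_distance(value, observed)
        for value, p in prior.items()
    }
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()}


# Hypothetical prior over ground-truth city names (illustrative numbers only).
prior = {"Boston": 0.5, "Austin": 0.3, "Houston": 0.2}
posterior = posterior_over_clean_values("Bostn", prior)
best = max(posterior, key=posterior.get)
print(best, posterior)
```

Here the prior and the error model are fixed by hand for one field. The abstract describes a much richer setting, in which a user-written program customizes a non-parametric model of relational database instances and inference is performed with compiled SMC proposals and blocked-Gibbs rejuvenation kernels.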