Data analytics pipelines often have to query inconsistent data, that is, data that violates predetermined integrity constraints. Data cleaning is an extensively studied paradigm that singles out a consistent repair of the inconsistent data. Consistent query answering (CQA) is an alternative to data cleaning that asks for all tuples guaranteed to be returned by a given query on every repair of the inconsistent data, of which there are in most cases exponentially many. This paper identifies a class of acyclic select-project-join (SPJ) queries for which CQA can be solved via SQL rewriting with a linear-time guarantee. Our rewriting method can be viewed as a generalization of Yannakakis's algorithm for acyclic joins to the inconsistent setting. We present LinCQA, a system that outputs rewritings in both SQL and non-recursive Datalog for every query in this class. We show that LinCQA often outperforms existing CQA systems on both synthetic and real-world workloads, in some cases by orders of magnitude.
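To make the definition of a consistent answer concrete, the following is a minimal brute-force sketch in Python. It is not the LinCQA rewriting: it enumerates every repair explicitly, which takes exponential time in general, and it assumes the integrity constraint is a primary key on a single relation (an assumption made here for illustration). The names `repairs`, `consistent_answers`, and the toy `employees` relation are hypothetical.

```python
from collections import defaultdict
from itertools import product


def repairs(rows, key):
    """Yield every repair: keep exactly one row per distinct key value.

    Assumes the only integrity constraint is a primary key (illustrative).
    """
    blocks = defaultdict(list)
    for row in rows:
        blocks[key(row)].append(row)
    # One choice per block of key-conflicting rows; the number of repairs
    # is the product of the block sizes, hence exponential in general.
    for choice in product(*blocks.values()):
        yield list(choice)


def consistent_answers(rows, key, query):
    """Return the tuples that `query` produces on every repair."""
    result = None
    for repair in repairs(rows, key):
        answers = set(query(repair))
        result = answers if result is None else result & answers
    return result if result is not None else set()


# Toy relation Employee(name, city) with primary key `name`.
# The two rows for "alice" violate the key, so there are two repairs.
employees = [("alice", "Boston"), ("alice", "Chicago"), ("bob", "Boston")]


def in_boston(repair):
    """Select-project query: names of employees located in Boston."""
    return {name for name, city in repair if city == "Boston"}


boston_names = consistent_answers(employees, key=lambda r: r[0], query=in_boston)
all_names = consistent_answers(
    employees, key=lambda r: r[0], query=lambda rep: {n for n, _ in rep}
)
```

Here only `bob` is a consistent answer to `in_boston`, because one of the two repairs places `alice` in Chicago, whereas both names are consistent answers to the projection on `name`. The point of a rewriting-based approach, as described above, is to compute the same answers without materializing the repairs.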