Queries with aggregation and arithmetic operations, as well as incomplete data, are common in real-world databases, but we lack a good understanding of how they should interact. On the one hand, systems based on SQL provide ad-hoc rules for numerical nulls; on the other, theoretical research largely concentrates on the standard notions of certain and possible answers. In the presence of numerical attributes and aggregates, however, these answers are often meaningless, returning either too little or too much. Our goal is to define a principled framework for databases with numerical nulls and for answering queries with arithmetic and aggregation over them. Towards this goal, we assume that missing values in numerical attributes are given by probability distributions associated with marked nulls. This yields a model of probabilistic bag databases in which tuples are not necessarily independent, since nulls can repeat. We provide a general compositional framework for query answering, and then concentrate on queries that resemble standard SQL with arithmetic and aggregation. We show that these queries are measurable, and that their outputs have a finite representation. Moreover, since the classical forms of answers provide little information in the numerical setting, we look at the probability that numerical values in output tuples belong to specific intervals. Even though computing these probabilities exactly is intractable, we present efficient algorithms that approximate them.
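To make the model concrete, the following is a minimal sketch of the setting described above, not the algorithm from the paper. It assumes a toy relation whose price column contains marked nulls, attaches an illustrative distribution to each null, and estimates by plain Monte Carlo sampling the probability that SUM(price) falls in a given interval. The names ROWS, NULL_DISTS, and estimate_interval_probability, as well as the chosen distributions, are hypothetical.

```python
import random

# Hypothetical toy relation: each row is (item, price). A string such as "n1"
# stands for a marked null; the same label always denotes the same unknown value.
ROWS = [("a", 10.0), ("b", "n1"), ("c", "n1"), ("d", "n2")]

# Assumed distributions for the nulls (illustrative choices, not from the paper).
NULL_DISTS = {
    "n1": lambda rng: rng.uniform(0.0, 10.0),
    "n2": lambda rng: rng.gauss(5.0, 1.0),
}


def sample_valuation(rng):
    """Draw one value per marked null, so repeated nulls stay correlated."""
    return {name: draw(rng) for name, draw in NULL_DISTS.items()}


def total_price(rows, valuation):
    """SUM(price) over the bag of rows under a fixed valuation of the nulls."""
    return sum(valuation[p] if isinstance(p, str) else p for _, p in rows)


def estimate_interval_probability(rows, low, high, samples=20000, seed=0):
    """Monte Carlo estimate of P(low <= SUM(price) <= high)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        if low <= total_price(rows, sample_valuation(rng)) <= high:
            hits += 1
    return hits / samples


if __name__ == "__main__":
    print(estimate_interval_probability(ROWS, 20.0, 30.0))
```

Because the null n1 occurs in two rows, both occurrences receive the same sampled value in every draw. This is the dependence between tuples that the abstract mentions, and sampling each occurrence independently would give a different distribution for the sum.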