Probabilistic databases (PDBs) are probability spaces over database instances. They provide a framework for handling uncertainty in databases, as arises from data integration, noisy data, data from unreliable sources, or randomized processes. Most of the existing theoretical literature has investigated finite, tuple-independent PDBs (TI-PDBs), in which the occurrences of tuples are independent events. Only recently, Grohe and Lindner (PODS '19) introduced independence assumptions for PDBs beyond the finite domain assumption. In the finite setting, a major argument for studying the theoretical properties of TI-PDBs is that they can be used to represent any finite PDB via views. This is no longer the case once the number of tuples is countably infinite. In this paper, we systematically study the representability of infinite PDBs in terms of TI-PDBs and the related block-independent disjoint PDBs. The central question is which infinite PDBs are representable as first-order views over tuple-independent PDBs. We give a necessary condition for the representability of PDBs and provide a sufficient criterion for representability in terms of the probability distribution of a PDB. With various examples, we explore the limits of our criteria. We show that conditioning on first-order properties yields no additional power in terms of expressivity. Finally, we discuss the relation between purely logical and arithmetic reasons for (non-)representability.