Acyclic schemes posses known benefits for database design, speeding up queries, and reducing space requirements. An acyclic join dependency (AJD) is lossless with respect to a universal relation if joining the projections associated with the schema results in the original universal relation. An intuitive and standard measure of loss entailed by an AJD is the number of redundant tuples generated by the acyclic join. Recent work has shown that the loss of an AJD can also be characterized by an information-theoretic measure. Motivated by the problem of automatically fitting an acyclic schema to a universal relation, we investigate the connection between these two characterizations of loss. We first show that the loss of an AJD is captured using the notion of KL-Divergence. We then show that the KL-divergence can be used to bound the number of redundant tuples. We prove a deterministic lower bound on the percentage of redundant tuples. For an upper bound, we propose a random database model, and establish a high probability bound on the percentage of redundant tuples, which coincides with the lower bound for large databases.
翻译:无环模式对于数据库设计、查询速度的提升和空间要求的减少有着已知的好处。如果关联模式对于通用关系是无损的,则关联谓词是无环的。直观和标准的损失测量方法是所生成的冗余元组的数量。最近的研究表明,关联谓词的损失也可以通过信息熵测量来表征。在自动适应通用关系的无环模式的问题上,我们研究了这两种损失表征之间的关系。我们首先证明了关联谓词的损失可以使用KL散度来刻画。然后我们证明KL散度可以用来限制冗余元组的数量。我们给出了冗余元组的确定性下界。对于上界,我们提出了一个随机数据库模型,并对冗余元组的百分比建立了高概率上界。对于大型数据库,这个上界与下界相一致。