Data scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. Given a set of joins, we study the problem of obtaining a random sample from their union without computing the full joins and union, thereby avoiding that cost. We present a general framework for random sampling over the set union of chain, acyclic, and cyclic joins, with sample uniformity and independence guarantees. We study the novel problem of estimating the size of a union of joins and propose two approximation methods, one based on column histograms and one based on random walks on the data. We propose an online union sampling framework that initializes with cheap-to-calculate parameter approximations and refines them on the fly during sampling. We evaluate our framework on workloads from the TPC-H benchmark and explore the trade-off between the accuracy of union approximation and sampling efficiency.
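To make the uniformity requirement concrete, the following is a minimal sketch of a textbook rejection scheme for sampling uniformly from a set union. It is not the framework described in the abstract. It assumes each join result is already materialized as a list of distinct tuples and that exact sizes are known, which are precisely the costs the paper aims to avoid. The names `sample_union` and `join_results` are hypothetical.

```python
import random


def sample_union(join_results, rng):
    """Draw one tuple uniformly at random from the set union of join_results.

    Assumption: each entry of join_results is a non-empty list of distinct,
    hashable tuples (a materialized join result, used here only to illustrate).
    """
    sizes = [len(r) for r in join_results]
    lookup = [set(r) for r in join_results]
    while True:
        # Pick a join with probability proportional to its size, then a tuple
        # uniformly inside it: every (join, tuple) pair is equally likely.
        j = rng.choices(range(len(join_results)), weights=sizes, k=1)[0]
        t = rng.choice(join_results[j])
        # A tuple present in several joins would be over-represented, so accept
        # it only when j is the first join that contains it.
        if not any(t in lookup[i] for i in range(j)):
            return t


# Two overlapping toy join results; their union has five distinct tuples.
join_results = [
    [(1, "a"), (2, "b"), (3, "c")],
    [(2, "b"), (3, "c"), (4, "d"), (5, "e")],
]
rng = random.Random(0)
samples = [sample_union(join_results, rng) for _ in range(6000)]
```

Because every accepted tuple corresponds to exactly one equally likely (join, tuple) pair, each tuple in the union is returned with the same probability, and repeated calls are independent. The rejection step depends on knowing the join sizes and the overlap between joins, which is why estimating the size of a union of joins matters when the joins are not materialized.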