This paper studies approximate sorting for big data. The goal is to produce an approximately sorted result while using less CPU and I/O cost. Because the data are large, we consider approximate sorting in the I/O model. Existing metrics on the permutation space are not suitable for external approximate sorting algorithms. We therefore propose a new kind of metric, called the external metric, which ignores the errors and dislocations that occur within each I/O block. The external Spearman's footrule metric is the external counterpart of Spearman's footrule metric. To make approximately sorted results easier to evaluate, we also propose a new metric, named errors, which directly counts the dislocated elements, and we consider its external counterpart, external errors. Using the rate-distortion relationship induced by these two metrics, we prove lower bounds on both metrics for the external approximate sorting problem with t I/O operations. We propose a k-pass external approximate sorting algorithm, named EASORT, and prove that EASORT is asymptotically optimal. Finally, we consider applications of approximately sorted results. We propose an index over the output of our approximate sorting and use it to analyze single and range queries on the approximately sorted result. We also discuss the sort-merge join of two relations when one or both of the relations are approximately sorted.
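The metrics named in the abstract can be illustrated with a small sketch. The abstract does not give formal definitions, so everything below is an assumption made for illustration: keys are taken to be distinct, Spearman's footrule is taken as the sum of distances between each element's current and fully sorted position, errors is taken as the count of elements not at their sorted position, and the external variants are read as comparing block indices (position divided by a hypothetical `block_size`) instead of positions. The function names are hypothetical, and the paper's own definitions may differ.

```python
def sorted_positions(output):
    """For each index i, return the index output[i] would occupy in fully
    sorted order. Assumes distinct keys (an assumption of this sketch)."""
    order = sorted(range(len(output)), key=lambda i: output[i])
    target = [0] * len(output)
    for rank, i in enumerate(order):
        target[i] = rank
    return target


def footrule(output):
    """Spearman's footrule: total displacement of elements from sorted order."""
    return sum(abs(i - t) for i, t in enumerate(sorted_positions(output)))


def errors(output):
    """Number of elements that are not at their sorted position."""
    return sum(1 for i, t in enumerate(sorted_positions(output)) if i != t)


def external_footrule(output, block_size):
    """Assumed external variant: displacement measured in block indices,
    so dislocation inside a single I/O block contributes nothing."""
    return sum(
        abs(i // block_size - t // block_size)
        for i, t in enumerate(sorted_positions(output))
    )


def external_errors(output, block_size):
    """Assumed external variant: count elements lying in the wrong block."""
    return sum(
        1
        for i, t in enumerate(sorted_positions(output))
        if i // block_size != t // block_size
    )


if __name__ == "__main__":
    # Every misplaced element stays inside its own block of size 2.
    within_blocks = [2, 1, 3, 4, 6, 5, 8, 7]
    print(footrule(within_blocks), errors(within_blocks))
    print(external_footrule(within_blocks, 2), external_errors(within_blocks, 2))
```

In the example input, every swap happens inside one block of size 2, so the ordinary metrics report dislocation while the block-level variants, under the reading assumed here, report none. This is the sense in which an external metric is insensitive to disorder that a single block read can repair in memory.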