With the continued digitization of societal processes, the amount of available data is growing explosively, a phenomenon referred to as big data. In a research setting, three aspects of the data are often viewed as the main sources of challenges when attempting to create value from big data: volume, velocity, and variety. Many studies address volume or velocity, while fewer address variety. Metric spaces are well suited to addressing variety because they can accommodate any data as long as it can be equipped with a distance notion that satisfies the triangle inequality. To accelerate search in metric spaces, a collection of indexing techniques for metric data has been proposed. However, existing surveys offer limited coverage, and a comprehensive empirical study has yet to be reported. We offer a comprehensive survey of existing metric indexes that support exact similarity search: we summarize the partitioning, pruning, and validation techniques used by these indexes; we provide time and space complexity analyses of index construction; and we offer an empirical comparison of their query processing performance. Empirical studies are important when evaluating metric indexing performance, because performance can depend heavily on the effectiveness of the available pruning and validation techniques as well as on the data distribution, which means that complexity analyses often offer limited insight. This article aims to reveal the strengths and weaknesses of different indexing techniques in order to offer guidance on selecting an appropriate indexing technique for a given setting, and to provide directions for future research on metric indexing.
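To make the roles of pruning and validation concrete, the following is a minimal sketch of a pivot-based range query, assuming Euclidean distance over 2-D points. It is not taken from any index covered by the survey; the function names (`build_pivot_table`, `range_query`), the choice of pivots, and the demo data are all hypothetical. Pruning uses the triangle-inequality lower bound |d(q,p) - d(o,p)| <= d(q,o), and validation uses the upper bound d(q,o) <= d(q,p) + d(o,p).

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance; any distance obeying the triangle inequality works."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_pivot_table(objects, pivots, dist):
    """Precompute the distance from every object to every pivot."""
    return [[dist(o, p) for p in pivots] for o in objects]


def range_query(objects, pivots, table, dist, q, r):
    """Return (objects within distance r of q, number of distances computed)."""
    q_to_pivot = [dist(q, p) for p in pivots]
    results = []
    computed = 0
    for o, row in zip(objects, table):
        # Pruning: the largest lower bound over all pivots already exceeds r.
        lower = max(abs(dq - do) for dq, do in zip(q_to_pivot, row))
        if lower > r:
            continue
        # Validation: the smallest upper bound is within r, so o qualifies
        # without computing d(q, o).
        upper = min(dq + do for dq, do in zip(q_to_pivot, row))
        if upper <= r:
            results.append(o)
            continue
        # Neither bound decides, so compute the actual distance.
        computed += 1
        if dist(q, o) <= r:
            results.append(o)
    return results, computed


# Hypothetical demo data: 1000 random points, the first 4 used as pivots.
random.seed(42)
points = [(random.random(), random.random()) for _ in range(1000)]
pivots = points[:4]
table = build_pivot_table(points, pivots, euclidean)
hits, computed = range_query(points, pivots, table, euclidean, (0.5, 0.5), 0.1)
print(len(hits), "results;", computed, "of", len(points), "distances computed")
```

How many distance computations such bounds save depends on the pivots and on the data distribution, which is one reason empirical comparison is informative alongside complexity analysis.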