Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on distances. Recently, cosine similarity has become a popular alternative to the standard Euclidean metric, in particular in the context of textual data and neural network embeddings. Unfortunately, cosine similarity is not a metric and does not satisfy the standard triangle inequality. As a result, many search techniques for cosine similarity rely on approximation techniques such as locality-sensitive hashing. In this paper, we derive a triangle inequality for cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree), show that this bound is tight, and discuss fast approximations for it. We hope that this spurs new research on accelerating exact similarity search for cosine similarity, and possibly for other similarity measures, beyond the existing work for distance metrics.
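The abstract does not state the bound itself. As a hedged illustration, the sketch below uses the form that follows from the ordinary triangle inequality on the angles between vectors: given sim(x, z) and sim(z, y), sim(x, y) lies within sim(x, z)·sim(z, y) ± sqrt((1 − sim(x, z)²)(1 − sim(z, y)²)). The notation may differ from the paper's, and the function names (`cosine_similarity`, `cosine_bounds`, `can_prune`, `bounds_hold_on_random_data`) are hypothetical helpers introduced only for this example.

```python
import math
import random


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def cosine_bounds(sim_xz, sim_zy):
    """Lower and upper bound on sim(x, y), given sim(x, z) and sim(z, y).

    Assumption: this is the angle-based form cos(a +/- b), not a formula
    quoted from the abstract.
    """
    # Clamp to [-1, 1] so rounding errors cannot make the square root fail.
    a = max(-1.0, min(1.0, sim_xz))
    b = max(-1.0, min(1.0, sim_zy))
    slack = math.sqrt((1.0 - a * a) * (1.0 - b * b))
    return a * b - slack, a * b + slack


def can_prune(sim_query_pivot, sim_pivot_candidate, threshold):
    """True if the candidate cannot reach the similarity threshold."""
    _, upper = cosine_bounds(sim_query_pivot, sim_pivot_candidate)
    return upper < threshold


def bounds_hold_on_random_data(trials=1000, dim=8, seed=0):
    """Empirically check the bounds on random Gaussian vectors."""
    rng = random.Random(seed)
    eps = 1e-9  # tolerance for floating-point rounding
    for _ in range(trials):
        x, y, z = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
        lower, upper = cosine_bounds(cosine_similarity(x, z),
                                     cosine_similarity(z, y))
        if not (lower - eps <= cosine_similarity(x, y) <= upper + eps):
            return False
    return True


if __name__ == "__main__":
    print(bounds_hold_on_random_data())
    print(can_prune(0.9, 0.1, threshold=0.8))
```

In an index such as a VP-tree, `z` would play the role of a pivot: the similarities to the pivot are known, so a candidate whose upper bound falls below the current threshold can be skipped without computing its exact similarity to the query.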