Approximate Nearest Neighbor Search (ANNS) in high-dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, multimedia) dealing with an abundance of data. In this paper, we propose an end-to-end learning framework that couples the partitioning step (a key step of ANNS) and the learning-to-search step using a custom loss function. A key advantage of our solution is that it does not require any expensive pre-processing of the dataset, which is one of the main limitations of the state-of-the-art approach. We achieve this by formulating a multi-objective custom loss function that does not need ground-truth labels to quantify the quality of a given partition of the data space, making it entirely unsupervised. We also propose an ensembling technique that trains an ensemble of models by varying the input weights to the loss function, further enhancing search quality. On several standard ANNS benchmarks, we show that our method beats the state-of-the-art space-partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. Beyond ANNS, our unsupervised partitioning approach is a promising alternative to widely used clustering methods such as K-means and DBSCAN.
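To make the idea of an unsupervised, multi-objective partitioning loss concrete, the following is a minimal sketch, not the paper's actual formulation. It assumes two illustrative objectives, intra-partition compactness and partition-size balance, mixed by a hypothetical weight `lam`; the names `SoftPartitioner` and `partition_loss` are invented for this example. A small network produces soft assignments of points to K partitions, and no ground-truth labels appear anywhere in the loss.

```python
# A minimal sketch (not the authors' implementation) of an unsupervised,
# multi-objective loss for learning a space partition. Assumptions:
# objectives = (i) intra-partition compactness, (ii) size balance,
# combined with a hypothetical mixing weight `lam`.
import torch
import torch.nn as nn

class SoftPartitioner(nn.Module):
    """Maps input points to soft assignments over k partitions."""
    def __init__(self, dim, k, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, k))

    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)  # shape (n, k)

def partition_loss(x, assign, lam=1.0):
    # Soft partition centroids: per-partition weighted means of the points.
    weights = assign / (assign.sum(dim=0, keepdim=True) + 1e-9)  # (n, k)
    centroids = weights.t() @ x                                  # (k, d)
    # Objective 1: compactness -- expected squared distance of each
    # point to the centroid of its (soft) partition.
    d2 = torch.cdist(x, centroids).pow(2)                        # (n, k)
    compact = (assign * d2).sum(dim=1).mean()
    # Objective 2: balance -- penalize uneven partition sizes so no
    # single partition dominates query time.
    sizes = assign.mean(dim=0)                                   # (k,)
    balance = ((sizes - 1.0 / assign.shape[1]) ** 2).sum()
    return compact + lam * balance

# Usage: train one partitioner; re-training with different values of
# `lam` would yield an ensemble of complementary partitions, echoing
# the ensembling-by-varying-loss-weights idea described above.
x = torch.randn(1024, 32)            # toy dataset: 1024 points in 32-d
model = SoftPartitioner(dim=32, k=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = partition_loss(x, model(x), lam=1.0)
    loss.backward()
    opt.step()
```

Because both objectives are computed purely from the data and the model's own assignments, the training signal is fully unsupervised, which is the property the abstract highlights.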