Approximate Nearest Neighbor Search (ANNS) in high-dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, multimedia) dealing with an abundance of data. In this paper, we propose an end-to-end learning framework that couples the partitioning step (a key step of ANNS) and the learning-to-search step using a custom loss function. A key advantage of our solution is that it does not require any expensive pre-processing of the dataset, which is one of the main limitations of the state-of-the-art approach. We achieve this by formulating a multi-objective custom loss function that does not need ground-truth labels to quantify the quality of a given partition of the data space, making it entirely unsupervised. We also propose an ensembling technique that trains an ensemble of models by varying the input weights to the loss function, further enhancing search quality. On several standard ANNS benchmarks, we show that our method beats the state-of-the-art space-partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. Beyond ANNS, our unsupervised partitioning approach is a promising alternative to widely used clustering methods such as K-means and DBSCAN.
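To make the idea of an unsupervised, multi-objective partitioning loss concrete, the following is a minimal sketch, not the paper's actual formulation. It assumes two illustrative objectives, intra-partition compactness and partition-size balance, mixed by a hypothetical weight `lam`; the names `SoftPartitioner` and `partition_loss` are invented for this example. A small network produces soft assignments of points to K partitions, and no ground-truth labels appear anywhere in the loss.

```python
# A minimal sketch (not the authors' implementation) of an unsupervised,
# multi-objective loss for learning a space partition. Assumptions:
# objectives = (i) intra-partition compactness, (ii) size balance,
# combined with a hypothetical mixing weight `lam`.
import torch
import torch.nn as nn

class SoftPartitioner(nn.Module):
    """Maps input points to soft assignments over k partitions."""
    def __init__(self, dim, k, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, k))

    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)  # shape (n, k)

def partition_loss(x, assign, lam=1.0):
    # Soft partition centroids: per-partition weighted means of the points.
    weights = assign / (assign.sum(dim=0, keepdim=True) + 1e-9)  # (n, k)
    centroids = weights.t() @ x                                  # (k, d)
    # Objective 1: compactness -- expected squared distance of each
    # point to the centroid of its (soft) partition.
    d2 = torch.cdist(x, centroids).pow(2)                        # (n, k)
    compact = (assign * d2).sum(dim=1).mean()
    # Objective 2: balance -- penalize uneven partition sizes so no
    # single partition dominates query time.
    sizes = assign.mean(dim=0)                                   # (k,)
    balance = ((sizes - 1.0 / assign.shape[1]) ** 2).sum()
    return compact + lam * balance

# Usage: train one partitioner; re-training with different values of
# `lam` would yield an ensemble of complementary partitions, echoing
# the ensembling-by-varying-loss-weights idea described above.
x = torch.randn(1024, 32)            # toy dataset: 1024 points in 32-d
model = SoftPartitioner(dim=32, k=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = partition_loss(x, model(x), lam=1.0)
    loss.backward()
    opt.step()
```

Because both objectives are computed purely from the data and the model's own assignments, the training signal is fully unsupervised, which is the property the abstract highlights.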