以高维度为近邻取样 -- -- 谁是最公平的共和党? (Sampling a Near Neighbor in High Dimensions -- Who is the Fairest of Them All?)

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points $S$ and a radius parameter $r>0$, the $r$-near neighbor ($r$-NN) problem asks for a data structure that, given any query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance $r$ from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. In this work, we show that LSH based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights the inherent unfairness of NN data structures and shows the performance of our algorithms on real-world datasets.

翻译：相似性搜索是一个基本的算法原始, 在许多计算机科学学科中广泛使用。在一系列点上, 美元和半径参数 $>0 美元的情况下, 美元附近的邻居( 美元- NNN) 问题要求建立一个数据结构, 在任何问点下, 以美元计, 以美元计, 在距离内返回一个点, 以美元计, 以美元计, 以美元计。在本文中, 我们从个人公平和提供平等机会的角度来研究美元- 零点问题: 所有在查询的距离内, 美元范围内的点应该有相同的返回概率。在低维方面, 这个问题首先由Hu, Qiao 和 Tao (PODS, 2014) 研究。本地敏感性信息信息敏感( LSH), 理论上最强的类似性搜索方法, 并不能提供这种公平性保证。在本文中, 我们用基于LSH的算算法进行公平性分析, 效率不会受到重大损失。我们为公平 NNE问题的确切和近似变种变量提议若干有效的数据结构。我们的方法更一般地用于从精确的精确地取样中取样,, 也要求用一个精确的精确的精确的精确的精确的精确的精确的精确和精确性数据采集, 。