Partitioning trees are efficient data structures for $k$-nearest neighbor ($k$-nn) search. Machine learning libraries commonly use a special type of partitioning tree, the $k$d-tree, to perform $k$-nn search. Unfortunately, $k$d-trees can be ineffective in high dimensions because they need more tree levels to decrease the vector quantization (VQ) error. Random projection trees (rpTrees) solve this scalability problem by using random directions to split the data. A collection of rpTrees is called an rpForest. $k$-nn search in an rpForest is influenced by two factors: 1) the dispersion of points along the random direction and 2) the number of rpTrees in the rpForest. In this study, we investigate how these two factors affect $k$-nn search across varying values of $k$ and different datasets. We found that with a larger number of trees, the dispersion of points has a very limited effect on $k$-nn search. One should therefore use the original rpTree algorithm, picking a random direction regardless of the dispersion of points.
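To make the two factors concrete, the following is a minimal sketch of an rpForest in plain Python. It is an illustration under stated assumptions, not the implementation evaluated in the study: each node draws a single random direction without checking the dispersion of points along it, the split is placed at the median projection (an assumption made for simplicity), and the names `build_rptree`, `build_rpforest`, `knn_query`, `leaf_size`, and `n_trees` are hypothetical.

```python
import math
import random


def random_direction(dim, rng):
    """Draw a random unit vector by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_rptree(points, indices, leaf_size, rng):
    """Recursively split `indices` along one random direction per node."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    # One random direction, chosen regardless of how dispersed the points are.
    direction = random_direction(len(points[0]), rng)
    projections = sorted((dot(points[i], direction), i) for i in indices)
    mid = len(projections) // 2
    # Assumption: split at the median projection so both halves are non-empty.
    threshold = projections[mid][0]
    left = [i for _, i in projections[:mid]]
    right = [i for _, i in projections[mid:]]
    return {
        "direction": direction,
        "threshold": threshold,
        "left": build_rptree(points, left, leaf_size, rng),
        "right": build_rptree(points, right, leaf_size, rng),
    }


def build_rpforest(points, n_trees, leaf_size=10, seed=0):
    """Build `n_trees` independent rpTrees over the same points."""
    rng = random.Random(seed)
    all_indices = list(range(len(points)))
    return [build_rptree(points, all_indices, leaf_size, rng) for _ in range(n_trees)]


def knn_query(forest, points, query, k):
    """Approximate k-nn: pool the leaf reached in every tree, then rank exactly."""
    candidates = set()
    for node in forest:
        while "leaf" not in node:
            goes_left = dot(query, node["direction"]) < node["threshold"]
            node = node["left"] if goes_left else node["right"]
        candidates.update(node["leaf"])
    # Exact Euclidean distances are computed only on the pooled candidates.
    return sorted(candidates, key=lambda i: math.dist(points[i], query))[:k]
```

In this sketch, adding trees enlarges the pooled candidate set, which is one way to see why a poor split in any single tree matters less as the forest grows.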