Big data mining is an important task in data science because it can reveal useful observations and new knowledge hidden in large datasets. Proximity-based data analysis, in particular, is used in many real-life applications. Such analysis usually relies on the distances to the k nearest neighbors, so its main bottleneck is data retrieval. Much effort has been made to improve the efficiency of these analyses, but they still incur large costs because they inherently require many data accesses. To avoid this issue, we propose a machine-learning technique that quickly and accurately estimates the k-NN distances (i.e., the distances to the k nearest neighbors) of a given query. We train a fully connected neural network model and utilize pivots to achieve accurate estimation. Our model is designed with several advantages: it infers the distances to all k nearest neighbors at once, its inference time is O(1) (no data accesses are incurred), and it still maintains high accuracy. Our experimental results and case studies on real datasets demonstrate the efficiency and effectiveness of our solution.
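To make the idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It contrasts a brute-force k-NN distance computation, which scans the whole dataset, with a forward pass through a small fully connected network that reads only the query and a few pivots. The input encoding (query coordinates concatenated with distances to pivots), the layer sizes, the random pivot selection, and all function names are assumptions made for illustration; the abstract does not specify them. Training is omitted, so the network output here is not a meaningful estimate.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn_distances(query, data, k):
    """Brute-force baseline: scans every point, so cost grows with len(data)."""
    return sorted(euclidean(query, p) for p in data)[:k]


def pivot_features(query, pivots):
    """Assumed input encoding: query coordinates plus distances to each pivot."""
    return list(query) + [euclidean(query, p) for p in pivots]


def init_mlp(sizes, rng):
    """Random weights for a fully connected network (hypothetical sizes, untrained)."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((w, b))
    return layers


def mlp_forward(layers, x):
    """One forward pass: ReLU on hidden layers, linear output with k values."""
    for i, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


rng = random.Random(0)
dim, k, n_pivots = 2, 3, 4
data = [[rng.random() for _ in range(dim)] for _ in range(1000)]
pivots = rng.sample(data, n_pivots)  # assumption: pivots drawn at random from the data
query = [0.5, 0.5]

# Exact distances: the values a trained model would be asked to regress to.
targets = exact_knn_distances(query, data, k)

# Model path: touches only the query, the pivots, and the weights.
features = pivot_features(query, pivots)
layers = init_mlp([dim + n_pivots, 16, k], rng)
estimate = mlp_forward(layers, features)  # k outputs in a single pass
```

In this sketch, `exact_knn_distances` reads all 1000 points, while `mlp_forward` performs a fixed amount of work that depends only on the number of pivots and the layer sizes. That is the sense in which inference cost is independent of the dataset size, and a single pass yields all k distance estimates together.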