In many real-world applications such as business planning and sensor data monitoring, one important yet challenging task is to rank objects (e.g., products, documents, or spatial objects) by their ranking scores and efficiently return the objects with the highest scores. In practice, because data sources are unreliable, many real-world objects contain noise and are therefore imprecise and uncertain. In this paper, we study the problem of the probabilistic top-k dominating (PTD) query over such large-scale uncertain data in a distributed environment. The query retrieves, from distributed uncertain databases (stored on multiple distributed servers), the k uncertain objects that have the largest ranking scores with high confidence. To tackle the distributed PTD problem efficiently, we propose a MapReduce framework for processing distributed PTD queries over distributed uncertain databases. Within this MapReduce framework, we design effective pruning strategies to filter out false alarms in the distributed setting, propose cost-model-based index distribution mechanisms over servers, and develop efficient distributed PTD query processing algorithms. Extensive experiments on both real and synthetic data sets, under various experimental settings, demonstrate the efficiency and effectiveness of our proposed distributed PTD approach.
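To make the query semantics concrete, the following is a minimal sketch of a naive, unpruned PTD evaluation split into a map step and a reduce step. It is not the framework proposed in the paper. Several points are assumptions made only for illustration: an uncertain object is modelled as a list of (point, probability) instances, smaller values are treated as better in every dimension, and the ranking score is taken to be the expected number of objects dominated. The names `dominates`, `dominance_probability`, `map_partial_scores`, `reduce_top_k`, and `ptd_query` are hypothetical.

```python
from collections import defaultdict
from heapq import nlargest

# Assumed data model: an uncertain object is a list of (point, probability)
# instances; a partition maps object names to objects held by one server.

def dominates(a, b):
    """Assumed convention: a dominates b if it is no worse in every
    dimension (smaller is better) and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def dominance_probability(obj_a, obj_b):
    """Probability that obj_a dominates obj_b, assuming independent objects."""
    return sum(pa * pb for a, pa in obj_a for b, pb in obj_b if dominates(a, b))

def map_partial_scores(all_objects, partition):
    """Map step: score every candidate against one server's local objects."""
    return {
        name: sum(
            dominance_probability(obj, other)
            for other_name, other in partition.items()
            if other_name != name
        )
        for name, obj in all_objects.items()
    }

def reduce_top_k(partials, k):
    """Reduce step: add up partial scores and keep the k largest totals."""
    totals = defaultdict(float)
    for partial in partials:
        for name, score in partial.items():
            totals[name] += score
    return nlargest(k, totals.items(), key=lambda item: item[1])

def ptd_query(partitions, k):
    """Naive driver: broadcast all candidates, map per partition, then reduce."""
    all_objects = {name: obj for part in partitions for name, obj in part.items()}
    partials = [map_partial_scores(all_objects, part) for part in partitions]
    return reduce_top_k(partials, k)

# Toy data on two hypothetical servers (values invented for illustration).
EXAMPLE_PARTITIONS = [
    {"A": [((1, 1), 0.5), ((2, 3), 0.5)], "B": [((2, 2), 1.0)]},
    {"C": [((3, 3), 0.6), ((1, 4), 0.4)], "D": [((4, 4), 1.0)]},
]

if __name__ == "__main__":
    for name, score in ptd_query(EXAMPLE_PARTITIONS, k=2):
        print(name, round(score, 3))
```

This sketch compares every candidate with every object on every server, which is exactly the cost that pruning strategies and index distribution are meant to avoid. It is useful only as a reference for what a correct answer looks like on small inputs.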