Cloud-based enterprise search services (e.g., Amazon Kendra) are attractive to big data owners because they provide convenient search solutions over enterprise big datasets. However, individuals and businesses that deal with confidential big data (e.g., credential documents) are reluctant to fully embrace such services due to valid concerns about data privacy. Solutions based on client-side encryption have been explored to mitigate these privacy concerns. Nonetheless, such solutions hinder data processing, specifically clustering, which is pivotal in dealing with different forms of big data. For instance, clustering is critical for limiting the search space and performing real-time search operations on big datasets. To overcome the hindrance in clustering encrypted big data, we propose privacy-preserving clustering schemes for three forms of unstructured encrypted big datasets, namely static, semi-dynamic, and dynamic datasets. To preserve data privacy, the proposed clustering schemes operate on statistical characteristics of the data and determine (A) the suitable number of clusters and (B) the appropriate content for each cluster. Experimental results from evaluating the clustering schemes on three different datasets demonstrate an improvement of between 30% and 60% in cluster coherency compared to other clustering schemes for encrypted data. Employing the clustering schemes in a privacy-preserving enterprise search system decreases its search time by up to 78% while increasing the search accuracy by up to 35%.