K-means is one of the most widely used clustering models in practice. Because data is isolated across organizations and applications demand high model performance, jointly building a practical and secure K-means model for multiple parties has become an important problem in many industrial applications. Existing work falls mainly into two types. The first type is efficient, but it leaks information and therefore poses potential privacy risks. The second type is provably secure but inefficient, and it becomes impractical for large-scale sparse data. In this paper, we propose a new framework for efficient sparsity-aware K-means with three characteristics. First, our framework is divided into a data-independent offline phase and a much faster online phase, and almost all cryptographic operations can be pre-computed in the offline phase. Second, we take advantage of vectorization techniques in both the online and offline phases. Third, we adopt sparse matrix multiplication in the data sparsity scenario to further improve efficiency. We conduct comprehensive experiments on three synthetic datasets and deploy our model in a real-world fraud detection task. Our experimental results show that, compared with the state-of-the-art solution, our model achieves competitive performance in terms of both running time and communication size, especially on sparse datasets.
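To illustrate the offline/online split, the following is a minimal sketch of one common way such a split can be realized: additive secret sharing with precomputed multiplication (Beaver) triples. The abstract does not specify the protocol, so this is an assumption and not the framework's actual construction. The ring size `MOD`, the trusted-dealer triple generation, and all function names are hypothetical choices made for illustration; a real deployment would generate triples cryptographically, and the sparse matrix multiplication step is not modeled here. The example computes a secret-shared dot product, the core operation in the sample-to-centroid distance of K-means.

```python
import secrets

MOD = 2 ** 32  # ring for additive shares (assumed choice, not from the paper)


def share(value):
    """Split an integer into two additive shares modulo MOD."""
    s0 = secrets.randbelow(MOD)
    return s0, (value - s0) % MOD


def reconstruct(s0, s1):
    """Recombine two additive shares."""
    return (s0 + s1) % MOD


def offline_triples(n):
    """Data-independent phase: a stand-in trusted dealer (assumption)
    produces n shared triples (a, b, c) with c = a * b mod MOD."""
    triples = []
    for _ in range(n):
        a, b = secrets.randbelow(MOD), secrets.randbelow(MOD)
        triples.append((share(a), share(b), share(a * b % MOD)))
    return triples


def online_dot(x_shares, y_shares, triples):
    """Online phase: shared dot product, consuming one triple per coordinate."""
    z0 = z1 = 0
    for (x0, x1), (y0, y1), ((a0, a1), (b0, b1), (c0, c1)) in zip(
        x_shares, y_shares, triples
    ):
        # The parties open only the masked values e = x - a and f = y - b.
        e = (x0 - a0 + x1 - a1) % MOD
        f = (y0 - b0 + y1 - b1) % MOD
        # x*y = c + e*b + f*a + e*f; the public e*f term is added by party 0 only.
        z0 = (z0 + c0 + e * b0 + f * a0 + e * f) % MOD
        z1 = (z1 + c1 + e * b1 + f * a1) % MOD
    return z0, z1


if __name__ == "__main__":
    x = [3, 0, 5]  # toy sample vector
    y = [2, 7, 4]  # toy centroid vector
    triples = offline_triples(len(x))  # can run before any data is available
    z0, z1 = online_dot([share(v) for v in x], [share(v) for v in y], triples)
    print(reconstruct(z0, z1))  # 3*2 + 0*7 + 5*4 = 26
```

In this sketch the online phase involves only additions, multiplications by opened values, and two openings per coordinate, which is why moving triple generation offline can make the online phase much cheaper.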