Recently, data collaboration (DC) analysis has been developed for privacy-preserving integrated analysis across multiple institutions. DC analysis centralizes individually constructed dimensionality-reduced intermediate representations and realizes integrated analysis via collaboration representations without sharing the original data. To construct the collaboration representations, each institution generates and shares a shareable anchor dataset and centralizes its intermediate representation. Although a random anchor dataset generally functions well for DC analysis, using an anchor dataset whose distribution is close to that of the raw dataset is expected to improve the recognition performance, particularly for interpretable DC analysis. Based on an extension of the synthetic minority over-sampling technique (SMOTE), this study proposes an anchor data construction technique that improves the recognition performance without increasing the risk of data leakage. Numerical results demonstrate the efficiency of the proposed SMOTE-based method over existing anchor data constructions for artificial and real-world datasets. Specifically, for an income dataset, the proposed method achieves improvements of 9 percentage points in accuracy and 38 percentage points in essential feature selection over existing methods. The proposed method thus provides another use of SMOTE: not for imbalanced data classification, but as a key technology for privacy-preserving integrated analysis.
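As a rough illustration of the interpolation idea underlying SMOTE, the following minimal sketch generates anchor points as random convex combinations of a raw sample and one of its nearest neighbors. It shows plain SMOTE-style interpolation only, not the extension proposed in this study, whose details are not given here. The function name `smote_like_anchor` and the parameters `n_anchor`, `k`, and `seed` are hypothetical names chosen for this example.

```python
import numpy as np

def smote_like_anchor(X, n_anchor, k=5, seed=0):
    """Hypothetical helper: draw n_anchor points by SMOTE-style interpolation.

    Each anchor point is x + lam * (x_nn - x), where x is a randomly chosen
    raw sample, x_nn is one of its k nearest neighbors, and lam ~ U(0, 1).
    This is an assumption-laden sketch, not the paper's proposed method.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if n < 2:
        raise ValueError("need at least two samples to interpolate")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise squared Euclidean distances; exclude each point from its own neighbors.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, n, size=n_anchor)                   # base sample per anchor
    nbr = neighbors[base, rng.integers(0, k, size=n_anchor)]   # one neighbor per anchor
    lam = rng.random((n_anchor, 1))                            # interpolation weights in [0, 1)
    return X[base] + lam * (X[nbr] - X[base])

if __name__ == "__main__":
    raw = np.random.default_rng(1).normal(size=(50, 3))
    anchor = smote_like_anchor(raw, n_anchor=20)
    print(anchor.shape)
```

Because every generated point lies on a segment between two raw samples, the anchor data follows the raw distribution more closely than uniformly random anchor data would. Whether and how such interpolation avoids raising the leakage risk is addressed by the proposed method, not by this sketch.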