Multi-source data fusion, in which multiple data sources are jointly analyzed to obtain improved information, has attracted considerable research attention. For datasets held by multiple medical institutions, data confidentiality and cross-institutional communication are critical. In such cases, data collaboration (DC) analysis, which shares dimensionality-reduced intermediate representations without iterative cross-institutional communication, may be appropriate. Identifiability of the shared data is essential when analyzing data that include personal information. In this study, the identifiability of DC analysis is investigated. The results reveal that the shared intermediate representations are readily identifiable to the original data for supervised learning. This study then proposes a non-readily identifiable DC analysis that shares only non-readily identifiable data, for multiple medical datasets including personal information. The proposed method addresses identifiability concerns through a random sample permutation, the concept of interpretable DC analysis, and the use of functions that cannot be reconstructed. In numerical experiments on medical datasets, the proposed method exhibits non-readily identifiability while maintaining the high recognition performance of the conventional DC analysis. For a hospital dataset, the proposed method exhibits a nine percentage point improvement in recognition performance over the local analysis that uses only the local dataset.
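As a rough illustration of what sharing a dimensionality-reduced, randomly permuted intermediate representation could look like, the following minimal sketch may help. It is not the method evaluated in this study: the helper name `make_intermediate`, the Gaussian random projection used as the private mapping, and the toy data are all assumptions made for illustration only.

```python
import numpy as np


def make_intermediate(X, y, reduced_dim, rng):
    """Hypothetical helper: build a reduced, row-shuffled copy of local data.

    Assumption: the private mapping is a random Gaussian projection.
    The study's actual mapping functions are not specified here.
    """
    n_samples, n_features = X.shape
    # Private linear map that stays inside the institution.
    F = rng.standard_normal((n_features, reduced_dim))
    X_tilde = X @ F
    # Random sample permutation: shared rows no longer follow the
    # order of the local records.
    perm = rng.permutation(n_samples)
    return X_tilde[perm], y[perm], F, perm


# Toy local dataset (illustrative values only).
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5))
y = np.array([0, 1, 0, 1, 1, 0])

X_shared, y_shared, F, perm = make_intermediate(X, y, reduced_dim=2, rng=rng)
```

In this sketch only `X_shared` and `y_shared` would leave the institution, while `F` and `perm` remain local. Features and labels are permuted with the same index vector so that each shared row keeps its own label, which supervised learning requires.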