The performance of machine learning algorithms relies heavily on the availability of large amounts of training data. In reality, however, data usually reside with distributed parties, such as different institutions, and may not be directly gathered and integrated due to various data-policy constraints. As a result, some parties may have insufficient data for training machine learning models. In this paper, we propose a multi-party dual learning (MPDL) framework to alleviate the problem of limited, poor-quality data at an isolated party. Since the knowledge-sharing processes among multiple parties always emerge in dual forms, we show that dual learning is naturally suited to the challenge of missing data, and it explicitly exploits the probabilistic correlation and structural relationship between dual tasks to regularize the training process. We introduce feature-oriented differential privacy, with mathematical proof, to avoid possible leakage of raw features during the dual inference process. The approach requires minimal modification to existing multi-party learning structures, and each party can separately build flexible and powerful models whose accuracy is no lower than that of non-distributed self-learning approaches. As we demonstrate through simulations on real-world datasets, the MPDL framework achieves significant improvements over state-of-the-art multi-party learning methods.
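The abstract does not spell out the dual regularizer or the privacy mechanism. As a rough, non-authoritative illustration, the sketch below shows the standard probabilistic-duality penalty from dual supervised learning (penalizing the gap between the two factorizations of the joint probability, P(x)P(y|x) = P(y)P(x|y)), together with a hypothetical feature-level Gaussian mechanism applied before features leave a party. The function names, clipping threshold, and noise scale are illustrative assumptions, not the paper's actual construction.

```python
import torch

def duality_regularizer(log_px, log_py, log_py_given_x, log_px_given_y):
    """Probabilistic-duality penalty from dual supervised learning.

    The identity P(x)P(y|x) = P(y)P(x|y) is enforced softly by
    penalizing the squared gap between the two log-joint estimates.
    All arguments are per-example log-probabilities (shape: [batch]).
    """
    gap = (log_px + log_py_given_x) - (log_py + log_px_given_y)
    return (gap ** 2).mean()

def privatize_features(features, clip_norm=1.0, noise_std=0.5):
    """Hypothetical feature-level Gaussian mechanism (an assumption,
    not the paper's proved construction): clip each feature vector to
    a fixed L2 norm to bound sensitivity, then add Gaussian noise
    before the features are shared for dual inference. In practice
    noise_std would be calibrated to an (epsilon, delta) budget."""
    norms = features.norm(dim=1, keepdim=True).clamp(min=clip_norm)
    clipped = features * (clip_norm / norms)
    return clipped + noise_std * torch.randn_like(clipped)
```

During training, a weighted `duality_regularizer` term would be added to each party's task loss, so the primal and dual models constrain one another even when local data are scarce.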