Nowadays, gathering high-quality training data from multiple data sources while preserving privacy is a crucial challenge for training high-performance machine learning models. Potential solutions could break the barriers among isolated data silos and thereby enlarge the range of data available for processing. To this end, both academic researchers and industrial vendors have recently been strongly motivated to propose two mainstream categories of solutions, mainly based on software constructions: 1) Secure Multi-party Learning (MPL for short); and 2) Federated Learning (FL for short). These two technical categories have their own advantages and limitations when evaluated according to the following five criteria: security, efficiency, data distribution, the accuracy of trained models, and application scenarios. Motivated to demonstrate the research progress and discuss insights on future directions, we thoroughly investigate the protocols and frameworks of both MPL and FL. First, we define the problem of Training machine learning Models over Multiple data sources with Privacy Preservation (TMMPP for short). Then, we compare recent studies of TMMPP in terms of technical routes, the number of parties supported, data partitioning, threat models, and machine learning models supported, to show their advantages and limitations. Next, we investigate and evaluate five popular FL platforms. Finally, we discuss potential directions for resolving the problem of TMMPP in the future.