The data needed for machine learning (ML) model training can reside in separate sites, often termed data silos. For data-intensive ML applications, data silos pose a major challenge: integrating and transforming the data demands substantial manual work and computational resources. Because of data privacy and security constraints, data often cannot leave the local sites, so a model has to be trained in a decentralized manner. In this work, we present a vision of how to bridge traditional data integration (DI) techniques with the requirements of modern machine learning. We explore the possibilities of using metadata obtained from data integration processes to improve the effectiveness and efficiency of ML models. We analyze two common use cases over data silos: feature augmentation and federated learning. Bringing data integration and machine learning together, we highlight new research opportunities in systems, representations, factorized learning, and federated learning.