Cross-device federated learning (FL) has been well studied from algorithmic, system scalability, and training speed perspectives. Nonetheless, moving from centralized training to cross-device FL for millions or billions of devices presents many risks, including performance loss, developer inertia, poor user experience, and unexpected application failures. In addition, the corresponding infrastructure, development costs, and return on investment are difficult to estimate. In this paper, we present a device-cloud collaborative FL platform that integrates with an existing machine learning platform, providing tools to measure real-world constraints, assess infrastructure capabilities, evaluate model training performance, and estimate system resource requirements to responsibly bring FL into production. We also present a decision workflow that leverages the FL-integrated platform to comprehensively evaluate the trade-offs of cross-device FL and share our empirical evaluations of business-critical machine learning applications that impact hundreds of millions of users.