Advances in federated learning (FL) algorithms, along with technologies such as differential privacy and homomorphic encryption, have led to FL being increasingly adopted in many application domains. This adoption has driven rapid growth in the number, size (number of participants/parties) and diversity (intermittent vs. active parties) of FL jobs. Many existing FL systems, which rely on centralized (often single) model aggregators, cannot scale to handle large FL jobs or adapt to parties' behavior. In this paper, we present a new scalable and adaptive architecture for FL aggregation. First, we demonstrate how traditional tree-overlay-based aggregation techniques (from P2P, publish-subscribe and stream processing research) can help FL aggregation scale, but are ineffective from a resource utilization and cost standpoint. Next, we present the design and implementation of AdaFed, which uses serverless/cloud functions to adaptively scale aggregation in a resource-efficient and fault-tolerant manner. We describe how AdaFed enables FL aggregation to be deployed dynamically only when necessary, scaled elastically to handle participant joins/leaves, and made fault tolerant with minimal effort required on the (aggregation) programmer side. We also demonstrate that our prototype, based on Ray, scales to thousands of participants and achieves a >90% reduction in resource requirements and cost, with minimal impact on aggregation latency.
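To make the tree-overlay idea concrete, the following is a minimal sketch of hierarchical aggregation in plain Python. It is not AdaFed's implementation: the function names (`to_partial`, `combine`, `tree_aggregate`), the `fan_in` parameter, and the choice of sample-weighted averaging as the aggregation rule are all assumptions made for illustration. The point it shows is that when each party's update is reduced to a partial sum plus a sample count, the merge step is associative, so it can run at any interior node of a tree and still produce the same result as a single central aggregator.

```python
from typing import List, Tuple

# A partial aggregate: (sample-weighted sum of parameters, total sample count).
# Hypothetical representation chosen for this sketch.
Partial = Tuple[List[float], int]


def to_partial(params: List[float], num_samples: int) -> Partial:
    """Convert one party's model update into a partial aggregate."""
    return [p * num_samples for p in params], num_samples


def combine(partials: List[Partial]) -> Partial:
    """Merge partial aggregates by summing vectors and counts.

    The operation is associative and commutative, so it can be applied
    at any node of an aggregation tree, in any grouping.
    """
    dim = len(partials[0][0])
    total = [0.0] * dim
    count = 0
    for vec, n in partials:
        for i in range(dim):
            total[i] += vec[i]
        count += n
    return total, count


def tree_aggregate(partials: List[Partial], fan_in: int = 2) -> List[float]:
    """Reduce partial aggregates level by level; each node merges up to fan_in children."""
    if not partials:
        raise ValueError("need at least one update")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = partials
    while len(level) > 1:
        # Each slice stands in for one interior aggregator node.
        level = [combine(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    total, count = level[0]
    # Normalize once, at the root, to obtain the weighted average.
    return [t / count for t in total]


if __name__ == "__main__":
    updates = [
        to_partial([1.0, 2.0], 10),
        to_partial([3.0, 4.0], 30),
        to_partial([5.0, 6.0], 60),
    ]
    print(tree_aggregate(updates, fan_in=2))
```

In this sketch every call to `combine` is independent of the others at the same level, which is what would allow each one to run as a separate short-lived function rather than on a long-running aggregator; how AdaFed actually schedules such work is described in the paper itself, not here.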