With privacy as a motivation, Federated Learning (FL) is an increasingly used paradigm where learning takes place collectively on edge devices, each with a cache of user-generated training examples that remain resident on the local device. These on-device training examples are gathered in situ during the course of users' interactions with their devices, and thus are highly reflective of at least part of the inference data distribution. Yet a distribution shift may still exist; the on-device training examples may lack some data inputs expected to be encountered at inference time. This paper proposes a way to mitigate this shift: selective usage of datacenter data, mixed in with FL. By mixing decentralized (federated) and centralized (datacenter) data, we can form an effective training data distribution that better matches the inference data distribution, resulting in more useful models while still meeting the private training data access constraints imposed by FL.
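The core idea of forming an effective training distribution can be illustrated with a small sketch. The distributions, bucket counts, and the choice of KL divergence as the matching criterion below are all illustrative assumptions, not details from the paper: we model the federated, datacenter, and inference data as discrete distributions over input buckets, and sweep a mixing weight to find the mixture closest to the inference distribution.

```python
import numpy as np

# Hypothetical discrete distributions over 4 input buckets (illustrative only).
p_fed = np.array([0.50, 0.30, 0.15, 0.05])   # on-device (federated) training data
p_dc  = np.array([0.05, 0.15, 0.30, 0.50])   # datacenter training data
p_inf = np.array([0.30, 0.25, 0.25, 0.20])   # inference-time data

def kl(p, q):
    """KL divergence D(p || q), assuming strictly positive entries in q."""
    return float(np.sum(p * np.log(p / q)))

def mix(alpha):
    """Effective training distribution: alpha parts federated, rest datacenter."""
    return alpha * p_fed + (1.0 - alpha) * p_dc

# Sweep the mixing weight and keep the mixture closest to the inference data.
alphas = np.linspace(0.0, 1.0, 101)
best_alpha = min(alphas, key=lambda a: kl(p_inf, mix(a)))
```

Under these assumed distributions, the best mixture lies strictly between the pure-federated and pure-datacenter extremes, matching the paper's claim that combining both sources can better track the inference distribution than either alone.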