Federated Learning (FL) is a privacy-focused machine learning paradigm that trains models collaboratively, directly on edge devices. Simulated environments are crucial for large-scale FL research, allowing scientists to test new ideas quickly without acquiring millions of devices. However, current simulators cannot reach the scale necessary to emulate production systems or to push the boundaries of research in a time-efficient manner. This work proposes \emph{Pollen}, a novel resource-aware system for speeding up FL simulations. \emph{Pollen} addresses two limiting factors of previous systems: (a) the communication inefficiency of pull-based client execution and (b) the system inefficiencies caused by heterogeneous simulation hardware, which prior systems ignore. \emph{Pollen} executes high-throughput FL simulations at scale by (a) using a push-based client placement system and (b) balancing clients across servers and their GPUs with a novel online machine learning model. Furthermore, \emph{Pollen}'s placement model reduces GPU idle time by up to 50\% by providing accurate training-time predictions, allowing researchers to run extensive experiments that sample from millions of clients. We evaluate \emph{Pollen} on four representative FL tasks, comparing it to ad-hoc FL frameworks as well as \emph{Flower}, \emph{Flute}, \emph{FedScale}, and \emph{Parrot}, and show speed-ups that save days or weeks of experiment time.
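To make the two ingredients above concrete, the sketch below illustrates one possible form of a push-based, prediction-driven placement loop: a coordinator pushes each sampled client to the worker GPU with the earliest predicted finish time, and each worker's runtime predictor is updated online as clients complete. All names (\texttt{Worker}, \texttt{predict\_time}, \texttt{push\_placement}) and the naive per-sample runtime model are hypothetical illustrations of the idea described in the abstract, not \emph{Pollen}'s actual API or placement model.

\begin{verbatim}
# Minimal sketch of a push-based, resource-aware client placement loop.
# All names below are hypothetical illustrations, not Pollen's actual API.
import heapq
from dataclasses import dataclass, field


@dataclass
class Worker:
    """A GPU-backed simulation worker tracked by the coordinator."""
    worker_id: int
    busy_until: float = 0.0  # predicted time when the current load finishes
    history: list = field(default_factory=list)  # (client_size, observed_time)

    def predict_time(self, client_size: int) -> float:
        """Naive online estimate: mean observed seconds per data sample."""
        if not self.history:
            return 1.0  # default guess before any observations
        per_sample = sum(t / s for s, t in self.history) / len(self.history)
        return per_sample * client_size

    def record(self, client_size: int, observed_time: float) -> None:
        """Update the online runtime model with a completed client's runtime."""
        self.history.append((client_size, observed_time))


def push_placement(clients: list, workers: list) -> dict:
    """Push each sampled client to the worker with the earliest predicted finish.

    `clients` is a list of (client_id, dataset_size) pairs; the coordinator
    pushes work to workers instead of workers pulling it, so no GPU waits idle.
    """
    heap = [(w.busy_until, w.worker_id) for w in workers]
    heapq.heapify(heap)
    by_id = {w.worker_id: w for w in workers}
    assignment = {}
    for client_id, size in clients:
        busy_until, wid = heapq.heappop(heap)
        worker = by_id[wid]
        finish = busy_until + worker.predict_time(size)
        assignment[client_id] = wid
        worker.busy_until = finish
        heapq.heappush(heap, (finish, wid))
    return assignment
\end{verbatim}

In this toy version, load balance across heterogeneous GPUs emerges from the predicted finish times alone; the abstract's online machine learning model plays the role of \texttt{predict\_time} here, with the per-sample average standing in purely for illustration.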