Collaborative Machine Learning is a paradigm in the field of distributed machine learning, designed to address the challenges of data privacy, communication overhead, and model heterogeneity. There have been significant advancements in optimization and communication algorithm design and ML hardware that enables fair, efficient and secure collaborative ML training. However, less emphasis is put on collaborative ML infrastructure development. Developers and researchers often build server-client systems for a specific collaborative ML use case, which is not scalable and reusable. As the scale of collaborative ML grows, the need for a scalable, efficient, and ideally multi-tenant resource management system becomes more pressing. We propose a novel system, Propius, that can adapt to the heterogeneity of client machines, and efficiently manage and control the computation flow between ML jobs and edge resources in a scalable fashion. Propius is comprised of a control plane and a data plane. The control plane enables efficient resource sharing among multiple collaborative ML jobs and supports various resource sharing policies, while the data plane improves the scalability of collaborative ML model sharing and result collection. Evaluations show that Propius outperforms existing resource management techniques and frameworks in terms of resource utilization (up to $1.88\times$), throughput (up to $2.76$), and job completion time (up to $1.26\times$).
翻译:暂无翻译