Distributed machine learning research deploys tasks with large-scale data and computation across multiple machines. Its core idea is "divide and conquer," which effectively speeds up large-scale data computation and reduces overhead.


Scaling ML is often underestimated. What does it actually take to train an ML model (originally implemented for a single CPU/GPU) on multiple machines? Some pain points are: (1) many new lines of code must be written to convert the code to a distributed version (a sketch of this follows below); (2) the code needs substantial tuning to reach good system/statistical performance, an extra step on top of model development; (3) deciding which and how many hardware resources to use for training and deploying the model; (4) from an organizational perspective, automating resource sharing across many users and jobs so as to satisfy user demand while maximizing resource utilization and minimizing cost.
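To make pain point (1) concrete, here is a minimal sketch (not from the tutorial itself) of the extra code typically needed to convert a single-GPU/CPU PyTorch training loop to data-parallel training with DistributedDataParallel; the model, data, and hyperparameters are placeholders:

```python
# Minimal sketch: converting a single-process PyTorch loop to DDP.
# Assumed launch command: torchrun --nproc_per_node=4 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="gloo")    # new: join the process group
    model = torch.nn.Linear(10, 1)
    model = DDP(model)                         # new: wrap model for gradient sync

    data = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    sampler = DistributedSampler(data)         # new: shard data across workers
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(3):
        sampler.set_epoch(epoch)               # new: reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()    # DDP all-reduces gradients here
            opt.step()

    dist.destroy_process_group()               # new: clean shutdown

if __name__ == "__main__":
    main()
```

Every line marked "new" is code that does not exist in the single-machine version, which is exactly the conversion burden the tutorial refers to.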

In this tutorial, we present techniques for automating distributed ML infrastructure. The tutorial covers three areas critical to ML parallelization: (1) composing and standardizing parallel ML building blocks; (2) representations and software frameworks for ML parallelism; (3) algorithms and systems for automatic ML parallelization, as well as resource allocation for ML jobs on shared clusters. By exposing the unique characteristics of ML programs and dissecting successful cases to show how those characteristics can be exploited, we offer ML researchers and practitioners opportunities to further shape and grow the SysML field.
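As a toy illustration of the most common parallel building block mentioned in area (1), the following single-process simulation (my own example, not material from the tutorial) shows data-parallel SGD: each "worker" computes a gradient on its data shard, and the gradients are averaged, which is the effect an all-reduce achieves across real machines:

```python
# Toy single-process simulation of the data-parallel gradient-averaging
# building block, on a simple linear least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.01 * rng.normal(size=400)

num_workers = 4
shards = np.array_split(np.arange(400), num_workers)  # each worker owns a shard
w = np.zeros(5)

for step in range(200):
    # each worker computes a local gradient of the squared loss on its shard
    grads = [2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx) for idx in shards]
    # "all-reduce": average local gradients, then apply one shared update
    w -= 0.05 * np.mean(grads, axis=0)

print("recovered weights close to truth:", np.allclose(w, true_w, atol=1e-2))
```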

The audience should be familiar with the basics of ML and DL. Familiarity with TensorFlow, PyTorch, and distributed ML techniques is helpful but not required.

https://sites.google.com/view/aaai-2021-tutorial-ah9/home


Latest Papers

Federated Learning (FL) is a novel distributed machine learning paradigm that allows thousands of edge devices to train a model locally without uploading data centrally to the server. But since real federated settings are resource-constrained, FL suffers from systems heterogeneity, which directly causes many stragglers and indirectly leads to significant accuracy reduction. To solve the problems caused by systems heterogeneity, we introduce FedSAE, a novel self-adaptive federated framework that automatically adjusts the training task of each device and actively selects participants to alleviate the performance degradation. In this work, we 1) propose FedSAE, which leverages the complete information of devices' historical training tasks to predict an affordable training workload for each device; in this way, FedSAE can estimate the reliability of each device and self-adaptively adjust the training load per client in each round; and 2) combine our framework with Active Learning to self-adaptively select participants, which accelerates the convergence of the global model. In our framework, the server evaluates each device's training value based on its training loss, and then selects the clients of greater value to the global model in order to reduce communication overhead. The experimental results indicate that in a highly heterogeneous system, FedSAE converges faster than FedAvg, the vanilla FL framework. Furthermore, FedSAE outperforms FedAvg on several federated datasets: FedSAE improves test accuracy by 26.7% and reduces stragglers by 90.3% on average.
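The abstract does not give FedSAE's exact prediction or selection rules, so the sketch below uses simple stand-ins to illustrate the two ideas it describes: an exponential moving average of completed workloads as the per-device workload predictor, and loss-proportional sampling as the "value"-based participant selection. All names and update rules here are illustrative assumptions, not the paper's method:

```python
# Hedged, illustrative sketch of the FedSAE ideas: predict an affordable
# workload per device from its history, and select participants by training
# loss. Stand-in rules only; not the paper's actual algorithm.
import random

class Client:
    def __init__(self, cid, capacity):
        self.cid = cid
        self.capacity = capacity   # true affordable epochs (unknown to server)
        self.history = []          # completed workloads observed by the server
        self.last_loss = 1.0

def predict_workload(client, default=1.0):
    # stand-in predictor: exponential moving average of completed workloads
    est = default
    for w in client.history:
        est = 0.5 * est + 0.5 * w
    return est

def select_participants(clients, k):
    # stand-in for value-based selection: higher-loss clients are deemed
    # more valuable to the global model this round
    return random.choices(clients, weights=[c.last_loss for c in clients], k=k)

def run_round(clients, k):
    for c in select_participants(clients, k):
        assigned = predict_workload(c)
        completed = min(assigned, c.capacity)  # straggles if capacity < assigned
        c.history.append(completed)
        c.last_loss *= 0.9 ** completed        # pretend local training cuts loss
        # ...the server would aggregate this client's update here (FedAvg-style)

clients = [Client(i, capacity=random.uniform(0.5, 3.0)) for i in range(20)]
for _ in range(10):
    run_round(clients, k=5)
```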
