Modern ML applications increasingly rely on complex deep learning models and large datasets. There has been an exponential growth in the amount of computation needed to train the largest models. Therefore, to scale computation and data, these models are inevitably trained in a distributed manner in clusters of nodes, and their updates are aggregated before being applied to the model. However, a distributed setup is prone to Byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We extend the current state-of-the-art aggregators and propose an optimization-based subspace estimator that models pairwise distances as quadratic functions, building on the recently introduced Flag Median problem. The estimator in our loss function favors pairs that preserve the norm of the difference vector. We show theoretically that our approach enhances the robustness of state-of-the-art Byzantine-resilient aggregators. We also evaluate our method on different tasks in a distributed setup with a parameter server architecture and show its communication efficiency while maintaining comparable accuracy. The code is publicly available at https://github.com/hamidralmasi/FlagAggregator
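To make the subspace-estimation idea concrete, below is a minimal, hypothetical sketch rather than the authors' implementation (the repository linked above contains the real one). It treats each worker gradient as a one-dimensional subspace and computes a flag-median-style robust direction via iteratively reweighted least squares, in the spirit of FlagIRLS for the Flag Median; the helpers `flag_median_direction` and `aggregate`, the specific IRLS weights, and the median-scaled projection at the end are all illustrative assumptions.

```python
import numpy as np

def flag_median_direction(grads, iters=20, eps=1e-8):
    """IRLS sketch of a flag-median-style subspace estimate.

    Treats each worker gradient as a one-dimensional subspace and returns a
    robust "median" direction: the principal eigenvector of a reweighted
    scatter matrix, where the weights shrink the influence of outlying
    (e.g., Byzantine) gradients. This is an illustrative simplification,
    not the Flag Aggregator's actual optimization.
    """
    U = np.stack([g / (np.linalg.norm(g) + eps) for g in grads])  # unit directions
    y = U.mean(axis=0)
    y /= np.linalg.norm(y) + eps
    for _ in range(iters):
        # Sine distance between span(u_i) and span(y); IRLS weight is 1/distance,
        # so gradients far from the current subspace are downweighted.
        cos = U @ y
        d = np.sqrt(np.maximum(1.0 - cos**2, eps))
        W = (U * (1.0 / d)[:, None]).T @ U  # weighted scatter: sum_i w_i u_i u_i^T
        _, vecs = np.linalg.eigh(W)
        y = vecs[:, -1]  # principal eigenvector (eigh sorts ascending)
    return y

def aggregate(grads):
    """Project worker gradients onto the robust direction.

    Uses the median of the projection coefficients as a robust scale, so the
    result is invariant to the eigenvector's sign ambiguity.
    """
    y = flag_median_direction(grads)
    coeffs = np.array([g @ y for g in grads])
    return np.median(coeffs) * y

if __name__ == "__main__":
    # Toy check: 8 honest workers near a shared gradient, 2 Byzantine outliers.
    rng = np.random.default_rng(0)
    true_grad = rng.normal(size=64)
    honest = [true_grad + 0.1 * rng.normal(size=64) for _ in range(8)]
    byzantine = [100.0 * rng.normal(size=64) for _ in range(2)]
    agg = aggregate(honest + byzantine)
    print(np.linalg.norm(agg - true_grad))  # small despite the outliers
```

In this toy setting the IRLS weights suppress the Byzantine vectors because their one-dimensional subspaces sit far from the honest cluster; the paper's estimator additionally works with pairwise distances modeled as quadratics and favors pairs preserving the norm of the difference vector, which this rank-1 sketch does not capture.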