Hardware compute power has been growing at an unprecedented rate in recent years. Utilizing such advancements plays a key role in producing better results in less time -- both in academia and industry. However, integrating existing hardware with the latest hardware within the same ecosystem is a challenging task. One of the key challenges in this setting is the variation in compute power across workers. In this paper, we consider the training of deep neural networks on a distributed system of workers with varying compute power. A naive implementation of synchronous distributed training results in the faster workers waiting for the slowest worker to complete its processing. To mitigate this issue, we propose to dynamically adjust the data assigned to each worker during training. We assign each worker a partition of the total data proportional to its compute power. Our experiments show that dynamically adjusting the data partitions improves the utilization of the system and significantly reduces the training time. Code is available at the repository: \url{https://github.com/vineeths96/Heterogeneous-Systems}.
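As a minimal sketch of the proportional partitioning idea described above (the function name, the throughput inputs, and the rounding scheme are illustrative assumptions, not taken from the repository), one could split the dataset indices according to each worker's measured throughput as follows:

\begin{verbatim}
import numpy as np

def proportional_partition(num_samples, throughputs):
    """Split num_samples indices into one partition per worker,
    sized proportionally to each worker's measured throughput."""
    throughputs = np.asarray(throughputs, dtype=float)
    shares = throughputs / throughputs.sum()
    # Largest-remainder rounding so the partition sizes
    # sum exactly to num_samples.
    sizes = np.floor(shares * num_samples).astype(int)
    remainder = num_samples - sizes.sum()
    order = np.argsort(-(shares * num_samples - sizes))
    sizes[order[:remainder]] += 1
    # Shuffle indices and cut them at the cumulative boundaries.
    boundaries = np.cumsum(sizes)[:-1]
    indices = np.random.permutation(num_samples)
    return np.split(indices, boundaries)

# Example: three workers, the first twice as fast as the others.
parts = proportional_partition(10000, throughputs=[2.0, 1.0, 1.0])
print([len(p) for p in parts])   # -> [5000, 2500, 2500]
\end{verbatim}

In such a scheme, the throughput estimates would be refreshed during training so that the partition sizes track each worker's current speed rather than a static initial measurement.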