Multi-Chip-Modules (MCMs) reduce the design and fabrication cost of machine learning (ML) accelerators while delivering performance and energy efficiency on par with a monolithic large chip. However, ML compilers targeting MCMs need to solve complex optimization problems optimally and efficiently to achieve this high performance. One such problem is the multi-chip partitioning problem, where compilers determine the optimal partitioning and placement of operations in tensor computation graphs on chiplets in MCMs. Partitioning ML graphs for MCMs is particularly hard, as the search space grows exponentially with the number of chiplets available and the number of nodes in the neural network. Furthermore, the constraints imposed by the underlying hardware produce a search space where valid solutions are extremely sparse. In this paper, we present a strategy using a deep reinforcement learning (RL) framework to emit a possibly invalid candidate partition that is then corrected by a constraint solver. Using the constraint solver ensures that RL encounters valid solutions in the sparse space frequently enough to converge with fewer samples than non-learned strategies. The architectural choices we make for the policy network allow us to generalize across different ML graphs. Our evaluation of a production-scale model, BERT, on real hardware reveals that the partitioning generated by the RL policy achieves 6.11% and 5.85% higher throughput than random search and simulated annealing, respectively. In addition, fine-tuning the pre-trained RL policy reduces the search time from 3 hours to only 9 minutes, while achieving the same throughput as training the RL policy from scratch.
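The propose-then-repair loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the RL policy is replaced by a random stub, the constraint solver by a toy greedy capacity repair, and all names (`propose_partition`, `repair`, `NUM_CHIPLETS`, `MAX_NODES_PER_CHIPLET`) are hypothetical.

```python
import random

# Hypothetical sketch of the propose-then-repair strategy: a policy emits a
# possibly invalid chiplet assignment per graph node, and a "solver" corrects
# it so that hardware constraints hold. The capacity constraint here is a toy
# stand-in for the real hardware constraints.

NUM_CHIPLETS = 4
MAX_NODES_PER_CHIPLET = 3  # toy per-chiplet capacity constraint

def propose_partition(num_nodes):
    """Stand-in for the RL policy network: assign each node of the tensor
    computation graph to a chiplet, possibly violating constraints."""
    return [random.randrange(NUM_CHIPLETS) for _ in range(num_nodes)]

def repair(partition):
    """Toy stand-in for the constraint solver: greedily move nodes off
    overloaded chiplets so the capacity constraint is satisfied."""
    loads = [0] * NUM_CHIPLETS
    fixed = []
    for chip in partition:
        if loads[chip] >= MAX_NODES_PER_CHIPLET:
            chip = min(range(NUM_CHIPLETS), key=loads.__getitem__)
        loads[chip] += 1
        fixed.append(chip)
    return fixed

candidate = propose_partition(num_nodes=10)   # may be invalid
valid = repair(candidate)                     # always satisfies the constraint
assert all(valid.count(c) <= MAX_NODES_PER_CHIPLET
           for c in range(NUM_CHIPLETS))
```

Because the repair step guarantees every sample the policy sees is valid, the learner's reward signal is never dominated by constraint violations, which is what lets RL converge with fewer samples in a sparse-feasibility search space.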