Auto-tuning, the process of optimizing the latency of DNN operators with machine-learning models and hardware in the loop, has established itself as a pervasive method for deploying neural networks. From a search space of loop optimizations, the candidate with the best performance must be selected, where the performance of individual configurations is evaluated through hardware measurements. The combinatorial explosion of possible configurations, together with the cost of hardware evaluation, makes exhaustive exploration of the search space infeasible in practice. Machine-learning methods, such as random forests or reinforcement learning, are therefore used to guide the selection of candidates for hardware evaluation. For general-purpose hardware such as x86 and GPGPU architectures, impressive performance gains can be achieved compared to hand-optimized libraries like cuDNN. The method is also useful for hardware accelerators with less widespread adoption, where a high-performance library is not always available. However, hardware accelerators are often less flexible in their programming, which leads to operator configurations that are not executable on the hardware target. This work evaluates how these invalid configurations affect the auto-tuning process and its underlying performance-prediction model for the VTA hardware. From these results, a validity-driven initialization method for AutoTVM is developed that requires only 41.6% of the hardware measurements otherwise needed to find the best solution, while improving search robustness.
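To make the measurement loop described above concrete, the following is a minimal sketch of an AutoTVM-style tuning session, following the public TVM tutorial API rather than the paper's VTA setup: a templated operator spans a search space of loop tilings, and a cost-model-guided tuner (XGBoost-based in stock AutoTVM; the abstract mentions random forests as an alternative) picks which candidates to measure on hardware. The template name "tutorial/matmul", the problem sizes, and the trial budget are illustrative assumptions.

```python
import tvm
from tvm import autotvm, te

# A tunable matmul template: the knobs "tile_y"/"tile_x" span the
# search space of loop optimizations mentioned in the abstract.
@autotvm.template("tutorial/matmul")
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    s = te.create_schedule(C.op)

    y, x = s[C].op.axis
    cfg = autotvm.get_config()
    cfg.define_split("tile_y", y, num_outputs=2)  # one point in the space = one tiling
    cfg.define_split("tile_x", x, num_outputs=2)
    yo, yi = cfg["tile_y"].apply(s, C, y)
    xo, xi = cfg["tile_x"].apply(s, C, x)
    s[C].reorder(yo, xo, yi, xi)
    return s, [A, B, C]

task = autotvm.task.create("tutorial/matmul", args=(512, 512, 512, "float32"), target="llvm")

# Cost-model-guided tuner: a learned model proposes candidates,
# which are then evaluated by real hardware measurements.
tuner = autotvm.tuner.XGBTuner(task)
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5),  # repeat runs per candidate
)
tuner.tune(
    n_trial=100,  # illustrative budget; far below the full space size
    measure_option=measure_option,
    callbacks=[autotvm.callback.log_to_file("matmul.log")],
)
```

On a target like VTA, some points in this space would be invalid (not executable on the accelerator); the initialization method developed in this work aims to steer the tuner's early measurements toward valid configurations.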