Despite the superb performance of state-of-the-art (SOTA) DNNs, their increasing computational cost makes it very challenging to meet real-time latency and accuracy requirements. Although DNN runtime latency is dictated by model properties (e.g., architecture, operations), hardware properties (e.g., utilization, throughput), and, more importantly, the effective mapping between the two, many existing approaches focus only on optimizing model properties such as FLOPs reduction and overlook the mismatch between DNN model and hardware properties. In this work, we show that the mismatch between varied DNN computation workloads and GPU capacity can cause an idle GPU tail effect, leading to GPU under-utilization and low throughput. As a result, FLOPs reduction cannot bring effective latency reduction, which causes sub-optimal accuracy-versus-latency trade-offs. Motivated by this, we propose a GPU runtime-aware DNN optimization methodology that adaptively eliminates this GPU tail effect on GPU platforms. Our methodology can be applied on top of existing SOTA DNN optimization approaches to achieve better latency and accuracy trade-offs. Experiments show 11%-27% latency reduction and 2.5%-4.0% accuracy improvement over several SOTA DNN pruning and NAS methods, respectively.
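The GPU tail effect mentioned above can be illustrated with a toy occupancy model. The sketch below is not the paper's method; the function name `tail_utilization` and the parameters `num_blocks`/`num_sms` are hypothetical illustration names. It assumes the simplified view that thread blocks are scheduled in full waves across the streaming multiprocessors (SMs), so a partial final wave leaves SMs idle:

```python
import math

def tail_utilization(num_blocks: int, num_sms: int) -> float:
    """Average SM utilization when num_blocks thread blocks run in
    waves over num_sms SMs; the partial last wave is the 'tail'."""
    waves = math.ceil(num_blocks / num_sms)
    return num_blocks / (waves * num_sms)

# 80 blocks on an 80-SM GPU fill exactly one wave: no tail.
print(tail_utilization(80, 80))   # -> 1.0
# 81 blocks need a second wave that occupies only 1 of 80 SMs,
# so trimming FLOPs from 160 blocks down to 81 barely helps latency.
print(tail_utilization(81, 80))   # -> 0.50625
```

Under this simplified model, a workload whose block count slightly exceeds a multiple of the SM count nearly doubles the number of waves, which is why FLOPs reduction alone does not translate into proportional latency reduction.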