Recent years have witnessed impressive progress in super-resolution (SR) processing. However, the real-time inference requirement poses a challenge not only for model design but also for on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. We analyze in detail the dictionary learning algorithm used in SR models and accelerate it via a novel dictionary-selective strategy. In addition, we analyze the hardware programming architecture together with the model structure to guide the design of computation kernels that minimize inference latency under resource constraints. With these techniques, the communication and computation bottlenecks in deep dictionary learning-based SR models are effectively addressed. Experiments on the edge-embedded NVIDIA NX and on the 2080Ti show that our method significantly outperforms the state-of-the-art NVIDIA TensorRT and achieves real-time performance.
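To give a rough sense of the idea behind a dictionary-selective step, the sketch below picks a subset of dictionary atoms per input patch before reconstruction, so the subsequent sparse-coding step only touches a reduced dictionary. This is only an illustrative assumption: the abstract does not describe the actual selective strategy, and the function, the top-k correlation criterion, and all names here are hypothetical.

```python
import numpy as np

def select_atoms(patch_feat, dictionary, k=16):
    """Hypothetical sketch: keep the k dictionary atoms most correlated
    with the input patch feature, shrinking the dictionary used by the
    later reconstruction step. Not the paper's actual strategy.

    patch_feat : (d,) feature vector extracted from an LR patch
    dictionary : (d, n_atoms) matrix whose columns are dictionary atoms
    """
    # Correlation score between the patch feature and every atom.
    scores = np.abs(dictionary.T @ patch_feat)
    # Indices of the k best-matching atoms.
    top_idx = np.argpartition(scores, -k)[-k:]
    # Reduced dictionary handed to the SR reconstruction.
    return dictionary[:, top_idx], top_idx

# Usage sketch with random data standing in for real LR patch features.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 1024))   # 1024 atoms of dimension 64
f = rng.standard_normal(64)           # one patch feature
D_sel, idx = select_atoms(f, D, k=16)
print(D_sel.shape)                    # (64, 16)
```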