Convolutions are the core operation of deep learning applications based on Convolutional Neural Networks (CNNs). Current GPU architectures are highly efficient for training and deploying deep CNNs, and are therefore widely used in production for this purpose. State-of-the-art implementations, however, are inefficient for some commonly used network configurations. In this paper we propose a GPU-based implementation of the convolution operation for CNN inference that favors coalesced memory accesses without requiring prior data transformations. Our experiments demonstrate that our proposal yields notable performance improvements across a range of common CNN forward-propagation convolution configurations, achieving speedups of up to 2.29x with respect to the best convolution implementation in cuDNN, and hence covering a relevant region among currently existing approaches.