Transpose convolution has gained prominence in many deep learning applications. However, transpose convolution layers are computationally intensive because the input feature map is expanded by inserting zeros after each element in every row and column. Performing the convolution operation on this expanded feature map therefore leads to poor utilization of hardware resources, and the zeros at these predefined positions are the main source of unnecessary multiplication operations. To address these problems, we propose an algorithm-level optimization technique for efficient transpose convolution implementation. Based on the kernel activation pattern, we segregate the original kernel into four sub-kernels. This scheme reduces memory requirements and eliminates unnecessary multiplications. Our proposed method achieved $3.09\times$ ($3.02\times$) faster computation on a Titan X GPU (Intel Dual-Core CPU) with a flower dataset from the Kaggle website. Furthermore, the proposed optimization method can be applied to existing devices without additional hardware requirements. A simple deep learning model containing one transpose convolution layer was used to evaluate the optimization method; it trained $2.2\times$ faster than the conventional implementation on the MNIST dataset with an Intel Dual-Core CPU.
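To illustrate the idea behind the sub-kernel decomposition, the following NumPy sketch compares a conventional stride-2 transpose convolution (zero insertion followed by full cross-correlation) against an equivalent computation that, for each output pixel, uses only the sub-kernel weights that ever meet a real input element. This is a minimal conceptual sketch, not the authors' implementation: the function names, single-channel layout, and cross-correlation convention are assumptions for illustration only.

```python
import numpy as np

def transpose_conv_naive(x, w, stride=2):
    """Conventional implementation: insert zeros, then cross-correlate.

    Most multiply-accumulate operations hit the inserted zeros.
    """
    n, k = x.shape[0], w.shape[0]
    up = np.zeros((stride * (n - 1) + 1,) * 2)
    up[::stride, ::stride] = x            # zero-inserted feature map
    upp = np.pad(up, k - 1)               # full padding
    out_n = stride * (n - 1) + k
    out = np.zeros((out_n, out_n))
    for p in range(out_n):
        for q in range(out_n):
            out[p, q] = np.sum(upp[p:p + k, q:q + k] * w)
    return out

def transpose_conv_subkernel(x, w, stride=2):
    """Sub-kernel formulation: never materialize the zero-inserted map.

    For an output pixel (p, q), only kernel weights at offsets
    w[a0::stride, b0::stride] can align with real input elements,
    so with stride 2 the kernel splits into four such sub-kernels,
    one per output-position parity class.
    """
    n, k = x.shape[0], w.shape[0]
    out_n = stride * (n - 1) + k
    out = np.zeros((out_n, out_n))
    for p in range(out_n):
        for q in range(out_n):
            a0 = (k - 1 - p) % stride
            b0 = (k - 1 - q) % stride
            acc = 0.0
            for a in range(a0, k, stride):
                u = (p + a - (k - 1)) // stride
                if u < 0 or u >= n:
                    continue
                for b in range(b0, k, stride):
                    v = (q + b - (k - 1)) // stride
                    if v < 0 or v >= n:
                        continue
                    acc += w[a, b] * x[u, v]
            out[p, q] = acc
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 4))
    w = rng.standard_normal((3, 3))
    assert np.allclose(transpose_conv_naive(x, w),
                       transpose_conv_subkernel(x, w))
```

In this sketch the four sub-kernels for stride 2 are the slices `w[a0::2, b0::2]` with `a0, b0` in {0, 1}; each output-parity class reads only its own sub-kernel and only real input elements, which is where the reduction in multiplications and memory traffic comes from.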