Convolutional neural networks (CNNs) have found many applications in tasks involving two-dimensional (2D) data, such as image classification and image processing. As a result, 2D convolution layers have been heavily optimized on CPUs and GPUs. However, in many applications, for example genomics and speech recognition, the data are one-dimensional (1D), and such applications can benefit from optimized 1D convolution layers. In this work, we introduce an efficient implementation of a generic 1D convolution layer covering a wide range of parameters. It is optimized for x86 CPU architectures, in particular for architectures supporting Intel AVX-512 and AVX-512 BFloat16 instructions. We use the LIBXSMM library's batch-reduce General Matrix Multiplication (BRGEMM) kernel for FP32 and BFloat16 precision. We demonstrate that our implementation achieves up to 80% efficiency on Intel Xeon Cascade Lake and Cooper Lake CPUs. We further show the generality of our BRGEMM-based approach by sustaining high efficiency across a range of parameters: it consistently outperforms the 1D convolution layer of the Intel oneDNN library backend across varying input tensor widths, filter widths, numbers of channels, numbers of filters, and dilation parameters. Finally, we demonstrate the performance of our optimized 1D convolution layer in end-to-end neural network training on real genomics datasets, achieving up to 6.86x speedup over the oneDNN-based implementation on Cascade Lake CPUs. We also demonstrate scaling to 16 sockets of Cascade Lake and Cooper Lake CPUs, achieving significant speedups over eight V100 GPUs within a similar power envelope. In end-to-end training, we obtain speedups of 1.41x on Cascade Lake with FP32, 1.57x on Cooper Lake with FP32, and 2.27x on Cooper Lake with BFloat16 over eight V100 GPUs with FP32.
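To make the BRGEMM mapping concrete, the following is a minimal, self-contained C sketch of how a dilated, stride-1 1D convolution decomposes into a batch of small accumulating GEMMs, one per filter tap, all reduced into the same output block. This is exactly the pattern that LIBXSMM's BRGEMM kernel executes in a single optimized call; the plain loop nest, the layouts, and all sizes below are illustrative assumptions, not the paper's actual implementation or the LIBXSMM API.

```c
#include <stdio.h>

/* Illustrative sizes (assumptions, not from the paper). */
#define CIN   4                      /* input channels           */
#define KOUT  3                      /* output filters           */
#define S     3                      /* filter width (taps)      */
#define DIL   2                      /* dilation                 */
#define WIN   16                     /* input width              */
#define WOUT  (WIN - (S - 1) * DIL)  /* output width, stride 1   */

static float in[CIN][WIN];
static float w[S][KOUT][CIN];        /* one KOUT x CIN weight matrix per tap */
static float out[KOUT][WOUT];

int main(void) {
    /* Deterministic toy data so the sketch is runnable end to end. */
    for (int c = 0; c < CIN; ++c)
        for (int x = 0; x < WIN; ++x)
            in[c][x] = 0.1f * (float)(c + 1) + 0.01f * (float)x;
    for (int s = 0; s < S; ++s)
        for (int k = 0; k < KOUT; ++k)
            for (int c = 0; c < CIN; ++c)
                w[s][k][c] = 0.05f * (float)(s + k - c);

    /* Batch-reduce loop: S accumulating GEMMs into one output block.
     * Each tap s contributes out[KOUT][WOUT] += w[s] * in(:, shifted by s*DIL),
     * i.e. a GEMM with M = KOUT, N = WOUT, K = CIN. BRGEMM fuses this
     * whole chain of accumulations into a single optimized kernel call. */
    for (int s = 0; s < S; ++s)              /* batch (reduce) dimension */
        for (int k = 0; k < KOUT; ++k)       /* GEMM M loop              */
            for (int x = 0; x < WOUT; ++x)   /* GEMM N loop              */
                for (int c = 0; c < CIN; ++c)/* GEMM K loop              */
                    out[k][x] += w[s][k][c] * in[c][x + s * DIL];

    /* Checksum so the result is observable. */
    float sum = 0.0f;
    for (int k = 0; k < KOUT; ++k)
        for (int x = 0; x < WOUT; ++x)
            sum += out[k][x];
    printf("output checksum: %f\n", sum);
    return 0;
}
```

Because every tap's contribution lands in the same accumulator, the intermediate partial sums never round-trip through memory, which is the key reason the batch-reduce formulation is attractive for small-matrix workloads like 1D convolution.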