CPU-based inference can be an alternative to off-chip accelerators, and vector architectures are a promising option due to their efficiency. However, the large design space of convolutional algorithms and hardware implementations makes it challenging to select the best option. This paper presents ongoing research into co-designing vector architectures for CPU-based CNN inference, focusing on the im2col+GEMM and Winograd kernels. Using the gem5 simulator, we examine the impact of various hardware microarchitectural features on the RISC-V Vector and ARM-SVE ISAs. We also study the impact of several BLIS-like algorithmic optimizations on im2col+GEMM. Our co-design study shows that longer vector lengths and larger caches can improve performance by 5x with our optimized CNN kernels, compared to a vector length of 512 bits and 1MB of L2 cache. For Winograd, we present a novel inter-tile parallelization approach that exploits longer vector lengths and offers high memory reuse, resulting in up to a 2.4x performance improvement for non-strided convolutional layers with a 3x3 kernel size. Our study also shows that Winograd requires smaller cache sizes than im2col+GEMM.
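To make the im2col+GEMM lowering concrete, the sketch below shows it in plain C for a stride-1, unpadded convolution. The function names, shapes, and the naive triple-loop GEMM are illustrative assumptions only, not the optimized BLIS-like kernels evaluated in this work; in practice the inner loop is what a vectorizing compiler maps onto RVV/SVE lanes, so longer vector lengths process more output pixels per instruction.

#include <stddef.h>

/* Expand an input feature map (C x H x W) into a matrix of shape
 * (C*KH*KW) x (OH*OW), so that the convolution becomes one GEMM.
 * Assumes stride 1 and no padding for brevity. */
void im2col(const float *in, int C, int H, int W,
            int KH, int KW, float *col)
{
    int OH = H - KH + 1, OW = W - KW + 1;
    for (int c = 0; c < C; c++)
        for (int kh = 0; kh < KH; kh++)
            for (int kw = 0; kw < KW; kw++) {
                int row = (c * KH + kh) * KW + kw;
                for (int oh = 0; oh < OH; oh++)
                    for (int ow = 0; ow < OW; ow++)
                        col[(size_t)row * OH * OW + (size_t)oh * OW + ow] =
                            in[(size_t)c * H * W + (size_t)(oh + kh) * W + (ow + kw)];
            }
}

/* Naive GEMM: out (M x N) = weights (M x K) * col (K x N), where
 * M = number of filters, K = C*KH*KW, N = OH*OW. */
void gemm(const float *a, const float *b, float *out,
          int M, int K, int N)
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += a[(size_t)m * K + k] * b[(size_t)k * N + n];
            out[(size_t)m * N + n] = acc;
        }
}

Because the N dimension (OH*OW output pixels) is contiguous after im2col, streaming it through vector registers is straightforward, which is why cache capacity and vector length interact so strongly for this kernel.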