Hardware acceleration of dilated and transposed convolution enables real-time execution of related tasks such as segmentation, but current designs are either specific to these convolution types or, when reconfigurable, suffer from complex control. This paper presents a design that decomposes the input for dilated convolution and the weights for transposed convolution to skip redundant computations, so that both also execute efficiently on existing dense CNN hardware. For ENet, the proposed architecture cuts the cycle count by 87.8\%, an 8.2X speedup over naive execution.
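The decomposition idea can be illustrated with a minimal NumPy sketch. It is not the paper's hardware architecture: it assumes single-channel 2-D inputs, square kernels, no padding, and a dense "valid" correlation as the only available primitive, and all function names are hypothetical. For dilation rate d, the input is split into d*d phases that each go through an ordinary dense convolution; for transposed convolution with stride s, the kernel is split into s*s sub-kernels that each produce one output phase.

```python
import numpy as np

def dense_conv2d(x, w):
    """Plain 'valid' dense correlation: the only primitive assumed available."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow))
    for a in range(kh):
        for b in range(kw):
            y += w[a, b] * x[a:a + oh, b:b + ow]
    return y

def dilated_naive(x, w, d):
    """Reference: zero-insert the kernel, then run one large dense correlation."""
    k = w.shape[0]
    wz = np.zeros((d * (k - 1) + 1, d * (k - 1) + 1))
    wz[::d, ::d] = w  # most multiplies in this kernel are by zero
    return dense_conv2d(x, wz)

def dilated_decomposed(x, w, d):
    """Split the input into d*d phases; each phase is a small dense correlation."""
    k = w.shape[0]
    y = np.zeros((x.shape[0] - d * (k - 1), x.shape[1] - d * (k - 1)))
    for p in range(d):
        for q in range(d):
            y[p::d, q::d] = dense_conv2d(x[p::d, q::d], w)
    return y

def transposed_reference(x, w, s):
    """Reference by definition: each input pixel scatters a scaled kernel copy."""
    k = w.shape[0]
    h, wid = x.shape
    y = np.zeros(((h - 1) * s + k, (wid - 1) * s + k))
    for m in range(h):
        for n in range(wid):
            y[m * s:m * s + k, n * s:n * s + k] += x[m, n] * w
    return y

def transposed_decomposed(x, w, s):
    """Split the kernel into s*s sub-kernels; each yields one output phase."""
    k = w.shape[0]
    y = np.zeros(((x.shape[0] - 1) * s + k, (x.shape[1] - 1) * s + k))
    for p in range(s):
        for q in range(s):
            wp = w[p::s, q::s]
            if wp.size == 0:
                continue  # this output phase receives no contributions
            # Full convolution = zero-pad, then correlate with the flipped kernel.
            pad_h, pad_w = wp.shape[0] - 1, wp.shape[1] - 1
            xp = np.pad(x, ((pad_h, pad_h), (pad_w, pad_w)))
            y[p::s, q::s] = dense_conv2d(xp, wp[::-1, ::-1])
    return y
```

In this sketch the naive dilated path performs (d(k-1)+1)^2 multiplies per output while the decomposed path performs k^2, which is the kind of redundancy the abstract refers to. The cycle-count figures quoted above come from the paper's own evaluation, not from this code.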