Convolutional Neural Networks (CNNs) are widely used in deep learning applications such as vision systems and robotics. However, software implementations are often inefficient, so many hardware accelerators have been proposed to optimize the performance, power, and resource utilization of CNN inference. Among existing solutions, Field Programmable Gate Array (FPGA) based architectures provide better cost-energy-performance trade-offs, along with scalability and short development time. In this paper, we present a model-independent, reconfigurable co-processing architecture to accelerate CNNs. Our architecture consists of parallel Multiply-and-Accumulate (MAC) units with caching techniques and interconnection networks to exploit maximum data parallelism. In contrast to existing solutions, we adopt a limited-precision 32-bit Q-format fixed-point representation for arithmetic operations. As a result, our architecture achieves a significant reduction in resource utilization with competitive accuracy. Furthermore, we developed assembly-type microinstructions to access the co-processing fabric and manage layer-wise parallelism, thereby reusing limited hardware resources. Finally, we tested our architecture with kernel sizes up to 9x9 on a Xilinx Virtex-7 FPGA, achieving a throughput of up to 226.2 GOp/s for a 3x3 kernel.