Despite the increasing adoption of Field-Programmable Gate Arrays (FPGAs) in compute clouds, there remains a significant gap in programming tools and abstractions which can leverage network-connected, cloud-scale, multi-die FPGAs to generate accelerators with high frequency and throughput. To this end, we propose TAPA-CS, a task-parallel dataflow programming framework which automatically partitions and compiles a large design across a cluster of FPGAs with no additional user effort while achieving high frequency and throughput. TAPA-CS has three main contributions. First, it is an open-source framework which allows users to leverage virtually "unlimited" accelerator fabric, high-bandwidth memory (HBM), and on-chip memory, by abstracting away the underlying hardware. This reduces the user's programming burden to a logical one, enabling software developers and researchers with limited FPGA domain knowledge to deploy larger designs than possible earlier. Second, given as input a large design, TAPA-CS automatically partitions the design to map to multiple FPGAs, while ensuring congestion control, resource balancing, and overlapping of communication and computation. Third, TAPA-CS couples coarse-grained floorplanning with automated interconnect pipelining at the inter- and intra-FPGA levels to ensure high frequency. We have tested TAPA-CS on our multi-FPGA testbed where the FPGAs communicate through a high-speed 100Gbps Ethernet infrastructure. We have evaluated the performance and scalability of our tool on designs, including systolic-array based convolutional neural networks (CNNs), graph processing workloads such as page rank, stencil applications like the Dilate kernel, and K-nearest neighbors (KNN). TAPA-CS has the potential to accelerate development of increasingly complex and large designs on the low power and reconfigurable FPGAs.
翻译:暂无翻译