Field-programmable gate arrays (FPGAs) can implement algorithm-specific architectures that improve a code's FLOP-per-watt ratio. These devices are regaining interest thanks to new tools that simplify their programming, such as OmpSs. The computational fluid dynamics community continually investigates new architectures that can improve the performance of its algorithms. Those algorithms commonly have a low arithmetic intensity and reach only a small percentage of peak performance. Sparse matrix-vector multiplication is one of the most time-consuming operations in unstructured simulations. The matrix's sparsity pattern determines the indirect memory accesses into the multiplying vector. This access pattern is hard to predict, which causes traditional implementations to perform poorly. In this work, we present an FPGA architecture that maximizes reuse of the vector by introducing a cache-like structure. The cache is implemented as a circular list that keeps vector components in BRAM for as long as they are needed. With this strategy, a speedup of up to 16x is obtained over a naive implementation of the algorithm.
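To make the access pattern concrete, the following is a minimal software sketch, not the FPGA design itself. It assumes the matrix is stored in CSR format (the abstract does not name a storage format) and models the cache as a fixed-size circular buffer that overwrites its oldest slot on a miss. The function names, the `capacity` parameter, and the oldest-first replacement policy are all illustrative assumptions.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Reference CSR sparse matrix-vector product y = A @ x."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # col_idx[k] is the indirect, data-dependent access into x.
            y[row] += values[k] * x[col_idx[k]]
    return y


def spmv_csr_circular_cache(row_ptr, col_idx, values, x, capacity):
    """Same product, with reads of x routed through a circular buffer.

    Returns (y, hits, misses). The buffer stands in for on-chip BRAM;
    `capacity` and oldest-first replacement are assumptions of this sketch.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    slots = [None] * capacity  # column index currently held in each slot
    data = [0.0] * capacity    # cached component of x for each slot
    where = {}                 # column index -> slot holding it
    head = 0                   # next slot to overwrite
    hits = misses = 0
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            col = col_idx[k]
            slot = where.get(col)
            if slot is None:
                # Miss: fetch x[col] from "external memory" into the head slot.
                misses += 1
                evicted = slots[head]
                if evicted is not None:
                    del where[evicted]
                slots[head] = col
                data[head] = x[col]
                where[col] = head
                slot = head
                head = (head + 1) % capacity
            else:
                hits += 1
            y[row] += values[k] * data[slot]
    return y, hits, misses
```

Calling `spmv_csr_circular_cache(row_ptr, col_idx, values, x, capacity=3)` on a small banded matrix returns the same `y` as `spmv_csr`, together with hit and miss counts. When neighbouring rows touch overlapping columns, most accesses are served from the buffer, which is the kind of reuse the abstract describes.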