Dedicated accelerator hardware has become essential for AI workloads, which has driven the rise of novel accelerator architectures. Fundamental differences in memory architecture and parallelism also make these accelerators attractive targets for scientific computing. Sequence alignment is a fundamental problem in bioinformatics; we have implemented the $X$-Drop algorithm, a heuristic method for pairwise alignment that reduces the search space, on the Graphcore Intelligence Processing Unit (IPU) accelerator. The $X$-Drop algorithm has an irregular computational pattern, which makes it difficult to accelerate because of load imbalance. We introduce graph-based partitioning and a queue-based batch system to improve load balancing. Our implementation achieves a $10\times$ speedup over a state-of-the-art GPU implementation and up to $4.65\times$ over a CPU implementation. In addition, we introduce a memory-restricted $X$-Drop algorithm that reduces the memory footprint by $55\times$ and makes efficient use of the IPU's limited low-latency SRAM. This optimization further improves strong-scaling performance by $3.6\times$.
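To make the irregular computational pattern concrete, the following is a minimal Python sketch of an $X$-Drop extension. It is not the IPU implementation described above, and it is not the memory-restricted variant. The function name `xdrop_extend`, the linear scoring scheme (match `+1`, mismatch `-1`, gap `-1`), and the row-wise traversal are assumptions chosen for illustration. A cell is kept only if its score is within $X$ of the best score seen so far, so the number of cells evaluated depends on the input pair, which is why per-alignment work is hard to predict.

```python
NEG_INF = float("-inf")


def xdrop_extend(a, b, x, match=1, mismatch=-1, gap=-1):
    """Extend an alignment from the start of a and b with X-Drop pruning.

    Returns (best_score, cells_evaluated). Scoring values are illustrative
    assumptions, not taken from the paper.
    """
    m, n = len(a), len(b)
    best = 0
    cells = 1  # the origin cell (0, 0)

    # Row 0: only leading gaps in a; stop once the score drops more than x.
    prev = [NEG_INF] * (n + 1)
    prev[0] = 0
    lo, hi = 0, 0  # first and last live column in the previous row
    for j in range(1, n + 1):
        cells += 1
        s = prev[j - 1] + gap
        if s < best - x:
            break
        prev[j] = s
        hi = j

    for i in range(1, m + 1):
        cur = [NEG_INF] * (n + 1)
        new_lo = new_hi = None
        j = lo
        while j <= n:
            cells += 1
            s = prev[j] + gap  # gap in b (vertical move)
            if j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                s = max(s, prev[j - 1] + sub, cur[j - 1] + gap)
            if s >= best - x:
                cur[j] = s  # cell survives the X-Drop test
                best = max(best, s)
                if new_lo is None:
                    new_lo = j
                new_hi = j
            elif j > hi:
                # Dead cell past the previous band: nothing further is reachable.
                break
            j += 1
        if new_lo is None:
            break  # every cell in this row was pruned; extension terminates
        prev, lo, hi = cur, new_lo, new_hi

    return best, cells


if __name__ == "__main__":
    print(xdrop_extend("ACGTACGT", "ACGTACGT", x=2))
    print(xdrop_extend("AAAAAAAA", "CCCCCCCC", x=2))
```

In this sketch, two similar sequences keep a band of live cells around the diagonal until the end, while two dissimilar sequences terminate after a few rows. A batch of such alignments therefore carries very uneven amounts of work, which is the load-balancing problem the partitioning and batching scheme above addresses.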