Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior works have shown that dynamic parallelism may impose a high performance penalty when a large number of small grids are launched. The large number of launches results in high launch latency due to congestion, and the small grid sizes result in hardware underutilization. To address this issue, we propose a compiler framework for optimizing the use of dynamic parallelism in applications with nested parallelism. The framework features three key optimizations: thresholding, coarsening, and aggregation. Thresholding involves launching a grid dynamically only if the number of child threads exceeds some threshold, and serializing the child threads in the parent thread otherwise. Coarsening involves executing the work of multiple thread blocks by a single coarsened block to amortize the common work across them. Aggregation involves combining multiple child grids into a single aggregated grid. Our evaluation shows that our compiler framework improves the performance of applications with nested parallelism by a geometric mean of 43.0x over applications that use dynamic parallelism, 8.7x over applications that do not use dynamic parallelism, and 3.6x over applications that use dynamic parallelism with aggregation alone as proposed in prior work.
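The thresholding idea described above can be sketched in CUDA as follows. This is a minimal illustrative example, not the paper's implementation: the kernel names, the fixed `THRESHOLD` constant, and the data layout are all assumptions made for exposition. A parent thread launches a child grid from the device only when its nested work is large enough to justify the launch overhead, and otherwise processes the work serially itself.

```cuda
// Illustrative sketch of thresholding with CUDA dynamic parallelism.
// Compile with relocatable device code, e.g.:
//   nvcc -rdc=true threshold_sketch.cu -o threshold_sketch
// All names here (child_kernel, process_serially, THRESHOLD) are
// hypothetical; a real compiler framework would choose the threshold
// automatically rather than hard-coding it.

#define THRESHOLD 128  // assumed cutoff on the number of child threads

__device__ void do_work(const int* work, int i) {
    // ... per-element child work would go here ...
}

// Child grid: one thread per element of the nested work.
__global__ void child_kernel(const int* work, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        do_work(work, i);
    }
}

// Fallback: the parent thread serializes the child work itself.
__device__ void process_serially(const int* work, int n) {
    for (int i = 0; i < n; ++i) {
        do_work(work, i);
    }
}

// Parent kernel: each thread owns one nested-work item of size counts[t].
__global__ void parent_kernel(int* const* nested_work, const int* counts,
                              int num_items) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= num_items) return;

    int n = counts[t];
    if (n > THRESHOLD) {
        // Enough nested work: pay the launch cost and use a child grid.
        child_kernel<<<(n + 255) / 256, 256>>>(nested_work[t], n);
    } else {
        // Too little work: a device-side launch would suffer launch
        // latency and leave blocks underutilized, so serialize instead.
        process_serially(nested_work[t], n);
    }
}
```

The same structure is the natural place for the other two optimizations: coarsening would shrink the child grid so one block iterates over several blocks' worth of work, and aggregation would have parent threads deposit their `(work, n)` pairs into a shared buffer so a single combined child grid is launched for many parents.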