We present Atos, a task-parallel GPU dynamic scheduling framework that is especially suited to dynamic irregular applications. Compared to the dominant Bulk Synchronous Parallel (BSP) frameworks, Atos exposes additional concurrency by supporting task-parallel formulations of applications with relaxed dependencies. This yields higher GPU utilization, a gain that is particularly significant for problems with concurrency bottlenecks. Atos also offers implicit task-parallel load balancing in addition to data-parallel load balancing, giving users the flexibility to trade off between the two for optimal performance. In addition, Atos lets users adapt to different use cases by controlling the kernel strategy and the task-parallel granularity. We demonstrate that each of these controls is important in practice. We evaluate and analyze the performance of Atos versus BSP on three applications: breadth-first search, PageRank, and graph coloring. Compared to a state-of-the-art BSP GPU implementation, Atos implementations achieve geomean speedups of 3.44x, 2.1x, and 2.77x and peak speedups of 12.8x, 3.2x, and 9.08x on these three applications, respectively. Beyond simply quantifying the speedups, we extensively analyze the reasons behind each one. This deeper understanding allows us to derive general guidelines for selecting the optimal Atos configuration for different applications. Finally, our analysis provides insights for future dynamic scheduling framework designs.
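To make the contrast between BSP and a task-parallel formulation with relaxed dependencies concrete, the following is a minimal CPU-side sketch in Python. It is not Atos code and does not use any Atos API: the function names `bfs_bsp` and `bfs_relaxed`, the `pop_from_back` flag, and the sequential worklist are all assumptions made for illustration. The BSP version advances one frontier per superstep with a global barrier between levels. The relaxed version processes vertex tasks in whatever order the worklist yields them and corrects a depth whenever a shorter path is found later, which removes the barrier at the cost of possible redundant work.

```python
from collections import deque


def bfs_bsp(adj, source):
    """Level-synchronous BFS: one global barrier per frontier (superstep)."""
    depth = {source: 0}
    frontier = [source]
    supersteps = 0
    while frontier:
        next_frontier = []
        for u in frontier:  # on a GPU this loop would be data-parallel
            for v in adj[u]:
                if v not in depth:
                    depth[v] = depth[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier  # stands in for the global barrier
        supersteps += 1
    return depth, supersteps


def bfs_relaxed(adj, source, pop_from_back=False):
    """Worklist BFS with relaxed ordering (hypothetical sketch).

    Tasks may run out of level order, so a vertex can be assigned a
    too-large depth first and be re-enqueued once a shorter path appears.
    """
    depth = {source: 0}
    worklist = deque([source])
    tasks = 0
    while worklist:
        # pop_from_back=True simulates an unfavorable task order
        u = worklist.pop() if pop_from_back else worklist.popleft()
        tasks += 1
        for v in adj[u]:
            new_depth = depth[u] + 1
            if new_depth < depth.get(v, float("inf")):
                depth[v] = new_depth
                worklist.append(v)  # v may be processed more than once
    return depth, tasks
```

Both functions return the same depths for any task order; what changes is the number of barriers in the first and the number of tasks executed in the second. A real GPU scheduler would run many tasks concurrently and use atomic updates for the depth array, which this sequential sketch leaves out.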