Rapid growth in scientific data and a widening gap between computational speed and I/O bandwidth make it increasingly infeasible to store and share all data produced by scientific simulations. Instead, we need methods for reducing data volumes: ideally, methods that can scale data volumes adaptively so as to enable negotiation of performance and fidelity tradeoffs in different situations. Multigrid-based hierarchical data representations hold promise as a solution to this problem, allowing flexible conversion between different fidelities so that, for example, data can be created at high fidelity and then transferred or stored at lower fidelity via logically simple and mathematically sound operations. Until now, however, the effective use of such representations has been hindered by the relatively high costs of creating, accessing, reducing, and otherwise operating on them. We describe here highly optimized data refactoring kernels for GPU accelerators that enable efficient creation and manipulation of data in multigrid-based hierarchical forms. We demonstrate that our optimized design can achieve up to 264 TB/s of aggregate data refactoring throughput -- 92% of theoretical peak -- on 1024 nodes of the Summit supercomputer. We showcase our optimized design by applying it to a large-scale scientific visualization workflow and the MGARD lossy compression software.