As Graph Neural Networks (GNNs) increase in popularity for scientific machine learning, their training and inference efficiency is becoming increasingly critical. Additionally, the deep learning field as a whole is trending towards wider and deeper networks and ever-increasing data sizes, to the point where hardware bottlenecks are frequently encountered. Emerging specialty hardware platforms provide an exciting solution to this problem. In this paper, we systematically profile and select low-level operations pertinent to GNNs for scientific computing, as implemented in the PyTorch Geometric software framework. These operations are then rigorously benchmarked on NVIDIA A100 GPUs across various combinations of input values, including tensor sparsity. We then analyze these results for each operation. At a high level, we conclude that on NVIDIA systems: (1) confounding bottlenecks such as memory inefficiency often dominate runtime costs more than data sparsity alone, (2) native PyTorch operations often perform as well as or better than their PyTorch Geometric equivalents, especially at low to moderate levels of input data sparsity, and (3) many operations central to state-of-the-art GNN architectures have little to no optimization for sparsity. We hope that these results serve as a baseline for those developing these operations on specialized hardware and that our subsequent analysis helps to facilitate future software- and hardware-based optimizations of these operations, and thus scalable GNN performance as a whole.
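To make the comparison in conclusion (2) concrete, the following is a minimal benchmarking sketch, not the paper's actual harness or configurations: it times a scatter-sum (the message-aggregation primitive at the core of most GNN layers) via native PyTorch `index_add_` against PyTorch Geometric's `scatter` wrapper. It assumes a CUDA device and PyG ≥ 2.3, where `torch_geometric.utils.scatter` is available; the tensor sizes are illustrative placeholders.

```python
import time
import torch
from torch_geometric.utils import scatter  # available in PyG >= 2.3

def bench(fn, *args, iters=100):
    """Average wall-clock time per call, synchronizing so GPU kernel time is measured."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

device = torch.device('cuda')
# Illustrative sizes only -- not the benchmark configurations from the paper.
num_nodes, num_edges, feat = 100_000, 1_000_000, 64
src = torch.randn(num_edges, feat, device=device)               # per-edge messages
index = torch.randint(num_nodes, (num_edges,), device=device)   # destination node of each edge

def torch_scatter_sum(src, index):
    # Native PyTorch: accumulate messages into a zero buffer with index_add_.
    out = torch.zeros(num_nodes, feat, device=device)
    return out.index_add_(0, index, src)

def pyg_scatter_sum(src, index):
    # PyTorch Geometric's equivalent aggregation primitive.
    return scatter(src, index, dim=0, dim_size=num_nodes, reduce='sum')

print('native torch:', bench(torch_scatter_sum, src, index))
print('pyg scatter :', bench(pyg_scatter_sum, src, index))
```

Sweeping the effective sparsity (e.g., the ratio of `num_edges` to `num_nodes**2`) in a harness like this is one way to reproduce the kind of comparison the paper reports.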