Graph neural networks (GNNs) have recently emerged as a promising paradigm for learning on graph-structured data and have demonstrated wide success across various domains such as recommendation systems, social networks, and electronic design automation (EDA). Like other deep learning (DL) methods, GNNs are being deployed on sophisticated modern hardware systems, as well as dedicated accelerators. However, despite the popularity of GNNs and the recent efforts to bring GNNs to hardware, the fault tolerance and resilience of GNNs have generally been overlooked. Inspired by the inherent algorithmic resilience of DL methods, this paper conducts, for the first time, a large-scale empirical study of GNN resilience, aiming to understand the relationship between hardware faults and GNN accuracy. By developing a customized fault injection tool on top of PyTorch, we perform extensive fault injection experiments on various GNN models and application datasets. We observe that the error resilience of GNN models varies by orders of magnitude across different models and application datasets. Further, we explore a low-cost error mitigation mechanism for GNNs to enhance their resilience. This GNN resilience study aims to open up new directions and opportunities for future GNN accelerator design and architectural optimization.
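The paper's fault injection tool itself is built on PyTorch and is not reproduced here; the following is a minimal, framework-agnostic sketch of the underlying technique it describes, injecting single-bit flips into float32 model weights to emulate hardware faults. The function names `flip_bit` and `inject_faults` are illustrative, not from the paper.

```python
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0-31) of a float32 value, emulating a single-bit hardware fault."""
    # Reinterpret the float's IEEE-754 bit pattern as a 32-bit unsigned integer.
    (packed,) = struct.unpack("<I", struct.pack("<f", value))
    packed ^= 1 << bit  # toggle the chosen bit
    # Reinterpret the corrupted bit pattern back as a float32.
    (faulty,) = struct.unpack("<f", struct.pack("<I", packed))
    return faulty

def inject_faults(weights, n_faults, rng=random):
    """Return a copy of a flat weight list with n_faults random single-bit flips."""
    faulty = list(weights)
    for _ in range(n_faults):
        idx = rng.randrange(len(faulty))  # pick a random weight
        bit = rng.randrange(32)           # pick a random bit position
        faulty[idx] = flip_bit(faulty[idx], bit)
    return faulty
```

Flipping a high-order exponent bit can change a weight by orders of magnitude while a low-order mantissa flip is nearly invisible, which is one reason resilience varies so widely across models: it depends on where faults land and how sensitive each layer is to perturbed weights.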