Graph neural networks (GNNs) have shown high potential for a variety of challenging real-world applications, but one of the major obstacles in GNN research is the lack of large-scale, flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled data. This makes it difficult to determine whether a GNN model's low accuracy on unseen data is due to insufficient training data or to the model's failure to generalize. Additionally, datasets used to train GNNs need to be flexible enough to enable a thorough study of how various factors affect GNN training. In this work, we introduce the Illinois Graph Benchmark (IGB), a research dataset tool that developers can use to train, scrutinize, and systematically evaluate GNN models with high fidelity. IGB includes both homogeneous and heterogeneous graphs of enormous sizes, with more than 40% of their nodes labeled. Compared to the largest graph datasets publicly available, IGB provides over 162X more labeled data for deep learning practitioners and developers to create and evaluate models with higher accuracy. The IGB dataset is designed to be flexible, enabling the study of various GNN architectures, embedding generation techniques, and system performance issues. IGB is open-sourced, supports the DGL and PyG frameworks, and comes with releases of the raw text, which we believe will foster emerging language model and GNN research projects. An early public version of IGB is available at https://github.com/IllinoisGraphBenchmark/IGB-Datasets.