KG20C & KG20C-QA: Scholarly Knowledge Graph Benchmarks for Link Prediction and Question Answering

From arXiv. Extracted and extended from the first author's PhD thesis, "Multi-Relational Embedding for Knowledge Graph Representation and Analysis".

In this paper, we present KG20C and KG20C-QA, two curated datasets for advancing question answering (QA) research on scholarly data. KG20C is a high-quality scholarly knowledge graph constructed from the Microsoft Academic Graph through targeted selection of venues, quality-based filtering, and schema definition. Although KG20C has been available online in non-peer-reviewed sources such as a GitHub repository, this paper provides the first formal, peer-reviewed description of the dataset, including clear documentation of its construction and specifications. KG20C-QA is built upon KG20C to support QA tasks on scholarly data. We define a set of QA templates that convert graph triples into natural language question-answer pairs, producing a benchmark that can be used both with graph-based models such as knowledge graph embeddings and with text-based models such as large language models. We benchmark standard knowledge graph embedding methods on KG20C-QA, analyze performance across relation types, and provide reproducible evaluation protocols. By officially releasing these datasets with thorough documentation, we aim to contribute a reusable, extensible resource for the research community, enabling future work in QA, reasoning, and knowledge-driven applications in the scholarly domain. The full datasets will be released at https://github.com/tranhungnghiep/KG20C/ upon paper publication.
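To make the template idea concrete, the following is a minimal sketch of how graph triples might be turned into question-answer pairs. It is not the paper's implementation: the relation names, the template wording, and the helper `triples_to_qa` are all hypothetical, chosen only for illustration. The sketch groups triples by head entity and relation, so one question can carry several correct answers.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Hypothetical templates: one question pattern per relation type.
# The actual KG20C-QA templates may use different relations and wording.
TEMPLATES: Dict[str, str] = {
    "author_write_paper": "Which papers did {head} write?",
    "paper_in_venue": "In which venue was {head} published?",
}


def triples_to_qa(triples: List[Triple]) -> List[Tuple[str, List[str]]]:
    """Convert triples into (question, answers) pairs using TEMPLATES.

    Triples sharing the same head and relation are merged into one
    question whose answer list holds all matching tails. Triples whose
    relation has no template are skipped.
    """
    grouped: Dict[Tuple[str, str], List[str]] = {}
    for head, relation, tail in triples:
        if relation not in TEMPLATES:
            continue  # no template defined for this relation
        grouped.setdefault((head, relation), []).append(tail)
    return [
        (TEMPLATES[relation].format(head=head), answers)
        for (head, relation), answers in grouped.items()
    ]


if __name__ == "__main__":
    # Toy triples with placeholder entity IDs (not taken from KG20C).
    example = [
        ("author_1", "author_write_paper", "paper_1"),
        ("author_1", "author_write_paper", "paper_2"),
        ("paper_1", "paper_in_venue", "venue_1"),
    ]
    for question, answers in triples_to_qa(example):
        print(question, answers)
```

Because each question is derived from a (head, relation) pair, the same data can be posed to a graph-based model as a tail-prediction query and to a text-based model as a natural language question.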
