Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models

Open-set classification methods, which classify in-distribution (ID) data while detecting out-of-distribution (OOD) samples, are essential for deploying graph neural networks (GNNs) in open-world scenarios. Existing methods typically treat all OOD samples as a single class, even though real-world applications, especially high-stakes settings such as fraud detection and medical diagnosis, demand deeper insight into OOD samples, including their probable labels. This raises a critical question: can OOD detection be extended to OOD classification without true label information? To address this question, we propose a Coarse-to-Fine open-set Classification (CFC) framework that leverages large language models (LLMs) for graph datasets. CFC consists of three key components: a coarse classifier that uses LLM prompts for OOD detection and outlier label generation; a GNN-based fine classifier, trained with the OOD samples identified by the coarse classifier, for enhanced OOD detection and ID classification; and a refinement stage that performs OOD classification using LLM prompts and post-processed OOD labels. Unlike methods that rely on synthetic or auxiliary OOD samples, CFC employs semantic OOD instances, which are out-of-distribution by virtue of their inherent meaning, improving interpretability and practical utility. Experimental results show that CFC improves OOD detection by 10% over state-of-the-art methods on graph and text domains and achieves up to 70% accuracy in OOD classification on graph datasets.
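The three components can be pictured as a small pipeline. The following is a minimal sketch under loud assumptions, not the authors' implementation: `stub_llm_label` is a hypothetical keyword lookup standing in for an LLM prompt, one round of mean neighbor aggregation plus a nearest-centroid rule stands in for the GNN-based fine classifier, and the class names, node texts, features, and edges are invented toy data.

```python
from collections import Counter, defaultdict

ID_CLASSES = ["databases", "networking"]  # hypothetical known (ID) classes


def stub_llm_label(text):
    """Hypothetical stand-in for an LLM prompt that proposes a topic label."""
    keywords = {"sql": "databases", "index": "databases",
                "router": "networking", "packet": "networking",
                "protein": "Biology", "genome": "biology "}
    for word, label in keywords.items():
        if word in text.lower():
            return label
    return "unknown"


def coarse_classify(texts):
    """Component 1: flag a node as OOD when its proposed label is not an ID class."""
    results = []
    for text in texts:
        label = stub_llm_label(text)
        results.append((label not in ID_CLASSES, label))
    return results


def aggregate(features, edges):
    """One round of mean aggregation over self and neighbors (crude GNN stand-in)."""
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    smoothed = []
    for i, x in enumerate(features):
        group = [x] + [features[j] for j in neighbors[i]]
        smoothed.append([sum(col) / len(group) for col in zip(*group)])
    return smoothed


def fit_centroids(features, labels):
    """Component 2: fit a (K+1)-way nearest-centroid model; 'OOD' is the extra class."""
    sums, counts = {}, Counter(labels)
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))


def refine_ood_labels(raw_labels):
    """Component 3: post-process free-form OOD labels (here: trim and lowercase)."""
    return [label.strip().lower() for label in raw_labels]


# Toy graph: six nodes with text, 2-d features, and three edges (all invented).
texts = ["SQL query planning", "B-tree index tuning", "router firmware",
         "packet loss analysis", "protein folding", "genome assembly"]
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
            [0.1, 0.9], [-1.0, -1.0], [-0.9, -1.1]]
edges = [(0, 1), (2, 3), (4, 5)]
known_labels = {0: "databases", 2: "networking"}  # labeled ID training nodes

# Component 1 runs on the unlabeled nodes and collects detected OOD samples.
unlabeled = [i for i in range(len(texts)) if i not in known_labels]
coarse = dict(zip(unlabeled, coarse_classify([texts[i] for i in unlabeled])))
ood_nodes = [i for i, (is_ood, _) in coarse.items() if is_ood]

# Component 2 trains on labeled ID nodes plus the detected OOD nodes.
train_labels = dict(known_labels)
train_labels.update({i: "OOD" for i in ood_nodes})
smoothed = aggregate(features, edges)
train_ids = sorted(train_labels)
centroids = fit_centroids([smoothed[i] for i in train_ids],
                          [train_labels[i] for i in train_ids])
predictions = [predict(centroids, x) for x in smoothed]

# Component 3 assigns cleaned-up labels to nodes the fine model calls OOD.
# Every such node here was also flagged in component 1, so coarse[i] exists.
final_ood = [i for i, p in enumerate(predictions) if p == "OOD"]
ood_labels = refine_ood_labels([coarse[i][1] for i in final_ood])

print(predictions)
print(ood_labels)
```

The point of the sketch is the data flow: OOD samples found by the coarse stage become training data for an extra class in the fine stage, and the free-form labels proposed for them are normalized only at the end. A real system would replace the stub with actual LLM calls and the centroid rule with a trained GNN.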
