Modern vision-language models (VLMs) deliver impressive predictive accuracy yet offer little insight into why a decision is reached, and they frequently hallucinate facts, particularly when encountering out-of-distribution data. Neurosymbolic frameworks address this by pairing black-box perception with interpretable symbolic reasoning, but current methods extract their symbols solely from task labels, leaving them weakly grounded in the underlying visual data. In this paper, we introduce Concept-RuleNet, a multi-agent system that reinstates visual grounding while retaining transparent reasoning. Specifically, a multimodal concept generator first mines discriminative visual concepts directly from a representative subset of training images. These visual concepts then condition symbol discovery, anchoring the generated symbols in real image statistics and mitigating label bias. Next, a large language model reasoner agent composes the symbols into executable first-order rules, yielding interpretable neurosymbolic rules. Finally, at inference time, a vision verifier agent quantifies the degree to which each symbol is present and triggers rule execution in tandem with the outputs of black-box neural models, producing predictions with explicit reasoning pathways. Experiments on five benchmarks, including two challenging medical-imaging tasks and three underrepresented natural-image datasets, show that our system outperforms state-of-the-art neurosymbolic baselines by an average of 5% while also reducing the occurrence of hallucinated symbols in rules by up to 50%.
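The following is a minimal sketch, not the authors' released implementation, of the inference-time flow the abstract describes: a vision verifier scores the degree of presence of each symbol, first-order rules fire over those scores, and the rule verdicts are fused with a black-box classifier's prediction. All names (`Rule`, `verify_symbols`, `apply_rules`, `fuse`, `score_symbol`, `alpha`) are illustrative placeholders, not identifiers from the paper.

```python
# Hypothetical sketch of the Concept-RuleNet inference pipeline described above.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    """An executable first-order rule: a predicate over symbol scores implying a class label."""
    name: str
    predicate: Callable[[Dict[str, float]], bool]
    label: str

def verify_symbols(image, symbols: List[str], score_symbol) -> Dict[str, float]:
    # Vision-verifier agent: degree of presence in [0, 1] for each discovered symbol.
    return {s: score_symbol(image, s) for s in symbols}

def apply_rules(scores: Dict[str, float], rules: List[Rule]) -> Dict[str, int]:
    # Symbolic reasoning step: count how many rules fire for each class label.
    votes: Dict[str, int] = {}
    for r in rules:
        if r.predicate(scores):
            votes[r.label] = votes.get(r.label, 0) + 1
    return votes

def fuse(neural_probs: Dict[str, float], votes: Dict[str, int], alpha: float = 0.5) -> str:
    # Simple late fusion of black-box probabilities with symbolic votes (one possible choice).
    total = max(sum(votes.values()), 1)
    combined = {
        c: alpha * neural_probs.get(c, 0.0) + (1 - alpha) * votes.get(c, 0) / total
        for c in set(neural_probs) | set(votes)
    }
    return max(combined, key=combined.get)
```

The fired rules in `apply_rules` double as the explicit reasoning pathway for each prediction, while `fuse` illustrates one way the symbolic verdicts could be combined with the neural model's output.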