Hallucination, defined here as generating statements that are unsupported by, or contradicted by, the available evidence or conversational context, remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat all unverifiable content as error, limiting their usefulness for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and the dialogue history, and categorizes unverifiable statements as subjective, contradicted, lacking evidence, or abstaining. Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA's decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
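To make the turn-level pipeline described above concrete, the sketch below shows one minimal way to structure the decompose-verify-categorize loop. It is illustrative only: the function names, label set aggregation, and scoring rule are assumptions for exposition, not the paper's actual prompts, verifier, or metric definition.

```python
# Illustrative sketch of a VISTA-style turn evaluation loop.
# All names (evaluate_turn, turn_factuality_score, Label, ...) are hypothetical;
# the decompose/verify callables stand in for the LLM-based components.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Label(Enum):
    SUPPORTED = "supported"
    SUBJECTIVE = "subjective"
    CONTRADICTED = "contradicted"
    NO_EVIDENCE = "lacking_evidence"
    ABSTAINING = "abstaining"


@dataclass
class ClaimVerdict:
    claim: str
    label: Label


def evaluate_turn(
    turn_text: str,
    evidence: List[str],
    dialogue_history: List[str],
    decompose: Callable[[str], List[str]],      # e.g. an LLM prompt yielding atomic claims
    verify: Callable[[str, List[str]], Label],  # labels one claim against evidence + history
) -> List[ClaimVerdict]:
    """Decompose an assistant turn into atomic claims and label each one."""
    context = evidence + dialogue_history
    return [ClaimVerdict(c, verify(c, context)) for c in decompose(turn_text)]


def turn_factuality_score(verdicts: List[ClaimVerdict]) -> float:
    """One plausible aggregation: fraction of checkable claims that are supported."""
    checkable = [v for v in verdicts if v.label not in {Label.SUBJECTIVE, Label.ABSTAINING}]
    if not checkable:
        return 1.0
    return sum(v.label is Label.SUPPORTED for v in checkable) / len(checkable)
```

Under this reading, subjective and abstaining claims are excluded from the denominator rather than penalized, which is one way to avoid treating unverifiable content as error; the paper's exact aggregation may differ.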