Variance-Bounded Evaluation of Entity-Centric AI Systems Without Ground Truth: Theory and Measurement

Reliable evaluation of AI systems remains a fundamental challenge when ground truth labels are unavailable, particularly for systems that generate natural language outputs, such as AI chat and agent systems. Many of these systems focus on entity-centric tasks. In enterprise contexts, organizations deploy AI systems for entity linking, data integration, and information retrieval, where verification against gold standards is often infeasible because of proprietary data constraints. Academic deployments face similar challenges when evaluating AI systems on specialized datasets with ambiguous criteria. Conventional evaluation frameworks, rooted in supervised learning paradigms, fail in such scenarios because no single correct answer can be defined. We introduce VB-Score, a variance-bounded evaluation framework for entity-centric AI systems that operates without ground truth by jointly measuring effectiveness and robustness. Given system inputs, VB-Score enumerates plausible interpretations through constraint relaxation and Monte Carlo sampling, assigning each interpretation a probability that reflects its likelihood. It then evaluates system outputs by their expected success across interpretations, penalized by the variance of that success to capture robustness. We provide formal theoretical analysis establishing key properties, including range, monotonicity, and stability, along with concentration bounds for Monte Carlo estimation. Through case studies on AI systems with ambiguous inputs, we demonstrate that VB-Score reveals robustness differences hidden by conventional evaluation frameworks, offering a principled measurement framework for assessing AI system reliability in label-scarce domains.
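The abstract does not state the scoring formula, so the following is only a minimal sketch of the general idea (expected success across sampled interpretations, minus a variance penalty), not the paper's definition of VB-Score. The function name `vb_score_sketch`, the penalty weight `lam`, the linear form `mean - lam * variance`, the binary success function `covers`, and the toy "Jordan" interpretations are all assumptions made for illustration.

```python
import random


def vb_score_sketch(interpretations, success_fn, output,
                    lam=1.0, n_samples=1000, seed=0):
    """Illustrative variance-penalized score (assumed form, not the paper's).

    interpretations: list of (interpretation, probability) pairs.
    success_fn(output, interpretation): returns a value in [0, 1].
    Returns (score, mean, variance) estimated by Monte Carlo sampling.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("this sketch assumes 0 <= lam <= 1")
    rng = random.Random(seed)
    items = [interp for interp, _ in interpretations]
    weights = [prob for _, prob in interpretations]

    # Monte Carlo step: sample interpretations according to their probabilities.
    draws = rng.choices(items, weights=weights, k=n_samples)
    successes = [float(success_fn(output, interp)) for interp in draws]

    mean = sum(successes) / n_samples
    variance = sum((s - mean) ** 2 for s in successes) / n_samples

    # For successes in [0, 1], variance <= mean * (1 - mean) <= mean,
    # so with 0 <= lam <= 1 the score stays within [0, 1].
    return mean - lam * variance, mean, variance


# Hypothetical ambiguous input "Jordan" with two equally likely readings.
interps = [("country", 0.5), ("athlete", 0.5)]


def covers(output, interpretation):
    """Hypothetical success check: does the output address this reading?"""
    return 1.0 if interpretation in output else 0.0


robust = vb_score_sketch(interps, covers, {"country", "athlete"})
brittle = vb_score_sketch(interps, covers, {"country"})
```

In this toy setup, an output that addresses both readings succeeds on every sampled interpretation and incurs no variance penalty. An output that addresses only one reading has both a lower mean and a nonzero variance, so its score drops further than its mean success alone would suggest.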

