Reliable evaluation of AI systems remains a fundamental challenge when ground-truth labels are unavailable, particularly for systems that generate natural language outputs, such as AI chat and agent systems. Many such agents and systems focus on entity-centric tasks. In enterprise contexts, organizations deploy AI systems for entity linking, data integration, and information retrieval, where verification against gold standards is often infeasible due to proprietary data constraints. Academic deployments face similar challenges when evaluating AI systems on specialized datasets with ambiguous criteria. Conventional evaluation frameworks, rooted in supervised learning paradigms, fail in such scenarios because no single correct answer can be defined. We introduce VB-Score, a variance-bounded evaluation framework for entity-centric AI systems that operates without ground truth by jointly measuring effectiveness and robustness. Given a system's inputs, VB-Score enumerates plausible interpretations through constraint relaxation and Monte Carlo sampling, assigning each a probability that reflects its likelihood. It then evaluates system outputs by their expected success across interpretations, penalized by variance to capture the system's robustness. We provide a formal theoretical analysis establishing key properties, including range, monotonicity, and stability, along with concentration bounds for Monte Carlo estimation. Through case studies on AI systems with ambiguous inputs, we demonstrate that VB-Score reveals robustness differences hidden by conventional evaluation frameworks, offering a principled measurement framework for assessing AI system reliability in label-scarce domains.
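The scoring idea in the abstract, expected success over a distribution of plausible interpretations, penalized by its spread, can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the penalty weight `lam`, the use of standard deviation as the penalty, and the `success_fn` interface are assumptions for the sake of the example.

```python
import random

def vb_score(interpretations, success_fn, lam=1.0, n_samples=10_000, seed=0):
    """Monte Carlo sketch of a variance-bounded score.

    interpretations: list of (probability, interpretation) pairs
    success_fn: maps an interpretation to a success value in [0, 1]
    lam: variance-penalty weight (hypothetical parameter, not from the paper)
    """
    rng = random.Random(seed)
    probs = [p for p, _ in interpretations]
    items = [i for _, i in interpretations]
    # Sample interpretations in proportion to their assigned probabilities.
    draws = rng.choices(items, weights=probs, k=n_samples)
    successes = [success_fn(i) for i in draws]
    # Expected success across sampled interpretations.
    mean = sum(successes) / n_samples
    # Variance of success, used here as a robustness penalty.
    var = sum((s - mean) ** 2 for s in successes) / n_samples
    return mean - lam * var ** 0.5

# Example: a system that succeeds on one interpretation but not the other
# scores lower than a uniformly successful one, despite equal coverage.
interps = [(0.5, "person:J_Smith"), (0.5, "org:J_Smith_LLC")]
robust = vb_score(interps, lambda i: 1.0)
brittle = vb_score(interps, lambda i: 1.0 if i.startswith("person") else 0.0)
```

Here `robust` evaluates to exactly 1.0 (zero variance), while `brittle` is pulled below its 0.5 expected success by the variance penalty, which is the robustness gap the framework is designed to surface.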