We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluate the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating model performance across four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by having them answer closed-book factoid questions from their internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents and features significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is the average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained and contains both public and private splits, allowing external participation while safeguarding the benchmark's integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
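As a minimal sketch of the aggregation described above, assuming an unweighted mean of the four sub-leaderboard scores (the symbols below are illustrative names, not notation from the paper):

$$
S_{\text{FACTS}} \;=\; \tfrac{1}{4}\left(S_{\text{Multimodal}} + S_{\text{Parametric}} + S_{\text{Search}} + S_{\text{Grounding}}\right)
$$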