We introduce the CRASS (counterfactual reasoning assessment) data set and benchmark utilizing questionized counterfactual conditionals as a novel and powerful tool to evaluate large language models. We present the data set design and benchmark as well as the accompanying API that supports scoring against a crowd-validated human baseline. We test six state-of-the-art models against our benchmark. Our results show that it poses a valid challenge for these models and opens up considerable room for their improvement.