A Theoretically Grounded Benchmark for Evaluating Machine Commonsense

Programming machines with commonsense reasoning (CSR) abilities is a longstanding challenge in the Artificial Intelligence community. Current CSR benchmarks use multiple-choice (and, in relatively fewer cases, generative) question-answering instances to evaluate machine commonsense. Recent progress in transformer-based language representation models suggests that considerable progress has been made on existing benchmarks. However, although tens of CSR benchmarks currently exist, and their number is growing, it is not evident that the full suite of commonsense capabilities has been systematically evaluated. Furthermore, there are doubts about whether language models are 'fitting' to a benchmark dataset's training partition by picking up on subtle, but normatively irrelevant (at least for CSR), statistical features to achieve good performance on the testing partition. To address these challenges, we propose a benchmark called Theoretically-Grounded Commonsense Reasoning (TG-CSR) that is also based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense, such as space, time, and world states. TG-CSR is based on a subset of commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs. The benchmark is also designed to be few-shot (and in the future, zero-shot), with only a few training and validation examples provided. This report discusses the structure and construction of the benchmark. Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question-answering tasks.

Benchmark access and leaderboard: https://codalab.lisn.upsaclay.fr/competitions/3080
Benchmark website: https://usc-isi-i2.github.io/TGCSR/
