We present FOLIO, a human-annotated, open-domain, logically complex and diverse dataset for reasoning in natural language (NL), equipped with first-order logic (FOL) annotations. FOLIO consists of 1,435 examples (unique conclusions), each paired with one of 487 sets of premises that serve as rules for deductively reasoning about the validity of the conclusion. The logical correctness of premises and conclusions is ensured by their parallel FOL annotations, which are automatically verified by our FOL inference engine. In addition to the main NL reasoning task, the NL-FOL pairs in FOLIO automatically constitute a new NL-FOL translation dataset using FOL as the logical form. Our experiments on FOLIO systematically evaluate the FOL reasoning ability of supervised fine-tuning on medium-sized language models (BERT, RoBERTa) and few-shot prompting on large language models (GPT-NeoX, OPT, GPT-3, Codex). For NL-FOL translation, we experiment with GPT-3 and Codex. Our results show that one of the most capable large language models (LLMs) publicly available, GPT-3 davinci, achieves only slightly better than random performance with few-shot prompting on a subset of FOLIO, and the model is especially poor at predicting the correct truth values for False and Unknown conclusions. Our dataset and code are available at https://github.com/Yale-LILY/FOLIO.
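As a toy illustration of the three-way labeling the abstract describes (True / False / Unknown), the sketch below classifies a conclusion against a premise set by exhaustive model enumeration. This is a minimal propositional stand-in, not the paper's FOL inference engine; the example sentences and function names are hypothetical.

```python
from itertools import product

def label(premises, conclusion, atoms):
    """Classify a conclusion against premises by truth-table enumeration.

    Returns "True" if the conclusion holds in every model of the premises,
    "False" if it fails in every such model, and "Unknown" otherwise.
    Premises and the conclusion are predicates over a dict of atom values.
    """
    entailed_true = True   # conclusion holds in all models of the premises
    entailed_false = True  # conclusion fails in all models of the premises
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        if all(p(model) for p in premises):
            if conclusion(model):
                entailed_false = False
            else:
                entailed_true = False
    if entailed_true:
        return "True"
    if entailed_false:
        return "False"
    return "Unknown"

# Hypothetical mini-example: "If it rains, the grass is wet. It rains."
premises = [lambda m: (not m["rain"]) or m["wet"],  # rain -> wet
            lambda m: m["rain"]]                     # rain
atoms = ["rain", "wet"]
print(label(premises, lambda m: m["wet"], atoms))       # True
print(label(premises, lambda m: not m["wet"], atoms))   # False
print(label(premises[:1], lambda m: m["wet"], atoms))   # Unknown
```

Dropping the second premise makes "the grass is wet" no longer entailed or refuted, which is exactly the Unknown case that the abstract reports LLMs struggle with.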