Large language models (LLMs) have demonstrated strong performance in zero-shot reasoning tasks, including abductive reasoning, as reflected by their high scores on current benchmarks in this area. To truly test the limits of LLMs' abductive reasoning, however, a more challenging benchmark is needed. In this paper, we present such a benchmark, consisting of 191 long-form mystery stories, each approximately 1,200 words long and presented as a detective puzzle. Each puzzle is sourced from the "5 Minute Mystery" platform and includes a multiple-choice question for evaluation. Our results show that state-of-the-art GPT models perform significantly worse than human solvers on this benchmark, reaching 28\% accuracy compared to 47\% for humans. This indicates that a significant gap remains in the abductive reasoning abilities of LLMs and highlights the need for further research in this area. Our work provides a challenging benchmark for future studies on reasoning in language models and contributes to a better understanding of the limits of LLMs' abilities.