True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4


June 1, 2023


Maksym Del, Mark Fishel

From arXiv, 5 pages, to appear at *SEM

Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities, which is reflected in their performance on the current test tasks. This calls for a more challenging benchmark requiring highly advanced reasoning ability to be solved. In this paper, we introduce such a benchmark, consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles. Puzzles are sourced from the "5 Minute Mystery" platform and include a multiple-choice question for evaluation. Only 47% of humans solve a puzzle successfully on average, while the best human solvers achieve over 80% success rate. We show that GPT-3 models barely outperform random on this benchmark (with 28% accuracy) while state-of-the-art GPT-4 solves only 38% of puzzles. This indicates that there is still a significant gap in the deep reasoning abilities of LLMs and humans and highlights the need for further research in this area. Our work introduces a challenging benchmark for future studies on reasoning in language models and contributes to a better understanding of the limits of LLMs' abilities.
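The multiple-choice evaluation described above can be illustrated with a minimal Python sketch. It assumes a hypothetical record layout (`Puzzle`, with `options` and `answer` fields) and uses made-up toy data; the abstract does not specify the dataset format or the number of answer options per puzzle, so none of this should be read as the authors' evaluation code. The sketch computes accuracy and the expected accuracy of uniform random guessing, the baseline against which the reported model scores are compared.

```python
from dataclasses import dataclass


@dataclass
class Puzzle:
    # Hypothetical record layout; not taken from the paper.
    options: list[str]  # candidate answers for the multiple-choice question
    answer: int         # index of the correct option


def accuracy(puzzles: list[Puzzle], predictions: list[int]) -> float:
    """Fraction of puzzles whose predicted option index matches the answer."""
    if not puzzles or len(puzzles) != len(predictions):
        raise ValueError("need one prediction per puzzle, and at least one puzzle")
    correct = sum(1 for p, pred in zip(puzzles, predictions) if pred == p.answer)
    return correct / len(puzzles)


def random_baseline(puzzles: list[Puzzle]) -> float:
    """Expected accuracy of guessing uniformly at random on each puzzle."""
    if not puzzles:
        raise ValueError("need at least one puzzle")
    return sum(1 / len(p.options) for p in puzzles) / len(puzzles)


# Toy data for illustration only (not from the benchmark).
toy = [
    Puzzle(["A", "B", "C", "D"], 2),
    Puzzle(["A", "B", "C", "D"], 0),
    Puzzle(["A", "B", "C", "D", "E"], 4),
    Puzzle(["A", "B", "C", "D"], 1),
]
toy_predictions = [2, 1, 4, 3]

print(f"accuracy: {accuracy(toy, toy_predictions):.2f}")
print(f"random baseline: {random_baseline(toy):.4f}")
```

Averaging `1 / len(options)` per puzzle, rather than assuming a fixed number of options, keeps the baseline valid if puzzles differ in how many choices they offer.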


