Heaven-Sent or Hell-Bent? Benchmarking the Intelligence and Defectiveness of LLM Hallucinations

Hallucinations in large language models (LLMs) are commonly regarded as errors to be minimized. However, recent perspectives suggest that some hallucinations may encode creative or epistemically valuable content, a dimension that remains underquantified in the current literature. Existing hallucination detection methods focus primarily on factual consistency and struggle both to handle heterogeneous scientific tasks and to balance creativity with accuracy. To address these challenges, we propose HIC-Bench, a novel evaluation framework that categorizes hallucinations into Intelligent Hallucinations (IH) and Defective Hallucinations (DH), enabling systematic investigation of their interplay in LLM creativity. HIC-Bench has three core characteristics: (1) Structured IH/DH Assessment, using a multi-dimensional metric matrix that integrates Torrance Tests of Creative Thinking (TTCT) metrics (Originality, Feasibility, Value) with hallucination-specific dimensions (scientific plausibility, factual deviation); (2) Cross-Domain Applicability, spanning ten scientific domains with open-ended innovation tasks; and (3) Dynamic Prompt Optimization, leveraging the Dynamic Hallucination Prompt (DHP) to guide models toward creative and reliable outputs. The evaluation process employs multiple LLM judges, averaging their scores to mitigate bias, with human annotators verifying the IH/DH classifications. Experimental results reveal a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized. These insights position IH as a catalyst for creativity and reveal the ability of LLM hallucinations to drive scientific innovation. Additionally, HIC-Bench offers a valuable platform for advancing research into the creative intelligence of LLM hallucinations.
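The score-averaging step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dimension names follow the abstract, but the function name `average_judge_scores`, the judge identifiers, the 1-5 scale, and the example values are all assumptions made for illustration.

```python
# Hypothetical sketch of averaging per-dimension scores across several LLM judges.
# Dimension names come from the abstract; everything else is an assumption.
DIMENSIONS = (
    "originality",
    "feasibility",
    "value",
    "scientific_plausibility",
    "factual_deviation",
)


def average_judge_scores(scores_by_judge):
    """Return the mean score per dimension over all judges.

    scores_by_judge maps a judge name to a dict of dimension -> numeric score.
    """
    if not scores_by_judge:
        raise ValueError("at least one judge is required")
    averaged = {}
    for dim in DIMENSIONS:
        values = [judge_scores[dim] for judge_scores in scores_by_judge.values()]
        averaged[dim] = sum(values) / len(values)
    return averaged


# Made-up scores on an assumed 1-5 scale for a single model response.
example_scores = {
    "judge_a": {"originality": 4, "feasibility": 3, "value": 4,
                "scientific_plausibility": 3, "factual_deviation": 2},
    "judge_b": {"originality": 5, "feasibility": 3, "value": 3,
                "scientific_plausibility": 4, "factual_deviation": 2},
    "judge_c": {"originality": 3, "feasibility": 3, "value": 5,
                "scientific_plausibility": 2, "factual_deviation": 2},
}

averaged_scores = average_judge_scores(example_scores)
```

Averaging per dimension rather than collapsing everything into one number keeps the creativity-related and hallucination-specific scores separate, which is what a multi-dimensional metric matrix would need. How the averaged scores map onto an IH or DH label is not specified in the abstract and is left out here.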


