Hallucinations in large language models (LLMs) are commonly regarded as errors to be minimized. However, recent perspectives suggest that some hallucinations may encode creative or epistemically valuable content, a dimension that remains underquantified in the current literature. Existing hallucination detection methods focus primarily on factual consistency and struggle both to handle heterogeneous scientific tasks and to balance creativity with accuracy. To address these challenges, we propose HIC-Bench, a novel evaluation framework that categorizes hallucinations into Intelligent Hallucinations (IH) and Defective Hallucinations (DH), enabling systematic investigation of their interplay in LLM creativity. HIC-Bench features three core characteristics: (1) Structured IH/DH Assessment, using a multi-dimensional metric matrix that integrates Torrance Tests of Creative Thinking (TTCT) metrics (Originality, Feasibility, Value) with hallucination-specific dimensions (scientific plausibility, factual deviation); (2) Cross-Domain Applicability, spanning ten scientific domains with open-ended innovation tasks; and (3) Dynamic Prompt Optimization, leveraging the Dynamic Hallucination Prompt (DHP) to guide models toward creative and reliable outputs. The evaluation process employs multiple LLM judges, whose scores are averaged to mitigate bias, with human annotators verifying IH/DH classifications. Experimental results reveal a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized. These insights position IH as a catalyst for creativity and reveal the ability of LLM hallucinations to drive scientific innovation. Additionally, HIC-Bench offers a valuable platform for advancing research into the creative intelligence of LLM hallucinations.
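As a rough illustration of the judging step described above, the following minimal sketch averages per-dimension scores from several LLM judges. The dimension names come from the abstract; the judge names, the 1-to-5 scale, the example scores, and the function `average_judge_scores` are hypothetical and are not taken from HIC-Bench itself.

```python
from statistics import mean

# Dimension names follow the abstract. The numeric scale is an assumption.
DIMENSIONS = (
    "originality",
    "feasibility",
    "value",
    "scientific_plausibility",
    "factual_deviation",
)


def average_judge_scores(scores_by_judge):
    """Average each dimension over all judges (hypothetical helper).

    scores_by_judge maps a judge name to a dict of dimension -> score.
    """
    if not scores_by_judge:
        raise ValueError("at least one judge is required")
    return {
        dim: mean(judge[dim] for judge in scores_by_judge.values())
        for dim in DIMENSIONS
    }


# Made-up scores from three hypothetical judges for one model response.
example_scores = {
    "judge_a": {"originality": 4, "feasibility": 3, "value": 4,
                "scientific_plausibility": 3, "factual_deviation": 2},
    "judge_b": {"originality": 5, "feasibility": 3, "value": 3,
                "scientific_plausibility": 4, "factual_deviation": 2},
    "judge_c": {"originality": 3, "feasibility": 3, "value": 5,
                "scientific_plausibility": 2, "factual_deviation": 2},
}

averaged = average_judge_scores(example_scores)
for dim in DIMENSIONS:
    print(f"{dim}: {averaged[dim]:.2f}")
```

Averaging per dimension, rather than collapsing everything into one number, keeps the creativity-oriented and hallucination-specific dimensions separate, so a later IH/DH decision (verified by human annotators in the described setup) could still inspect each dimension on its own.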