Large language models are often described as capable of reflective reasoning, yet recursive self-evaluation without external feedback frequently yields reformulation rather than progress. We test this prediction in a cross-provider study of 144 reasoning sequences across three models (OpenAI GPT-4o-mini, Anthropic Claude 3 Haiku, and Google Gemini 2.0 Flash) and four task families (arithmetic, code, explanation, reflection), each iterated ten times under two conditions: ungrounded self-critique and a minimal grounding intervention (a single verification step at iteration three). Mean informational change (ΔI, measured via normalized edit distance) declined by 55% from early (0.193) to late (0.087) iterations in ungrounded runs, with consistent patterns across all three providers. Grounded runs showed a +28% rebound in informational change immediately after the intervention and sustained non-zero variance thereafter. Complementary measures (n-gram novelty, embedding drift, and character-level entropy) converged on the same pattern: reflection without contact tends toward informational closure. We interpret this as evidence for a structural limit on self-correction in generative reasoning: without an exchange of information with an independent verifier or environment, recursive inference approaches an attractor state of epistemic stasis. Minimal grounding functions as dissipative coupling, reintroducing informational flux. The cross-architecture consistency suggests the mirror loop arises from shared autoregressive training objectives rather than provider-specific alignment schemes. The results delineate when reflection is performative rather than epistemic and motivate design principles for grounded, cooperative reasoning. Materials and code are publicly available.
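The ΔI metric above is a normalized edit distance between successive iteration outputs. A minimal sketch of one plausible implementation, assuming character-level Levenshtein distance normalized by the length of the longer string (the paper's exact normalization and tokenization may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def delta_i(prev_out: str, next_out: str) -> float:
    """Normalized edit distance between two successive iteration outputs.

    0.0 means identical outputs (informational closure);
    1.0 means a complete rewrite.
    """
    longest = max(len(prev_out), len(next_out))
    return levenshtein(prev_out, next_out) / longest if longest else 0.0


# Hypothetical usage: two consecutive self-critique outputs.
a = "The answer is 42 because the sum telescopes."
b = "The answer is 42 since the sum telescopes."
print(f"delta_i = {delta_i(a, b):.3f}")
```

Under this definition, a late-iteration mean ΔI of 0.087 corresponds to successive outputs that differ in under 9% of their characters.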