Hidden confounding remains a fundamental challenge in causal inference from observational data. Recent advances leverage Large Language Models (LLMs) to generate plausible hidden confounders based on domain knowledge, yet a critical gap exists: LLM-generated confounders often exhibit semantic plausibility without statistical utility. We propose VIGOR+ (Variational Information Gain for iterative cOnfounder Refinement), a novel framework that closes the loop between LLM-based confounder generation and CEVAE-based statistical validation. Unlike prior approaches that treat generation and validation as separate stages, VIGOR+ establishes an iterative feedback mechanism: validation signals from CEVAE (including information gain, latent consistency metrics, and diagnostic messages) are transformed into natural language feedback that guides subsequent LLM generation rounds. This iterative refinement continues until convergence criteria are met. We formalize the feedback mechanism, prove convergence properties under mild assumptions, and provide a complete algorithmic framework.
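The generate-validate-refine loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `generate_confounders`, `validate`, and `to_feedback` are hypothetical stand-ins (the real system would call an LLM and fit a CEVAE, respectively), and the toy scores exist only to show how validation signals become natural-language feedback that drives the next generation round.

```python
# Minimal sketch of the iterative VIGOR+ loop. All function names and the
# scoring logic are illustrative assumptions, not the paper's actual API:
# generate_confounders stands in for the LLM call, and validate stands in
# for CEVAE-based statistical validation.

def generate_confounders(feedback):
    # Stand-in for the LLM: proposes candidate confounder names, optionally
    # conditioned on natural-language feedback from the previous round.
    base = ["socioeconomic_status", "baseline_health"]
    return base if feedback is None else base + ["access_to_care"]

def validate(candidates):
    # Stand-in for CEVAE validation: returns a toy information-gain score
    # per candidate (a real system would fit the model and measure gain).
    return {c: 0.1 * len(candidates) for c in candidates}

def to_feedback(scores, threshold=0.25):
    # Transform validation signals into natural-language feedback.
    weak = [c for c, gain in scores.items() if gain < threshold]
    if not weak:
        return None  # convergence criterion met: every candidate passes
    return f"Low information gain for: {', '.join(weak)}; propose alternatives."

def refine(max_rounds=5):
    feedback = None
    for round_idx in range(max_rounds):
        candidates = generate_confounders(feedback)
        scores = validate(candidates)
        feedback = to_feedback(scores)
        if feedback is None:
            return candidates, round_idx + 1  # converged
    return candidates, max_rounds  # round budget exhausted
```

In this toy run the first round's candidates score below the threshold, the feedback string triggers a second generation round, and the enlarged candidate set passes, so the loop converges after two rounds.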