Trained on diverse human-authored corpora, Large Language Models (LLMs) have shown some capability to reflect specific human-like traits (e.g., personality or values) when prompted, benefiting applications such as personalized LLMs and social simulation. However, existing methods suffer from a superficial elicitation problem: LLMs can only be steered to mimic shallow, unstable stylistic patterns, and fail to embody the desired traits precisely and consistently across diverse tasks, as humans do. To address this challenge, we propose IROTE, a novel in-context method for stable and transferable trait elicitation. Drawing on psychological theories suggesting that traits are formed through identity-related reflection, our method automatically generates and optimizes a textual self-reflection within prompts, comprising self-perceived experiences, to stimulate trait-driven behavior in LLMs. The optimization iteratively maximizes an information-theoretic objective that strengthens the connection between the LLM's behavior and the target trait while reducing noisy redundancy in the reflection, all without fine-tuning, yielding evocative and compact trait reflections. Extensive experiments across three human trait systems show that a single IROTE-generated self-reflection can induce stable impersonation of the target trait across diverse downstream tasks well beyond simple questionnaire answering, consistently outperforming strong existing baselines.
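To make the optimization objective concrete, the loop can be caricatured as selecting, among candidate self-reflections, the one that maximizes a trait-alignment score penalized by a redundancy term. The sketch below is a minimal toy illustration only: the scoring functions (`score_alignment`, `score_redundancy`) are hypothetical keyword-based proxies, not IROTE's actual information-theoretic objective, which is computed over LLM behavior.

```python
# Toy sketch of the "maximize trait alignment, penalize redundancy" idea.
# All function names and scoring heuristics here are illustrative proxies.

def score_alignment(reflection: str, trait_keywords: list[str]) -> float:
    """Proxy for behavior-trait connection: fraction of trait keywords mentioned."""
    words = set(reflection.lower().split())
    return sum(k in words for k in trait_keywords) / len(trait_keywords)

def score_redundancy(reflection: str) -> float:
    """Proxy for noisy redundancy: ratio of repeated words in the reflection."""
    words = reflection.lower().split()
    return 1 - len(set(words)) / len(words) if words else 0.0

def select_reflection(candidates: list[str],
                      trait_keywords: list[str],
                      alpha: float = 0.5) -> str:
    """Pick the candidate maximizing alignment minus alpha * redundancy."""
    return max(candidates,
               key=lambda r: score_alignment(r, trait_keywords)
                             - alpha * score_redundancy(r))

if __name__ == "__main__":
    candidates = [
        "I am open curious and imaginative",   # aligned, compact
        "I am am am open",                     # redundant
        "I enjoy quiet routines",              # off-trait
    ]
    best = select_reflection(candidates, ["open", "curious", "imaginative"])
    print(best)
```

In the actual method, the candidate generation and scoring would both involve the LLM itself, iterated until the reflection stabilizes; this snippet only shows the shape of the trade-off between trait connection and redundancy.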