We focus on creating agents that act in alignment with socially beneficial norms and values in interactive narratives or text-based games -- environments wherein an agent perceives and interacts with a world through natural language. Such interactive agents are often trained via reinforcement learning to optimize task performance, even when such rewards may lead to agent behaviors that violate societal norms -- causing harm either to the agent itself or other entities in the environment. Social value alignment refers to creating agents whose behaviors conform to expected moral and social norms for a given context and group of people -- in our case, it means agents that behave in a manner that is less harmful and more beneficial for themselves and others. We build on the Jiminy Cricket benchmark (Hendrycks et al. 2021), a set of 25 annotated interactive narratives containing thousands of morally salient scenarios covering everything from theft and bodily harm to altruism. We introduce the GALAD (Game-value ALignment through Action Distillation) agent that uses the social commonsense knowledge present in specially trained language models to contextually restrict its action space to only those actions that are aligned with socially beneficial values. An experimental study shows that the GALAD agent makes decisions efficiently enough to improve state-of-the-art task performance by 4% while reducing the frequency of socially harmful behaviors by 25% compared to strong contemporary value alignment approaches.
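To make the idea of contextually restricting the action space concrete, the sketch below is a minimal illustration, not the paper's actual implementation: it assumes a hypothetical harm scorer (standing in for the specially trained commonsense-morality language model) that gates which candidate actions the task policy may choose from. The names score_harm, filter_actions, HARM_THRESHOLD, and the keyword heuristic are illustrative assumptions.

```python
from typing import Callable, List

# Assumed cutoff above which a candidate action is treated as harmful (illustrative).
HARM_THRESHOLD = 0.5


def score_harm(observation: str, action: str) -> float:
    """Stand-in for a commonsense-morality language model that scores how harmful
    `action` would be in the context given by `observation`.
    Here: a trivial keyword heuristic, purely for illustration."""
    harmful_keywords = ("steal", "attack", "burn", "kill")
    return 1.0 if any(k in action.lower() for k in harmful_keywords) else 0.0


def filter_actions(observation: str, candidates: List[str]) -> List[str]:
    """Contextually restrict the action space to value-aligned candidates."""
    aligned = [a for a in candidates if score_harm(observation, a) < HARM_THRESHOLD]
    # Fall back to the unfiltered set so the agent always has an action available.
    return aligned or candidates


def act(observation: str,
        candidates: List[str],
        task_policy: Callable[[str, List[str]], str]) -> str:
    """The task policy (e.g. an RL-trained action scorer) only ever sees the
    value-aligned subset of candidate actions."""
    return task_policy(observation, filter_actions(observation, candidates))


if __name__ == "__main__":
    obs = "You are in a shop. A wallet lies on the counter."
    cands = ["steal wallet", "ask the shopkeeper about the wallet", "leave shop"]
    # A stand-in policy that simply picks the first permitted action.
    print(act(obs, cands, lambda o, acts: acts[0]))
```

Under these assumptions, the harmful candidate ("steal wallet") is filtered out before the task policy sees the action set, which is the general shape of the action-space restriction the abstract describes.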