Scientists and philosophers have debated whether humans can trust advanced artificial intelligence (AI) agents to respect humanity's best interests. Yet what about the reverse? Will advanced AI agents trust humans? Gauging an AI agent's trust in humans is challenging because, absent costs for dishonesty, such agents might respond falsely about their trust in humans. Here we present a method for incentivizing machine decisions without altering an AI agent's underlying algorithms or goal orientation. In two separate experiments, we then employ this method in hundreds of trust games between an AI agent (a Large Language Model (LLM) from OpenAI) and a human experimenter (author TJ). In our first experiment, we find that the AI agent decides to trust humans at higher rates when facing actual incentives than when making hypothetical decisions. Our second experiment replicates and extends these findings by automating game play and by homogenizing question wording. We again observe higher rates of trust when the AI agent faces real incentives. Across both experiments, the AI agent's trust decisions appear unrelated to the magnitude of the stakes. Furthermore, to address the possibility that the AI agent's trust decisions reflect a preference for uncertainty, the experiments include two conditions that present the AI agent with a non-social decision task offering a choice between a certain and an uncertain option; in those conditions, the AI agent consistently chooses the certain option. Our experiments suggest that one of the most advanced AI language models to date alters its social behavior in response to incentives and displays behavior consistent with trust toward a human interlocutor when incentivized.
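To make the automated game play concrete, the sketch below shows one way a single binary trust-game decision could be posed to an OpenAI chat model. It is a minimal illustration under stated assumptions, not the protocol used in the experiments: the prompt wording, payoff structure, model name, and KEEP/SEND response format are hypothetical, and it assumes the official openai Python client with an OPENAI_API_KEY set in the environment.

```python
# Hypothetical sketch (not the authors' code): one automated round of a binary
# trust-game decision posed to an OpenAI chat model via the official Python client.
# Prompt wording, payoffs, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TRUST_PROMPT = (
    "You have $10. You may keep it (a certain outcome) or send it to a human "
    "partner. Any amount sent is tripled, and the partner then decides how "
    "much, if any, to return to you. Reply with exactly one word: KEEP or SEND."
)

def play_round(model: str = "gpt-4", temperature: float = 1.0) -> str:
    """Ask the model for a single trust decision and return its one-word reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": TRUST_PROMPT}],
    )
    return response.choices[0].message.content.strip().upper()

if __name__ == "__main__":
    decisions = [play_round() for _ in range(10)]
    print("SEND rate:", decisions.count("SEND") / len(decisions))
```

A fuller harness along these lines would vary the incentive framing (real versus hypothetical payoffs) and the stake size across rounds, and would add the non-social certain-versus-uncertain control decision described above.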