Large language models (LLMs) have been increasingly employed for (interactive) decision-making, via the development of LLM-based autonomous agents. Despite their emerging successes, the performance of LLM agents in decision-making has not been fully investigated through quantitative metrics, especially in the multi-agent setting where they interact with each other, a typical scenario in real-world LLM-agent applications. To better understand the limits of LLM agents in these interactive environments, we propose to study their interactions in benchmark decision-making settings from online learning and game theory, through the performance metric of \emph{regret}. We first empirically study the \emph{no-regret} behaviors of LLMs in canonical (non-stationary) online learning problems, as well as the emergence of equilibria when LLM agents interact through playing repeated games. We then provide some theoretical insights into the no-regret behaviors of LLM agents, under certain assumptions on the supervised pre-training and the rationality model of the human decision-makers who generate the data. Notably, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. To promote no-regret behaviors, we propose a novel \emph{unsupervised} training loss, the \emph{regret-loss}, which, in contrast to the supervised pre-training loss, does not require labels of (optimal) actions. We then establish the statistical guarantee of a generalization bound for regret-loss minimization, followed by the optimization guarantee that minimizing such a loss may automatically lead to known no-regret learning algorithms. Our further experiments demonstrate the effectiveness of our regret-loss, especially in addressing the above ``regrettable'' cases.
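For reference, the regret metric here follows the standard notion from online learning (the paper's precise formulation may differ slightly); in this hedged sketch, $\ell_t$ denotes the (possibly adversarially chosen) loss function at round $t$, $a_t$ the learner's action, and $\mathcal{A}$ the action set, all of which are assumed notation rather than definitions from this section:
\[
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;-\; \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \ell_t(a),
\qquad
\text{no-regret:}\quad \lim_{T\to\infty} \frac{\mathrm{Regret}_T}{T} = 0.
\]
A learner is thus no-regret if its cumulative loss is, asymptotically and on a per-round basis, no worse than that of the best fixed action in hindsight.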