Understanding the Origin of Information-Seeking Exploration in Probabilistic Objectives for Control

From arXiv. 11-03-21: initial upload. 14-03-21: fix Charnov citation. 16-03-21: another fix. 25-06-21: more fixes plus numerical simulations. 30-06-21: minor fixes. 12-11-21: maths typo fix. 24-11-21: minor maths fixes.

The exploration-exploitation trade-off is central to the description of adaptive behaviour in fields ranging from machine learning to biology to economics. While many approaches have been taken, one approach to solving this trade-off has been to equip agents with, or propose that they possess, an intrinsic 'exploratory drive', which is often implemented in terms of maximizing the agent's information gain about the world, an approach which has been widely studied in machine learning and cognitive science. In this paper we mathematically investigate the nature and meaning of such approaches and demonstrate that this combination of utility-maximizing and information-seeking behaviour arises from the minimization of an entirely different class of objectives, which we call divergence objectives. We propose a dichotomy in the objective functions underlying adaptive behaviour between \emph{evidence} objectives, which correspond to well-known reward- or utility-maximizing objectives in the literature, and \emph{divergence} objectives, which instead seek to minimize the divergence between the agent's expected and desired futures. We argue that this new class of divergence objectives could form the mathematical foundation for a much richer understanding of the exploratory components of adaptive and intelligent action, beyond simply greedy utility maximization.
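The claim that a divergence objective combines utility maximization with information seeking can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the paper: it assumes discrete hidden states x and observations o, reads the "divergence" as a KL divergence between the predicted observation distribution p(o) and a desired distribution, and reads the evidence-style objective as the expected log desired probability. All function names and the example numbers are hypothetical. It relies on the standard identity KL[p(o) || desired(o)] = -E_p[ln desired(o)] - I(o; x) - H(o | x), so minimizing the divergence rewards both expected utility and expected information gain.

```python
import math

def marginal(prior, lik):
    """Predicted observations p(o) = sum_x p(x) p(o|x)."""
    return [sum(prior[x] * lik[x][o] for x in range(len(prior)))
            for o in range(len(lik[0]))]

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """KL[p || q] in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_log_desire(p_o, desired):
    """Evidence-style term (assumed form): E_{p(o)}[ln desired(o)]."""
    return sum(po * math.log(d) for po, d in zip(p_o, desired) if po > 0)

def info_gain(prior, lik):
    """Expected information gain: E_{p(o)} KL[p(x|o) || p(x)]."""
    p_o = marginal(prior, lik)
    total = 0.0
    for o, po in enumerate(p_o):
        if po == 0:
            continue
        posterior = [prior[x] * lik[x][o] / po for x in range(len(prior))]
        total += po * kl(posterior, prior)
    return total

def likelihood_entropy(prior, lik):
    """Expected observation noise: E_{p(x)} H[p(o|x)]."""
    return sum(px * entropy(row) for px, row in zip(prior, lik))

# Hypothetical toy numbers, chosen only for illustration.
prior = [0.5, 0.5]                 # p(x)
lik = [[0.9, 0.1], [0.2, 0.8]]     # p(o|x), one row per state x
desired = [0.7, 0.3]               # desired distribution over o

p_o = marginal(prior, lik)
divergence = kl(p_o, desired)
utility = expected_log_desire(p_o, desired)
gain = info_gain(prior, lik)
noise = likelihood_entropy(prior, lik)

print("divergence objective:", divergence)
print("-utility - info gain - likelihood entropy:", -utility - gain - noise)
```

The two printed values agree up to floating-point error. Under these assumptions, an agent that only maximizes `utility` has no incentive to seek informative observations, whereas one that minimizes `divergence` does, because the information-gain term enters with a negative sign.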
