A basic assumption of traditional reinforcement learning is that the value of a reward does not change once it is received by an agent. The present work forgoes this assumption and considers the situation where the value of a reward decays proportionally to the time elapsed since it was obtained. Emphasizing the inflection point occurring at the time of payment, we use the term asset to refer to a reward that is currently in the possession of an agent. Adopting this language, we initiate the study of depreciating assets within the framework of infinite-horizon quantitative optimization. In particular, we propose a notion of asset depreciation, inspired by classical exponential discounting, where the value of an asset is scaled by a fixed discount factor at each time step after it is obtained by the agent. We formulate a Bellman-style equational characterization of optimality in this context and develop a model-free reinforcement learning approach to obtain optimal policies.
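As a minimal illustration of the depreciation mechanism described above (the symbols $\lambda$, $r_t$, and $A_T$ here are ours, introduced for exposition rather than taken from the paper's notation): with discount factor $\lambda \in (0,1)$, a reward $r_t$ received at step $t$ is scaled by $\lambda$ at each subsequent step, so it contributes $\lambda^{T-t} r_t$ to the agent's holdings at a later step $T$. The total asset value at time $T$ is then
\[
  A_T \;=\; \sum_{t=0}^{T} \lambda^{T-t}\, r_t ,
\]
and an infinite-horizon objective may be taken over the sequence $(A_T)_{T \ge 0}$, for instance its limit superior; the precise objective studied is as defined in the paper.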