Many real-world settings involve costs for performing actions; transaction costs in financial systems and fuel costs are common examples. In these settings, acting at every time step quickly accumulates costs, leading to vastly suboptimal outcomes; moreover, repeated acting produces wear and tear and, ultimately, damage. Determining when to act is therefore crucial for achieving successful outcomes, yet the challenge of efficiently learning to behave optimally when each action incurs a cost bounded away from zero remains unresolved. In this paper, we introduce a reinforcement learning (RL) framework named Learnable Impulse Control Reinforcement Algorithm (LICRA) for learning to optimally select both when to act and which actions to take when actions incur costs. At the core of LICRA is a nested structure that combines RL with a form of policy known as impulse control, which learns to maximise objectives when actions incur costs. We prove that LICRA, which seamlessly adopts any RL method, converges to policies that optimally select when to perform actions as well as their magnitudes. We then augment LICRA to handle problems in which the agent can perform at most $k<\infty$ actions and, more generally, faces a budget constraint. We show that LICRA learns the optimal value function and ensures the budget constraint is satisfied almost surely. We demonstrate empirically LICRA's superior performance against benchmark RL methods in OpenAI Gym's Lunar Lander, in the Highway environment, and in a variant of the Merton portfolio problem from finance.
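To make the nested structure concrete, the following is a minimal sketch of the impulse-control value function underlying this kind of scheme; the notation ($a_0$ for the null "do nothing" action, $c$ for the per-action cost with lower bound $c_{\min}>0$, $\gamma$ for the discount factor, $P$ for the transition kernel) is our own illustrative assumption and is not fixed by the abstract:
\[
v(s) \;=\; \max\Big\{\,\underbrace{\max_{a\in\mathcal{A}}\big[R(s,a)-c(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}[v(s')]\big]}_{\text{intervene, paying at least } c_{\min}>0}\,,\;\underbrace{R(s,a_0)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a_0)}[v(s')]}_{\text{do nothing}}\,\Big\}.
\]
The outer $\max$ captures the "when to act" decision and the inner $\max_{a}$ the "which action" decision, matching the two levels of the nested structure. For the budgeted variant, one plausible construction is to augment the state with a remaining-budget counter $b_t$, with $b_{t+1}=b_t-\mathbf{1}\{a_t\neq a_0\}$, and to disallow intervention once $b_t=0$.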