Deployment efficiency is an important criterion for many real-world applications of reinforcement learning (RL). Despite the community's increasing interest, the problem still lacks a formal theoretical formulation. In this paper, we propose such a formulation for deployment-efficient RL (DE-RL) from an "optimization with constraints" perspective: we are interested in exploring an MDP and obtaining a near-optimal policy within minimal \emph{deployment complexity}, whereas within each deployment the policy can sample a large batch of data. Using finite-horizon linear MDPs as a concrete structural model, we reveal the fundamental limits of deployment efficiency by establishing information-theoretic lower bounds, and provide algorithms that achieve the optimal deployment efficiency. Moreover, our formulation of DE-RL is flexible and can serve as a building block for other practically relevant settings; we give "Safe DE-RL" and "Sample-Efficient DE-RL" as two examples, which may be worth future investigation.
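To make the "optimization with constraints" view concrete, the following is a minimal sketch of how deployment complexity may be formalized; the symbols $K$, $N$, $\epsilon$, $\hat{\pi}$, and the value notation $V^{\pi}$ are illustrative assumptions and not notation fixed by the abstract.
% Hypothetical formalization of DE-RL (notation assumed for illustration).
% An algorithm interacts with a finite-horizon MDP over K deployments; in the
% k-th deployment it commits to a policy \pi_k, collects a batch of up to N
% trajectories, and only then updates. The goal is to return an
% \epsilon-optimal policy using as few deployments as possible.
\begin{align*}
  \min_{\text{algorithm}} \; K
  \quad \text{s.t.} \quad
  V^{\hat{\pi}} \;\ge\; \max_{\pi} V^{\pi} - \epsilon ,
\end{align*}
where $\hat{\pi}$ is the policy returned after $K$ deployments, each deployment may gather a large batch of $N$ trajectories, and the \emph{deployment complexity} is the smallest $K$ for which such a guarantee can be achieved.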