In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted-reward formulation. As in other settings, learning an optimal policy here typically requires a large amount of training experience. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. To avoid manually constructing the shaping function, we introduce a method that utilizes domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up average-reward learning without reducing the performance of the learned policy relative to relevant baselines.
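To make the shaping idea concrete, below is a minimal sketch, not the paper's construction: potential-based shaping added to a tabular differential Q-learning agent on a toy continuing chain. The environment, the hand-crafted potential function phi, and all hyperparameters are illustrative assumptions; the paper instead derives the shaping function automatically from a temporal logic formula.

```python
# Minimal sketch (illustrative only): potential-based reward shaping applied to
# tabular differential Q-learning on a toy continuing MDP.
import numpy as np

n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

def step(s, a):
    """Toy continuing dynamics: action 1 moves right; the last state pays reward."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

def phi(s):
    """Hand-crafted potential encoding domain knowledge: larger nearer the goal."""
    return s / (n_states - 1)

Q = np.zeros((n_states, n_actions))
r_bar = 0.0                      # running estimate of the average reward
alpha, eta, eps = 0.1, 0.1, 0.1  # step sizes and exploration rate (assumed values)

s = 0
for t in range(20000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    shaped_r = r + phi(s_next) - phi(s)           # potential-based shaping term
    delta = shaped_r - r_bar + Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta                      # differential Q-learning update
    r_bar += eta * alpha * delta                  # average-reward estimate update
    s = s_next

print("Greedy policy:", Q.argmax(axis=1))
```

The shaping term phi(s_next) - phi(s) supplies intermediate reward on the way to the goal state, which is the kind of additional signal the shaping function in the paper provides throughout learning.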