The standard feedback model of reinforcement learning requires revealing the reward of every visited state-action pair. However, in practice, it is often the case that such frequent feedback is not available. In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as \emph{trajectory feedback}. Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory. We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret. For cases where the transition model is unknown, we offer a hybrid optimistic-Thompson Sampling approach that results in a tractable algorithm.
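To make the setting concrete, the following is a minimal sketch of the trajectory-feedback model and the associated least-squares reward estimate; the notation ($\tau_k$, $v_k$, $n_j(s,a)$, horizon $H$) is illustrative and not necessarily the paper's own. In episode $j$ the agent observes only the cumulative score of its trajectory,
\[
v_j \;=\; \sum_{h=1}^{H} r\big(s_h^{\,j}, a_h^{\,j}\big),
\]
rather than the individual rewards $r(s_h^{\,j}, a_h^{\,j})$. Writing $n_j(s,a)$ for the number of times the pair $(s,a)$ is visited in trajectory $\tau_j$, the unknown reward vector can then be estimated after $k$ episodes by least squares:
\[
\hat r_k \;\in\; \arg\min_{r \in \mathbb{R}^{SA}} \; \sum_{j=1}^{k} \Big( v_j \;-\; \sum_{s,a} n_j(s,a)\, r(s,a) \Big)^{2}.
\]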