In many, if not every, realistic sequential decision-making task, the decision-making agent is not able to model the full complexity of the world. The environment is often much larger and more complex than the agent, a setting also known as partial observability. In such settings, the agent must leverage more than just the current sensory inputs; it must construct an agent state that summarizes previous interactions with the world. Currently, a popular approach for tackling this problem is to learn the agent-state function with a recurrent network, taking the agent's sensory stream as input. Many impressive reinforcement learning applications have instead relied on environment-specific functions that augment the agent's inputs to aid history summarization. These augmentations take multiple forms, from simple approaches like concatenating observations to more complex ones such as uncertainty estimates. Although ubiquitous in the field, these additional inputs, which we term auxiliary inputs, are rarely emphasized, and it is not clear what their role or impact is. In this work we explore this idea further and relate these auxiliary inputs to prior classic approaches to state construction. We present a series of examples illustrating the different ways of using auxiliary inputs for reinforcement learning. We show that these auxiliary inputs can be used to discriminate between observations that would otherwise be aliased, leading to more expressive features that smoothly interpolate between different states. Finally, we show that this approach is complementary to state-of-the-art methods such as recurrent neural networks and truncated back-propagation through time, and acts as a heuristic that facilitates longer temporal credit assignment, leading to better performance.
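To make the idea of an auxiliary input concrete, below is a minimal sketch (not the paper's exact formulation) of one simple choice: an exponentially decaying trace of past observations, concatenated with the current observation to form the agent state. The function names, the decay parameter, and the toy observations are illustrative assumptions, not from the source.

```python
import numpy as np

def make_trace_agent_state(obs_dim, decay=0.9):
    """Return an agent-state function that augments each observation with an
    auxiliary input: an exponentially decaying trace of past observations."""
    trace = np.zeros(obs_dim)

    def agent_state(obs):
        nonlocal trace
        # The trace summarizes history: recent observations dominate,
        # older ones fade out according to the decay factor.
        trace = decay * trace + (1.0 - decay) * obs
        # Concatenating the auxiliary input with the current observation lets
        # the agent discriminate between otherwise-aliased observations.
        return np.concatenate([obs, trace])

    return agent_state

# Two visits to the same (aliased) observation yield different agent states,
# because the histories that led to them differ.
agent_state = make_trace_agent_state(obs_dim=2)
s1 = agent_state(np.array([1.0, 0.0]))   # first visit
agent_state(np.array([0.0, 1.0]))        # intervening observation
s2 = agent_state(np.array([1.0, 0.0]))   # same observation, different history
print(np.allclose(s1, s2))               # False: the aliasing is resolved
```

Because the trace changes smoothly as observations stream in, the resulting features also interpolate smoothly between states, which is one reading of the "more expressive features" claim above.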