In classic reinforcement learning algorithms, agents make decisions at discrete, fixed time intervals. The physical duration between one decision and the next becomes a critical hyperparameter. When this duration is too short, the agent must make many decisions to achieve its goal, increasing the difficulty of the problem. But when this duration is too long, the agent becomes incapable of controlling the system. Physical systems, however, do not require a constant control frequency. For learning agents, it is desirable to operate with low frequency when possible and high frequency when necessary. We propose a framework called Continuous-Time Continuous-Options (CTCO), in which the agent chooses options as sub-policies of variable duration. Such options are continuous in time and can interact with the system at any desired frequency, providing a smooth change of actions. Our empirical analysis shows that the algorithm is competitive with other time-abstraction techniques, such as classic option learning and action repetition, and in practice overcomes the difficult choice of decision frequency.
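To make the idea concrete, the sketch below shows how a CTCO-style control loop might look: the high-level policy is queried only at option boundaries, each option carries its own duration, and the low-level action is produced continuously in between. This is a minimal illustration, not the paper's implementation; the names (Option, select_option, env_step) are hypothetical, and a simple linear interpolation stands in for whatever smooth parameterization the method actually uses.

```python
# Illustrative sketch of a variable-duration option loop (assumed names, not the paper's API).
import numpy as np
from dataclasses import dataclass


@dataclass
class Option:
    """A continuous-time sub-policy: maps elapsed time within the option to an action."""
    duration: float           # how long this option runs, in seconds (chosen by the agent)
    start_action: np.ndarray  # low-level action at the start of the option
    end_action: np.ndarray    # low-level action at the end of the option

    def action(self, t: float) -> np.ndarray:
        # Smoothly interpolate between start and end actions over the option's duration,
        # so the command changes continuously regardless of the simulator timestep.
        alpha = np.clip(t / self.duration, 0.0, 1.0)
        return (1.0 - alpha) * self.start_action + alpha * self.end_action


def select_option(state: np.ndarray, action_dim: int) -> Option:
    # Placeholder high-level policy: in CTCO this would be a learned policy that
    # outputs both the option's parameters and its duration. Random values here.
    duration = float(np.random.uniform(0.05, 0.5))  # variable decision interval
    return Option(duration,
                  start_action=np.random.uniform(-1.0, 1.0, action_dim),
                  end_action=np.random.uniform(-1.0, 1.0, action_dim))


def run_episode(env_step, state, action_dim, dt=0.01, horizon=5.0):
    """Execute options back-to-back at the environment's native timestep `dt`."""
    t_global = 0.0
    while t_global < horizon:
        option = select_option(state, action_dim)  # a decision is made only here
        t_local = 0.0
        while t_local < option.duration and t_global < horizon:
            # High-frequency low-level control between (infrequent) decisions.
            state = env_step(state, option.action(t_local))
            t_local += dt
            t_global += dt
    return state
```

The key point the sketch conveys is the decoupling of the decision frequency (one call to select_option per option) from the control frequency (one env_step per dt), which is what lets the agent act with low frequency when possible and high frequency when necessary.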