Recent research has turned to Reinforcement Learning (RL) as an alternative to hand-tuned heuristics for solving challenging decision problems. RL can learn good policies without the need to model the environment's dynamics. Despite this promise, RL remains an impractical solution for many real-world systems problems. A particularly challenging case occurs when the environment changes over time, i.e., it exhibits non-stationarity. In this work, we characterize the challenges introduced by non-stationarity and develop a framework to address them when training RL agents in live systems. Such agents must explore and learn new environments without hurting the system's performance, and remember those environments over time. To this end, our framework (1) identifies the different environments encountered by the live system, (2) explores and trains a separate expert policy for each environment, and (3) employs safeguards to protect the system's performance. We apply our framework to two systems problems, straggler mitigation and adaptive video streaming, and evaluate it against a variety of alternative approaches using real-world and synthetic data. We show that each component of our framework is necessary to cope with non-stationarity.
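To make the three components of the framework concrete, the sketch below outlines one plausible control loop in Python. It is a minimal illustration under assumed interfaces: the detector, per-environment expert policies, and the safe fallback policy (and their methods `identify`, `is_trusted`, `act`, `update`) are hypothetical names, not the paper's actual implementation or API.

```python
# Hypothetical sketch of the control loop: (1) identify the current
# environment, (2) keep a separate expert policy per environment,
# (3) fall back to a safeguard policy to protect system performance.

class NonStationaryAgent:
    def __init__(self, detector, make_expert, safe_policy):
        self.detector = detector        # (1) identifies the current environment
        self.make_expert = make_expert  # factory that creates a fresh expert policy
        self.safe_policy = safe_policy  # (3) conservative fallback policy
        self.experts = {}               # (2) one expert policy per environment

    def act(self, observation):
        env_id = self.detector.identify(observation)
        expert = self.experts.setdefault(env_id, self.make_expert())
        # Safeguard: only act with an expert that has been trained enough
        # on this environment; otherwise use the safe fallback.
        if expert.is_trusted():
            return expert.act(observation)
        return self.safe_policy.act(observation)

    def update(self, observation, action, reward, next_observation):
        # Train only the expert responsible for the current environment,
        # so previously learned environments are not forgotten.
        env_id = self.detector.identify(observation)
        expert = self.experts.setdefault(env_id, self.make_expert())
        expert.update(observation, action, reward, next_observation)
```

The key design choice this sketch reflects is isolation: each environment gets its own expert, so learning a new environment does not overwrite policies learned for earlier ones, while the trust check gates untrained experts behind a safe default.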