Recent research has turned to Reinforcement Learning (RL) to solve challenging decision problems, as an alternative to hand-tuned heuristics. RL can learn good policies without needing to model the environment's dynamics. Despite this promise, RL remains an impractical solution for many real-world systems problems. A particularly challenging case occurs when the environment changes over time, i.e., it exhibits non-stationarity. In this work, we characterize the challenges introduced by non-stationarity, shed light on the range of approaches to them, and develop a robust framework for addressing them while training RL agents in live systems. Such agents must explore and learn new environments without hurting the system's performance, and remember those environments over time. To this end, our framework (i) identifies different environments encountered by the live system, (ii) triggers exploration when necessary, (iii) takes precautions to retain knowledge from prior environments, and (iv) employs safeguards to protect the system's performance when the RL agent makes mistakes. We apply our framework to two systems problems, straggler mitigation and adaptive video streaming, and evaluate it against a variety of alternative approaches using real-world and synthetic data. We show that all components of the framework are necessary to cope with non-stationarity and provide guidance on alternative design choices for each component.
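The sketch below is a minimal, hypothetical illustration of how the four framework components could fit together as a control loop; all names (NonStationaryRLFramework, detect_environment, safeguard, the fallback policy) are assumptions for illustration and are not the paper's implementation.

```python
# Illustrative skeleton only: every identifier here is hypothetical.
# It sketches (i) environment identification, (ii) exploration triggering,
# (iii) retaining per-environment knowledge, and (iv) a performance safeguard.
import random


class NonStationaryRLFramework:
    def __init__(self, fallback_policy):
        self.policies = {}                       # (iii) one retained policy per identified environment
        self.fallback_policy = fallback_policy   # (iv) safe heuristic used when the agent misbehaves
        self.current_env = None

    def detect_environment(self, observation):
        # (i) identify the environment, e.g. from recent workload statistics;
        # here a trivial placeholder that buckets a scalar feature.
        return int(observation // 10)

    def act(self, observation):
        env_id = self.detect_environment(observation)
        if env_id not in self.policies:
            # (ii) a newly seen environment triggers exploration with a fresh policy
            self.policies[env_id] = {"explore_rate": 1.0, "value": 0.0}
        self.current_env = env_id
        policy = self.policies[env_id]
        if random.random() < policy["explore_rate"]:
            action = random.choice([0, 1])        # exploratory action
        else:
            action = int(policy["value"] > 0)     # greedy action
        policy["explore_rate"] *= 0.99            # decay exploration as the environment is learned
        return action

    def safeguard(self, action, reward_estimate, threshold=-1.0):
        # (iv) override the agent when its expected performance is unacceptable
        if reward_estimate < threshold:
            return self.fallback_policy()
        return action


if __name__ == "__main__":
    agent = NonStationaryRLFramework(fallback_policy=lambda: 0)
    for step in range(5):
        obs = random.uniform(0, 30)
        a = agent.act(obs)
        a = agent.safeguard(a, reward_estimate=random.uniform(-2, 2))
        print(f"step={step} env={agent.current_env} action={a}")
```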