Deep reinforcement learning (RL) algorithms have succeeded in several challenging domains. Classic online RL job schedulers can learn efficient scheduling strategies, but they often take thousands of timesteps to explore the environment and adapt from a randomly initialized DNN policy. Existing RL schedulers also overlook the importance of learning from historical data and improving upon custom heuristic policies. Offline reinforcement learning offers the prospect of policy optimization from pre-recorded datasets without online environment interaction. Following the recent success of data-driven learning, we explore two such methods: 1) Behaviour Cloning and 2) Offline RL, both of which aim to learn policies from logged data without interacting with the environment. These methods address the cost of data collection and the safety concerns that are particularly pertinent to real-world applications of RL. Although the data-driven RL methods produce good results, we show that their performance is highly dependent on the quality of the historical dataset. Finally, we demonstrate that by effectively incorporating prior expert demonstrations to pre-train the agent, we short-circuit the random exploration phase and learn a reasonable policy with online training. We use Offline RL as a \textbf{launchpad} to learn effective scheduling policies from prior experience collected using Oracle or heuristic policies. Such a framework is effective for pre-training from historical datasets and is well suited to continual improvement with online data collection.
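To make the pre-training step concrete, the following is a minimal sketch (not the paper's implementation) of behaviour cloning on logged (state, action) pairs produced by an Oracle or heuristic scheduler; the network architecture, state/action dimensions, dataset, and hyperparameters are illustrative assumptions.

\begin{verbatim}
# Minimal behaviour-cloning sketch (illustrative; not the paper's implementation).
# Assumes logged scheduling decisions as (state, action) pairs; all shapes and
# hyperparameters below are placeholder assumptions.
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 32, 8           # assumed scheduler state/action sizes

# Policy network that will later warm-start the online RL agent.
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),           # logits over scheduling actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in for a dataset logged by an Oracle/heuristic policy.
states = torch.randn(1024, STATE_DIM)
actions = torch.randint(0, NUM_ACTIONS, (1024,))

for epoch in range(10):
    logits = policy(states)
    loss = loss_fn(logits, actions)       # imitate the logged actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The pre-trained `policy` can then initialize online training,
# short-circuiting the random-exploration phase described above.
\end{verbatim}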