In this paper, we investigate the combination of synthesis, model-based learning, and online sampling techniques to obtain safe and near-optimal schedulers for a preemptible task scheduling problem. Our algorithms can handle Markov decision processes (MDPs) with $10^{20}$ states and beyond, which cannot be handled by state-of-the-art probabilistic model checkers. We provide probably approximately correct (PAC) guarantees for learning the model. Additionally, we extend Monte-Carlo tree search with advice, computed using safety games or obtained from the earliest-deadline-first scheduler, to safely explore the learned model online. Finally, we implement and empirically compare our algorithms against shielded deep Q-learning on large task systems.
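To make the advice-guided search concrete, below is a minimal Python sketch (not the paper's implementation) of Monte-Carlo tree search in which an advice function prunes unsafe actions during selection, expansion, and rollout. The MDP interface (`actions`, `step`, `reward`, `is_terminal`) and the `advice_safe` predicate are hypothetical placeholders; in the setting described above, the advice would come from the winning region of a safety game or from the earliest-deadline-first scheduler.

```python
import math
import random

class Node:
    """Search-tree node; stochastic outcomes are simplified to one
    sampled successor per action, which suffices for a sketch."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}        # action -> Node
        self.visits = 0
        self.total_reward = 0.0

def mcts_with_advice(root_state, mdp, advice_safe,
                     iterations=1000, c=1.4, horizon=50):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # Selection/expansion: only actions the advice marks safe are considered.
        while not mdp.is_terminal(node.state):
            safe = [a for a in mdp.actions(node.state)
                    if advice_safe(node.state, a)]
            if not safe:  # advice from a safety game keeps this nonempty
                break
            untried = [a for a in safe if a not in node.children]
            if untried:
                a = random.choice(untried)
                node.children[a] = Node(mdp.step(node.state, a), parent=node)
                node = node.children[a]
                break
            # UCB1 restricted to safe children.
            node = node.children[max(
                safe,
                key=lambda a: node.children[a].total_reward / node.children[a].visits
                + c * math.sqrt(math.log(node.visits) / node.children[a].visits))]
        # Rollout: random simulation that also respects the advice.
        state, ret, depth = node.state, 0.0, 0
        while not mdp.is_terminal(state) and depth < horizon:
            safe = [a for a in mdp.actions(state) if advice_safe(state, a)]
            if not safe:
                break
            a = random.choice(safe)
            ret += mdp.reward(state, a)
            state = mdp.step(state, a)
            depth += 1
        # Backpropagation of the sampled return.
        while node is not None:
            node.visits += 1
            node.total_reward += ret
            node = node.parent
    # Recommend the most-visited safe action at the root.
    return max(root.children, key=lambda a: root.children[a].visits)
```

Restricting both the UCB selection and the rollouts to advised actions is what keeps the online exploration safe by construction, provided every reachable state retains at least one safe action.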