Multi-user delay-constrained scheduling is important in many real-world applications, including wireless communication, live streaming, and cloud computing. Yet it poses a critical challenge, since the scheduler must make real-time decisions that guarantee delay and resource constraints simultaneously, without prior knowledge of the system dynamics, which can be time-varying and hard to estimate. Moreover, many practical scenarios suffer from partial observability, e.g., due to sensing noise or hidden correlations. To tackle these challenges, we propose a deep reinforcement learning (DRL) algorithm, named Recurrent Softmax Delayed Deep Double Deterministic Policy Gradient ($\mathtt{RSD4}$), a data-driven method based on a Partially Observable Markov Decision Process (POMDP) formulation. $\mathtt{RSD4}$ guarantees resource and delay constraints via a Lagrangian dual and delay-sensitive queues, respectively. It also efficiently tackles partial observability with a memory mechanism enabled by a recurrent neural network (RNN), and introduces user-level decomposition and node-level merging to ensure scalability. Extensive experiments on simulated and real-world datasets demonstrate that $\mathtt{RSD4}$ is robust to system dynamics and partially observable environments, and achieves superior performance over existing DRL and non-DRL-based methods.
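To make the two mechanisms named above concrete, the following is a minimal sketch, not the paper's implementation: an LSTM-based recurrent critic (the standard way an RNN memory handles partial observability) together with a dual-ascent update on a Lagrange multiplier for an average-resource constraint. All names here (`RecurrentCritic`, `lagrangian_reward`, `dual_update`, `budget_C`, `dual_lr`) are hypothetical illustrations, and the dual-ascent scheme is the generic constrained-RL recipe, not necessarily the exact update used by $\mathtt{RSD4}$.

```python
# Minimal sketch (not the authors' code), assuming a per-step reward r_t,
# a per-step resource usage c_t, and a resource budget C.
import torch
import torch.nn as nn

class RecurrentCritic(nn.Module):
    """LSTM-based Q-network: the hidden state summarizes the observation
    history, which is how an RNN copes with partial observability."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs_seq, act_seq, state=None):
        # obs_seq: (B, T, obs_dim), act_seq: (B, T, act_dim)
        x = torch.cat([obs_seq, act_seq], dim=-1)
        out, state = self.rnn(x, state)
        return self.head(out), state  # per-step Q estimates and RNN state

# Lagrangian relaxation of the resource constraint E[c_t] <= C:
# maximize E[r_t] - lam * (E[c_t] - C), with dual ascent on lam.
lam = torch.zeros(1)          # Lagrange multiplier, kept non-negative
budget_C, dual_lr = 1.0, 1e-3  # hypothetical budget and dual step size

def lagrangian_reward(r_t, c_t):
    """Penalized reward fed to the (primal) policy/critic update."""
    return r_t - lam.item() * c_t

def dual_update(avg_resource_usage):
    """Dual ascent: lam grows while the budget is violated, shrinks
    (down to zero) once the constraint is satisfied."""
    global lam
    lam = torch.clamp(lam + dual_lr * (avg_resource_usage - budget_C),
                      min=0.0)

# Example: one dual step after observing average usage 1.2 > budget 1.0.
dual_update(1.2)
```

In this generic scheme, the primal step trains the policy against the penalized reward while the dual step tunes the penalty strength, so the resource constraint is enforced asymptotically without hand-picking a fixed penalty weight.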