Multicasting is an efficient technique for simultaneously transmitting common messages from a base station (BS) to multiple mobile users (MUs). The multicast scheduling problem over multiple channels, which jointly minimizes the energy consumption of the BS and the latency of serving asynchronous requests from the MUs, is formulated as an infinite-horizon Markov decision process (MDP) with a large discrete action space, multiple time-varying constraints, and multiple time-invariant constraints; this class of problems has not been efficiently solved in the literature. To address it, this paper proposes a novel algorithm, distribution-embedding multi-agent proximal policy optimization (DE-MAPPO), which consists of two modules: a modified MAPPO module and a distribution-embedding module. The former modifies MAPPO's offline training and online application mechanisms to handle the large discrete action space and the time-varying constraints, while the latter iteratively adjusts the action distribution to satisfy the time-invariant constraints. Moreover, as a benchmark, a performance upper bound of the considered MDP is derived by solving a two-step optimization problem. Numerical experiments show that the proposed algorithm achieves performance comparable to the derived benchmark in typical scenarios.
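To make the distribution-embedding idea concrete, the sketch below illustrates one plausible reading of "iteratively adjusting the action distribution to satisfy a time-invariant constraint": a categorical policy distribution is repeatedly reweighted until its expected cost falls within a fixed budget. This is a minimal illustration under assumed simplifications, not the paper's actual module; the function `embed_distribution` and the single linear expected-cost constraint (with hypothetical `cost` and `budget` parameters) are placeholders, whereas the paper's module operates on MAPPO's learned policy and may enforce a different constraint set.

```python
import numpy as np

def embed_distribution(probs, cost, budget, step=0.5, tol=1e-6, max_iter=200):
    """Hypothetical sketch: iteratively reweight a categorical action
    distribution so its expected cost stays within a time-invariant budget."""
    p = np.asarray(probs, dtype=float)
    c = np.asarray(cost, dtype=float)
    p = p / p.sum()
    for _ in range(max_iter):
        excess = float(p @ c) - budget   # positive => constraint violated
        if excess <= tol:
            break
        # Exponentially down-weight costly actions, then renormalize.
        p = p * np.exp(-step * excess * c)
        p = p / p.sum()
    return p

# Example: a 4-action policy whose expected cost must not exceed 1.0.
adjusted = embed_distribution(probs=[0.1, 0.2, 0.3, 0.4],
                              cost=[0.5, 1.0, 1.5, 2.0],
                              budget=1.0)
```

At each step, actions whose costs drive the violation are exponentially suppressed before renormalization, so the adjusted distribution remains a valid policy while drifting toward the feasible set; the multiplicative-update form is one common choice for this kind of iterative projection.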