The explosive growth of dynamic and heterogeneous data traffic poses great challenges to 5G and beyond mobile networks. To enhance network capacity and reliability, we propose a learning-based dynamic time-frequency division duplexing (D-TFDD) scheme that adaptively allocates the uplink and downlink time-frequency resources of base stations (BSs) to meet asymmetric and heterogeneous traffic demands while alleviating inter-cell interference. We formulate the problem as a decentralized partially observable Markov decision process (Dec-POMDP) that maximizes the long-term expected sum rate under the users' packet dropping ratio constraints. To jointly optimize the global resources in a decentralized manner, we propose a federated reinforcement learning (RL) algorithm named federated Wolpertinger deep deterministic policy gradient (FWDDPG). Each BS decides its local time-frequency configuration through an RL algorithm, and global training is achieved by exchanging local RL models with neighboring BSs under a decentralized federated learning framework. Specifically, to deal with the large-scale discrete action space of each BS, we adopt a DDPG-based algorithm to generate actions in a continuous space and then utilize the Wolpertinger policy to reduce the errors of mapping from the continuous action space back to the discrete action space. Simulation results demonstrate the superiority of the proposed algorithm over benchmark algorithms in terms of system sum rate.
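To make the continuous-to-discrete mapping concrete, below is a minimal sketch of the Wolpertinger action-selection step: the DDPG actor emits a continuous proto-action, the k nearest discrete configurations are retrieved, and the critic re-ranks them. This assumes a Euclidean embedding of the discrete uplink/downlink configurations; the function names, the placeholder critic, and all parameters are illustrative, not the paper's implementation.

```python
import numpy as np

def wolpertinger_action(proto_action, discrete_actions, critic_q, state, k=5):
    """Map a continuous proto-action from the DDPG actor to a discrete action.

    proto_action:     continuous action vector produced by the actor network.
    discrete_actions: (N, d) array enumerating the discrete action space
                      (e.g., candidate uplink/downlink time-frequency configs).
    critic_q:         callable critic_q(state, action) -> scalar Q-value.
    k:                number of nearest discrete neighbors to refine over.
    """
    # 1. Retrieve the k discrete actions closest to the proto-action (L2 distance).
    dists = np.linalg.norm(discrete_actions - proto_action, axis=1)
    candidates = discrete_actions[np.argsort(dists)[:k]]
    # 2. Let the critic re-rank the candidates and pick the highest-Q one.
    #    This refinement is what reduces the mapping error relative to
    #    simply taking the single nearest neighbor.
    q_values = [critic_q(state, a) for a in candidates]
    return candidates[int(np.argmax(q_values))]

# Toy usage with a stand-in critic (a trained critic network in practice).
rng = np.random.default_rng(0)
actions = rng.integers(0, 2, size=(64, 8)).astype(float)  # 64 binary UL/DL configs
state = rng.normal(size=16)
fake_q = lambda s, a: float(a.sum())   # hypothetical placeholder critic
proto = rng.uniform(0, 1, size=8)      # actor output in [0, 1]^8
print(wolpertinger_action(proto, actions, fake_q, state, k=5))
```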