In this paper, we consider jointly optimizing cell load balance and network throughput via a reinforcement learning (RL) approach, where inter-cell handover (i.e., user association assignment) and massive MIMO antenna tilting are configured as the RL policy to learn. Our rationale for using RL is to circumvent the challenges of analytically modeling user mobility and network dynamics. To accomplish this joint optimization, we integrate vector rewards into the RL value network and conduct the RL action via a separate policy network. We name this method Pareto deterministic policy gradients (PDPG). It is an actor-critic, model-free, deterministic policy algorithm that handles the coupled objectives with two merits: 1) it solves the optimization by leveraging the degrees of freedom of the vector reward, as opposed to choosing a handcrafted scalar reward; 2) cross-validation over multiple policies is significantly reduced. Accordingly, the RL-enabled network behaves in a self-organized way: it learns the underlying user mobility from measurement history to proactively operate handover and antenna tilting without assumptions about the environment. Our numerical evaluation demonstrates that the introduced RL method outperforms scalar-reward based approaches. Meanwhile, to be self-contained, an ideal brute-force search solver based on static optimization is included as a benchmark. The comparison shows that the RL approach performs as well as this ideal strategy, even though the former is constrained by limited environment observations and a lower action frequency, whereas the latter has full access to the user mobility. The convergence of the introduced approach is also tested under different user mobility environments based on our measurement data from a real scenario.
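As a minimal sketch of the actor-critic structure described above (not the authors' implementation), the snippet below shows a DDPG-style setup in which the critic outputs a vector of Q-values, one per objective (e.g., load balance and throughput), and a separate deterministic policy network produces the handover/tilt action. The state and action dimensions, network sizes, and the weights used to combine the vector Q-values in the policy update are illustrative assumptions only.

```python
# Hypothetical sketch of a vector-reward actor-critic (PDPG-like), PyTorch.
# Dimensions, architectures, and trade-off weights are assumptions, not the paper's.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NUM_OBJECTIVES = 32, 8, 2  # assumed sizes

class Actor(nn.Module):
    """Deterministic policy: maps the observed network state to handover/tilt actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM), nn.Tanh(),  # bounded continuous actions
        )

    def forward(self, state):
        return self.net(state)

class VectorCritic(nn.Module):
    """Critic with one Q-value head per objective (vector reward instead of a scalar)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, NUM_OBJECTIVES),  # e.g., [Q_load_balance, Q_throughput]
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Deterministic policy gradient step: the vector Q is combined only at the
# policy update; the weights below are a placeholder for the Pareto trade-off.
actor, critic = Actor(), VectorCritic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
weights = torch.tensor([0.5, 0.5])            # assumed objective weights

state = torch.randn(64, STATE_DIM)            # dummy batch of observed states
q_vec = critic(state, actor(state))           # shape (64, NUM_OBJECTIVES)
actor_loss = -(q_vec * weights).sum(dim=-1).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```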