In this paper, we investigate a sequential power allocation problem over fast-varying channels, aiming to minimize the expected sum power while guaranteeing the transmission success probability. In particular, a reinforcement learning framework is constructed with an appropriate reward design so that the optimal policy maximizes the Lagrangian of the primal problem, and the maximizer of the Lagrangian is shown to possess several desirable properties. For the model-based case, a fast-converging algorithm is proposed to find the optimal Lagrange multiplier and hence the corresponding optimal policy. For the model-free case, we develop a three-stage strategy consisting, in order, of online sampling, offline learning, and online operation, in which a backward Q-learning scheme that fully exploits the sampled channel realizations is designed to accelerate the learning process. Simulation results show that the proposed reinforcement learning framework solves the primal optimization problem from the dual perspective, and that the model-free strategy achieves performance close to that of the optimal model-based algorithm.
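To make the Lagrangian-based reward design and the backward, sample-reusing Q-learning concrete, the following minimal Python sketch illustrates one possible instantiation. It is not the paper's algorithm: the horizon, the discretized power levels and channel gains, the bit-delivery model, and the value of the Lagrange multiplier `lam` are all assumptions made purely for illustration. The per-slot reward is the negative transmit power, and a terminal bonus of `lam` is granted on successful delivery, so the greedy policy with respect to the learned Q-function maximizes (an empirical estimate of) the Lagrangian.

```python
import numpy as np

# Illustrative sketch only; discretization, horizon, and lam are assumptions.
T = 4                                        # number of transmission slots (assumed horizon)
powers = np.array([0.0, 0.5, 1.0, 2.0])      # discretized power levels (assumption)
gains = np.array([0.2, 1.0, 3.0])            # discretized channel gains (assumption)
target_bits = 3.0                            # bits to deliver within T slots (assumption)
lam = 5.0                                    # Lagrange multiplier (found by the dual search)

rng = np.random.default_rng(0)
samples = rng.choice(len(gains), size=(1000, T))   # offline pool of sampled channel realizations

# Q[t, remaining-bits bucket, channel-gain index, power index]
bits_grid = np.linspace(0.0, target_bits, 7)
Q = np.zeros((T, len(bits_grid), len(gains), len(powers)))

def bucket(b):
    """Map a remaining-bits value to the nearest grid point."""
    return int(np.argmin(np.abs(bits_grid - b)))

alpha = 0.1
for episode in samples:                      # reuse every sampled channel realization
    for t in reversed(range(T)):             # backward sweep: later slots are learned first
        for bi, b in enumerate(bits_grid):
            g = gains[episode[t]]
            for pi, p in enumerate(powers):
                rate = np.log2(1.0 + g * p)  # bits delivered in this slot
                b_next = max(b - rate, 0.0)
                if t == T - 1:               # terminal Lagrangian reward: success bonus
                    target = -p + lam * (b_next <= 0.0)
                else:                        # bootstrap from the already-updated next slot
                    target = -p + Q[t + 1, bucket(b_next), episode[t + 1]].max()
                Q[t, bi, episode[t], pi] += alpha * (target - Q[t, bi, episode[t], pi])

# Greedy power choice in slot 0 with the full payload remaining, for each channel state
print(powers[Q[0, bucket(target_bits)].argmax(axis=1)])
```

In this sketch, the backward sweep means each slot bootstraps from Q-values that have already been updated within the same pass over a sampled realization, which is one way to read "full exploitation of sampled channel realizations"; the paper's actual state space, sampling procedure, and update rule may differ.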