In this paper, we propose a deep state-action-reward-state-action (SARSA) $\lambda$ learning approach for optimising the uplink resource allocation in non-orthogonal multiple access (NOMA) aided ultra-reliable low-latency communication (URLLC). To reduce the mean decoding error probability in time-varying network environments, this work designs a reliable learning algorithm for providing a long-term resource allocation policy, where the reward feedback is based on the instantaneous network performance. With the aid of the proposed algorithm, this paper addresses three main challenges of reliable resource sharing in NOMA-URLLC networks: 1) user clustering; 2) instantaneous feedback system; and 3) optimal resource allocation. All of these designs interact with the considered communication environment. Finally, we compare the performance of the proposed algorithm with conventional Q-learning and SARSA learning algorithms. The simulation outcomes show that: 1) compared with traditional Q-learning algorithms, the proposed solution is able to converge within 200 episodes while providing a long-term mean error probability as low as $10^{-2}$; 2) NOMA-assisted URLLC outperforms traditional OMA systems in terms of decoding error probabilities; and 3) the proposed feedback system is efficient for the long-term learning process.
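For context, the update rule underlying tabular SARSA($\lambda$) with accumulating eligibility traces is sketched below; this is the generic textbook form, not the paper's deep variant (which replaces the tabular value function with a neural-network approximator), and the symbols $\alpha$, $\gamma$, $e_t$ are standard notation rather than quantities defined in this abstract:
\begin{align*}
\delta_t &= r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), \\
e_t(s,a) &= \gamma \lambda\, e_{t-1}(s,a) + \mathbb{1}\{(s,a) = (s_t, a_t)\}, \\
Q(s,a) &\leftarrow Q(s,a) + \alpha\, \delta_t\, e_t(s,a) \quad \forall (s,a),
\end{align*}
where $\delta_t$ is the temporal-difference error, $\lambda$ controls how far credit from the instantaneous reward propagates back along the visited state-action trajectory, and $\alpha$ is the learning rate.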