Online Network Resource Allocation (ONRA) for service provisioning is a fundamental problem in communication networks. Since ONRA is a problem of sequential decision-making under uncertainty, Reinforcement Learning (RL) is a promising approach to it. However, RL solutions suffer from high sample complexity; i.e., a large number of interactions with the environment is needed to find an efficient policy. This is a barrier to utilizing RL for ONRA: on the one hand, it is not practical to train the RL agent offline due to the lack of information about future requests; on the other hand, online training in the real network leads to significant performance loss because the policy remains sub-optimal during the prolonged learning time. This performance degradation is even higher in non-stationary ONRA, where the agent must continually adapt its policy to changes in the service requests. To deal with this issue, we develop a general resource allocation framework, named RADAR, using model-based RL for a class of ONRA problems in which the immediate reward of each action is known. RADAR improves sample efficiency by exploring the state space in the background and exploiting the policy at decision time using synthetic samples generated by a model of the environment, which is trained on real interactions. Applying RADAR to the multi-domain service federation problem, where the goal is to maximize profit by selecting proper domains for deploying service requests, demonstrates its continual learning capability and up to 44% performance improvement over the standard model-free RL solution.
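To make the core idea concrete, the following is a minimal, self-contained sketch of the general mechanism the abstract describes: real interactions train a model of the environment, and synthetic samples drawn from that model are replayed in the background to improve the policy (a Dyna-style loop). The toy environment, state/action spaces, and hyperparameters below are illustrative assumptions, not the RADAR implementation or its service-federation environment.

```python
import random
from collections import defaultdict


class ToyEnv:
    """Hypothetical stand-in for an online resource-allocation environment:
    states abstract the request context, actions are candidate domains, and
    the reward of each action is known immediately (as assumed for RADAR)."""

    def __init__(self, n_states=5, n_actions=3, seed=0):
        self.rng = random.Random(seed)
        self.n_states, self.n_actions = n_states, n_actions
        self.state = 0

    def step(self, action):
        # Toy reward: the "right" domain depends on the current request state.
        reward = 1.0 if action == self.state % self.n_actions else 0.0
        self.state = self.rng.randrange(self.n_states)
        return self.state, reward


def dyna_style_learning(env, steps=200, planning_steps=20,
                        alpha=0.1, gamma=0.9, eps=0.1):
    q = defaultdict(float)   # Q(s, a) estimates for the decision-time policy
    model = {}               # learned environment model: (s, a) -> (s', r)
    seen = []                # visited (s, a) pairs to sample from in planning
    rng = random.Random(1)
    s = env.state
    for _ in range(steps):
        # Decision time: epsilon-greedy action selection from the current policy.
        if rng.random() < eps:
            a = rng.randrange(env.n_actions)
        else:
            a = max(range(env.n_actions), key=lambda x: q[(s, x)])
        s2, r = env.step(a)  # one real interaction with the environment
        best_next = max(q[(s2, x)] for x in range(env.n_actions))
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
        # Train the environment model from the real sample.
        if (s, a) not in model:
            seen.append((s, a))
        model[(s, a)] = (s2, r)
        # Background exploration: replay synthetic samples from the model.
        for _ in range(planning_steps):
            ps, pa = rng.choice(seen)
            ps2, pr = model[(ps, pa)]
            pbest = max(q[(ps2, x)] for x in range(env.n_actions))
            q[(ps, pa)] += alpha * (pr + gamma * pbest - q[(ps, pa)])
        s = s2
    return q


if __name__ == "__main__":
    learned_q = dyna_style_learning(ToyEnv())
    print({k: round(v, 2) for k, v in sorted(learned_q.items())[:6]})
```

Because each real interaction seeds many synthetic updates, the policy improves with far fewer environment interactions than purely model-free learning, which is the sample-efficiency argument made above.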