Bandit approach to conflict-free multi-agent Q-learning in view of photonic implementation

Photonic reinforcement learning, which exploits the physical nature of light to accelerate computation, has recently been studied extensively. Previous studies used quantum interference of photons to achieve collective decision-making without choice conflicts when solving the competitive multi-armed bandit problem, a fundamental example of reinforcement learning. However, the bandit problem deals with a static environment in which the agent's actions do not influence the reward probabilities. This study aims to extend the conventional approach to a more general multi-agent reinforcement learning setting, targeting the grid world problem. Unlike the conventional approach, the proposed scheme deals with a dynamic environment in which the reward changes as a result of the agents' actions. A successful photonic reinforcement learning scheme requires both a photonic system that contributes to the quality of learning and a suitable algorithm. This study proposes a novel learning algorithm, discontinuous bandit Q-learning, in view of a potential photonic implementation. Here, state-action pairs in the environment are regarded as slot machines in the context of the bandit problem, and the amount by which a Q-value is updated is regarded as the reward of the bandit problem. We perform numerical simulations to validate the effectiveness of the bandit algorithm. In addition, we propose a multi-agent architecture in which agents are indirectly connected through quantum interference of light, and quantum principles ensure that the state-action pair selections of the agents are conflict-free. We demonstrate that multi-agent reinforcement learning can be accelerated owing to conflict avoidance among multiple agents.
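To make the mapping from Q-learning to a bandit problem concrete, the following is a minimal Python sketch, not the authors' implementation. It treats every state-action pair of a small grid world as a slot machine and uses the magnitude of the latest Q-value update as that machine's reward. The 4x4 deterministic grid, the epsilon-greedy rule with optimistic initial estimates, all parameter values, and the names `make_grid_world` and `bandit_q_learning` are assumptions made for illustration. The conflict-free selection, which the abstract attributes to quantum interference of light, is only emulated here classically by forcing agents to pick distinct pairs.

```python
import random


def make_grid_world(width=4, height=4, goal=(3, 3)):
    """Hypothetical deterministic grid world with one absorbing goal cell."""
    actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    states = [(x, y) for x in range(width) for y in range(height)]

    def step(state, action):
        if state == goal:
            return state, 0.0  # absorbing goal: no further reward
        nx = min(max(state[0] + action[0], 0), width - 1)
        ny = min(max(state[1] + action[1], 0), height - 1)
        nxt = (nx, ny)
        return nxt, (1.0 if nxt == goal else 0.0)

    return states, actions, step


def bandit_q_learning(n_agents=2, n_rounds=2000, alpha=0.5, gamma=0.9,
                      epsilon=0.2, seed=0):
    """Sketch: choose state-action pairs like bandit arms, reward = |delta Q|."""
    rng = random.Random(seed)
    states, actions, step = make_grid_world()
    arms = [(s, a) for s in states for a in actions]  # one "slot machine" per pair
    q = {arm: 0.0 for arm in arms}
    # Bandit estimate per arm: magnitude of its latest Q update.
    # The optimistic initial value (assumption) makes every arm get tried once.
    est = {arm: 1.0 for arm in arms}
    chosen = []
    for _round in range(n_rounds):
        ranked = sorted(arms, key=lambda arm: est[arm], reverse=True)
        chosen = []
        for _agent in range(n_agents):
            # Classical stand-in for conflict-free selection: an agent never
            # picks a pair that another agent already took in this round.
            if rng.random() < epsilon:
                free = [arm for arm in arms if arm not in chosen]
                chosen.append(rng.choice(free))
            else:
                chosen.append(next(arm for arm in ranked if arm not in chosen))
        for s, a in chosen:
            # Pairs are updated wherever the bandit points, not along a trajectory.
            nxt, r = step(s, a)
            target = r + gamma * max(q[(nxt, b)] for b in actions)
            delta = alpha * (target - q[(s, a)])
            q[(s, a)] += delta
            est[(s, a)] = abs(delta)  # bandit reward: size of the Q update
    return q, chosen
```

Calling `q, last_round = bandit_q_learning()` returns the learned Q-table and the pairs selected in the final round. Because no two agents update the same pair in a round, every selection contributes a distinct update, which is the intuition behind the claimed acceleration from conflict avoidance.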
