We study reinforcement learning for the optimal control of Branching Markov Decision Processes (BMDPs), a natural extension of (multitype) Branching Markov Chains (BMCs). The state of a (discrete-time) BMC is a collection of entities of various types that, while spawning other entities, generate a payoff. In comparison with BMCs, where every entity of a given type evolves according to the same probabilistic pattern, BMDPs allow an external controller to pick from a range of options for each entity. This permits us to study the best and worst behaviour of the system. We generalise model-free reinforcement learning techniques to compute an optimal control strategy of an unknown BMDP in the limit. We present results of an implementation that demonstrate the practicality of the approach.
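To make the setting concrete, the following is a minimal sketch, not the algorithm of the paper: a hand-crafted toy BMDP in which each (type, action) pair yields a distribution over immediate payoffs and multisets of offspring, together with a Q-learning-style update in which the estimated value of an entity is its immediate payoff plus the summed value estimates of its offspring, maximised over their actions. All type names, actions, probabilities and payoffs below are illustrative assumptions.

```python
import random
from collections import defaultdict

# Hypothetical toy BMDP: for each (type, action) a list of
# (probability, payoff, offspring) outcomes.  Purely illustrative.
BMDP = {
    ("A", "safe"):  [(1.0, 1.0, [])],                      # terminate, payoff 1
    ("A", "risky"): [(0.5, 0.0, ["A", "B"]), (0.5, 3.0, [])],
    ("B", "only"):  [(0.7, 2.0, []), (0.3, 0.0, ["B"])],
}
ACTIONS = defaultdict(list)
for (t, a) in BMDP:
    ACTIONS[t].append(a)

def sample(t, a):
    """Sample one branching step for an entity of type t under action a."""
    r, acc = random.random(), 0.0
    for p, payoff, children in BMDP[(t, a)]:
        acc += p
        if r <= acc:
            return payoff, children
    return BMDP[(t, a)][-1][1:]

# Q-values: estimated optimal expected total payoff generated by
# a single entity of a given type under a given action.
Q = defaultdict(float)

def learn(episodes=50_000, alpha=0.05, eps=0.1):
    for _ in range(episodes):
        for t in ACTIONS:                       # treat every type as a start state
            # epsilon-greedy action choice
            if random.random() < eps:
                a = random.choice(ACTIONS[t])
            else:
                a = max(ACTIONS[t], key=lambda x: Q[(t, x)])
            payoff, children = sample(t, a)
            # branching Bellman backup: the value of the spawned entities
            # is the sum of their current optimal value estimates
            target = payoff + sum(max(Q[(c, b)] for b in ACTIONS[c])
                                  for c in children)
            Q[(t, a)] += alpha * (target - Q[(t, a)])

learn()
for t in ACTIONS:
    best = max(ACTIONS[t], key=lambda a: Q[(t, a)])
    print(t, best, round(Q[(t, best)], 2))
```

In this toy instance the exact optimal values are 5 for type A (choosing "risky") and 2 for type B, so the printed estimates give a quick sanity check of the branching backup.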