In this paper we provide analysis and insights, often based on visualization, that explain the beneficial effects of on-line decision making on top of off-line training. In particular, through a unifying abstract mathematical framework, we show that the principal AlphaZero/TD-Gammon ideas of approximation in value space and rollout apply very broadly to deterministic and stochastic optimal control problems involving both discrete and continuous search spaces. Moreover, these ideas can be effectively integrated with other important methodologies, such as model predictive control, adaptive control, decentralized control, discrete and Bayesian optimization, neural network-based value and policy approximations, and heuristic algorithms for discrete optimization.
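To make the idea of rollout with one-step lookahead concrete, the following is a minimal sketch on a toy deterministic finite-horizon problem. Everything in it is an illustrative assumption rather than material from the paper: the integer state, the control set `CONTROLS`, the dynamics `step`, the cost `stage_cost`, the zero terminal cost, and the "do nothing" `base_policy` are all hypothetical. At each state the rollout controller picks the control that minimizes the first-stage cost plus the cost of following the base policy afterwards, which is approximation in value space with the base policy's cost serving as the cost function approximation.

```python
from typing import Callable, Sequence

# Hypothetical toy problem (not from the paper): integer state, three controls.
CONTROLS: Sequence[int] = (-1, 0, 1)


def step(x: int, u: int) -> int:
    """Deterministic dynamics x_{k+1} = f(x_k, u_k); here simply x + u."""
    return x + u


def stage_cost(x: int, u: int) -> float:
    """Stage cost g(x, u): penalize distance from the origin and control effort."""
    return float(x * x + abs(u))


def base_policy(x: int) -> int:
    """A deliberately poor base heuristic: never move."""
    return 0


def policy_cost(x: int, policy: Callable[[int], int], steps: int) -> float:
    """Cost of following `policy` for `steps` stages from x (zero terminal cost)."""
    total = 0.0
    for _ in range(steps):
        u = policy(x)
        total += stage_cost(x, u)
        x = step(x, u)
    return total


def rollout_control(x: int, steps_left: int) -> int:
    """One-step lookahead: minimize g(x, u) plus the base policy's cost-to-go."""
    return min(
        CONTROLS,
        key=lambda u: stage_cost(x, u)
        + policy_cost(step(x, u), base_policy, steps_left - 1),
    )


def rollout_cost(x: int, horizon: int) -> float:
    """Total cost incurred by applying the rollout controller on-line."""
    total = 0.0
    for k in range(horizon):
        u = rollout_control(x, horizon - k)
        total += stage_cost(x, u)
        x = step(x, u)
    return total


if __name__ == "__main__":
    x0, horizon = 3, 5
    print("base policy cost:", policy_cost(x0, base_policy, horizon))
    print("rollout cost:    ", rollout_cost(x0, horizon))
```

In this toy instance the rollout controller steers the state to the origin and incurs a lower total cost than the base policy it is built on, which illustrates the kind of improvement that on-line lookahead can provide over an off-line obtained policy. A trained value or policy network could play the role of `base_policy` or of the cost-to-go estimate; that substitution is not shown here.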