The tutorial is written for those who would like an introduction to reinforcement learning (RL). The aim is to provide an intuitive presentation of the ideas rather than concentrate on the deeper mathematics underlying the topic. RL is generally used to solve the so-called Markov decision problem (MDP). In other words, the problem that you are attempting to solve with RL should be an MDP or its variant. The theory of RL relies on dynamic programming (DP) and artificial intelligence (AI). We will begin with a quick description of MDPs. We will discuss what we mean by “complex” and “large-scale” MDPs. Then we will explain why RL is needed to solve complex and large-scale MDPs. The semi-Markov decision problem (SMDP) will also be covered.
The tutorial is meant to serve as an introduction to these topics and is based mostly on the book: “Simulation-based optimization: Parametric Optimization techniques and reinforcement learning” . The book discusses this topic in greater detail in the context of simulators. There are at least two other textbooks that I would recommend you to read: (i) Neuro-dynamic programming  (lots of details on convergence analysis) and (ii) Reinforcement Learning: An Introduction  (lots of details on underlying AI concepts). A more recent tutorial on this topic is . This tutorial has 2 sections: • Section 2 discusses MDPs and SMDPs. • Section 3 discusses RL. By the end of this tutorial, you should be able to • Identify problem structures that can be set up as MDPs / SMDPs. • Use some RL algorithms.
强化一词来源于实验心理学中对动物学习的研究，它指的是某一事件的发生，与某一反应之间有恰当的关系，而这一事件往往会增加该反应在相同情况下再次发生的可能性。虽然心理学家没有使用“强化学习”这个术语，但它已经被人工智能和工程领域的理论家广泛采用，用来指代基于这一强化原理的学习任务和算法。最简单的强化学习方法使用的是一个常识，即如果一个行为之后出现了一个令人满意的状态，或者一个状态的改善，那么产生该行为的倾向就会得到加强。强化学习的概念在工程领域已经存在了几十年(如Mendel和McClaren 1970)，在人工智能领域也已经存在了几十年(Minsky 1954, 1961;撒母耳1959;图灵1950)。然而，直到最近，强化学习方法的发展和应用才在这些领域占据了大量的研究人员。激发这种兴趣的是两个基本的挑战:1) 设计能够在复杂动态环境中在不确定性下运行的自主机器人代理，2) 为非常大规模的动态决策问题找到有用的近似解。
Meta-learning, or learning to learn, is the science of systematically observing how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks much faster than otherwise possible. Not only does this dramatically speed up and improve the design of machine learning pipelines or neural architectures, it also allows us to replace hand-engineered algorithms with novel approaches learned in a data-driven way. In this chapter, we provide an overview of the state of the art in this fascinating and continuously evolving field.
This paper presents the first two editions of Visual Doom AI Competition, held in 2016 and 2017. The challenge was to create bots that compete in a multi-player deathmatch in a first-person shooter (FPS) game, Doom. The bots had to make their decisions based solely on visual information, i.e., a raw screen buffer. To play well, the bots needed to understand their surroundings, navigate, explore, and handle the opponents at the same time. These aspects, together with the competitive multi-agent aspect of the game, make the competition a unique platform for evaluating the state of the art reinforcement learning algorithms. The paper discusses the rules, solutions, results, and statistics that give insight into the agents' behaviors. Best-performing agents are described in more detail. The results of the competition lead to the conclusion that, although reinforcement learning can produce capable Doom bots, they still are not yet able to successfully compete against humans in this game. The paper also revisits the ViZDoom environment, which is a flexible, easy to use, and efficient 3D platform for research for vision-based reinforcement learning, based on a well-recognized first-person perspective game Doom.
This manuscript surveys reinforcement learning from the perspective of optimization and control with a focus on continuous control applications. It surveys the general formulation, terminology, and typical experimental implementations of reinforcement learning and reviews competing solution paradigms. In order to compare the relative merits of various techniques, this survey presents a case study of the Linear Quadratic Regulator (LQR) with unknown dynamics, perhaps the simplest and best studied problem in optimal control. The manuscript describes how merging techniques from learning theory and control can provide non-asymptotic characterizations of LQR performance and shows that these characterizations tend to match experimental behavior. In turn, when revisiting more complex applications, many of the observed phenomena in LQR persist. In particular, theory and experiment demonstrate the role and importance of models and the cost of generality in reinforcement learning algorithms. This survey concludes with a discussion of some of the challenges in designing learning systems that safely and reliably interact with complex and uncertain environments and how tools from reinforcement learning and controls might be combined to approach these challenges.