In batch reinforcement learning, there can be poorly explored state-action pairs resulting in poorly learned, inaccurate models and poorly performing associated policies. Various regularization methods can mitigate the problem of learning overly-complex models in Markov decision processes (MDPs), however they operate in technically and intuitively distinct ways and lack a common form in which to compare them. This paper unifies three regularization methods in a common framework -- a weighted average transition matrix. Considering regularization methods in this common form illuminates how the MDP structure and the state-action pair distribution of the batch data set influence the relative performance of regularization methods. We confirm intuitions generated from the common framework by empirical evaluation across a range of MDPs and data collection policies.

### 相关内容

A multi-condition multi-objective optimization method that can find Pareto front over a defined condition space is developed for the first time using deep reinforcement learning. Unlike the conventional methods which perform optimization at a single condition, the present method learns the correlations between conditions and optimal solutions. The exclusive capability of the developed method is examined in the solutions of a novel modified Kursawe benchmark problem and an airfoil shape optimization problem which include nonlinear characteristics which are difficult to resolve using conventional optimization methods. Pareto front with high resolution over a defined condition space is successfully determined in each problem. Compared with multiple operations of a single-condition optimization method for multiple conditions, the present multi-condition optimization method based on deep reinforcement learning shows a greatly accelerated search of Pareto front by reducing the number of required function evaluations. An analysis of aerodynamics performance of airfoils with optimally designed shapes confirms that multi-condition optimization is indispensable to avoid significant degradation of target performance for varying flow conditions.

Training machine learning models in a meaningful order, from the easy samples to the hard ones, using curriculum learning can provide performance improvements over the standard training approach based on random data shuffling, without any additional computational costs. Curriculum learning strategies have been successfully employed in all areas of machine learning, in a wide range of tasks. However, the necessity of finding a way to rank the samples from easy to hard, as well as the right pacing function for introducing more difficult data can limit the usage of the curriculum approaches. In this survey, we show how these limits have been tackled in the literature, and we present different curriculum learning instantiations for various tasks in machine learning. We construct a multi-perspective taxonomy of curriculum learning approaches by hand, considering various classification criteria. We further build a hierarchical tree of curriculum learning methods using an agglomerative clustering algorithm, linking the discovered clusters with our taxonomy. At the end, we provide some interesting directions for future work.

This paper surveys the field of transfer learning in the problem setting of Reinforcement Learning (RL). RL has been the key solution to sequential decision-making problems. Along with the fast advance of RL in various domains. including robotics and game-playing, transfer learning arises as an important technique to assist RL by leveraging and transferring external expertise to boost the learning process. In this survey, we review the central issues of transfer learning in the RL domain, providing a systematic categorization of its state-of-the-art techniques. We analyze their goals, methodologies, applications, and the RL frameworks under which these transfer learning techniques would be approachable. We discuss the relationship between transfer learning and other relevant topics from an RL perspective and also explore the potential challenges as well as future development directions for transfer learning in RL.

The quest of can machines think' and can machines do what human do' are quests that drive the development of artificial intelligence. Although recent artificial intelligence succeeds in many data intensive applications, it still lacks the ability of learning from limited exemplars and fast generalizing to new tasks. To tackle this problem, one has to turn to machine learning, which supports the scientific study of artificial intelligence. Particularly, a machine learning problem called Few-Shot Learning (FSL) targets at this case. It can rapidly generalize to new tasks of limited supervised experience by turning to prior knowledge, which mimics human's ability to acquire knowledge from few examples through generalization and analogy. It has been seen as a test-bed for real artificial intelligence, a way to reduce laborious data gathering and computationally costly training, and antidote for rare cases learning. With extensive works on FSL emerging, we give a comprehensive survey for it. We first give the formal definition for FSL. Then we point out the core issues of FSL, which turns the problem from "how to solve FSL" to "how to deal with the core issues". Accordingly, existing works from the birth of FSL to the most recent published ones are categorized in a unified taxonomy, with thorough discussion of the pros and cons for different categories. Finally, we envision possible future directions for FSL in terms of problem setup, techniques, applications and theory, hoping to provide insights to both beginners and experienced researchers.

In this paper, we investigate the challenges of using reinforcement learning agents for question-answering over knowledge graphs for real-world applications. We examine the performance metrics used by state-of-the-art systems and determine that they are inadequate for such settings. More specifically, they do not evaluate the systems correctly for situations when there is no answer available and thus agents optimized for these metrics are poor at modeling confidence. We introduce a simple new performance metric for evaluating question-answering agents that is more representative of practical usage conditions, and optimize for this metric by extending the binary reward structure used in prior work to a ternary reward structure which also rewards an agent for not answering a question rather than giving an incorrect answer. We show that this can drastically improve the precision of answered questions while only not answering a limited number of previously correctly answered questions. Employing a supervised learning strategy using depth-first-search paths to bootstrap the reinforcement learning algorithm further improves performance.

In information retrieval (IR) and related tasks, term weighting approaches typically consider the frequency of the term in the document and in the collection in order to compute a score reflecting the importance of the term for the document. In tasks characterized by the presence of training data (such as text classification) it seems logical that the term weighting function should take into account the distribution (as estimated from training data) of the term across the classes of interest. Although `supervised term weighting' approaches that use this intuition have been described before, they have failed to show consistent improvements. In this article we analyse the possible reasons for this failure, and call consolidated assumptions into question. Following this criticism we propose a novel supervised term weighting approach that, instead of relying on any predefined formula, learns a term weighting function optimised on the training set of interest; we dub this approach \emph{Learning to Weight} (LTW). The experiments that we run on several well-known benchmarks, and using different learning methods, show that our method outperforms previous term weighting approaches in text classification.

Deep reinforcement learning (RL) has achieved many recent successes, yet experiment turn-around time remains a key bottleneck in research and in practice. We investigate how to optimize existing deep RL algorithms for modern computers, specifically for a combination of CPUs and GPUs. We confirm that both policy gradient and Q-value learning algorithms can be adapted to learn using many parallel simulator instances. We further find it possible to train using batch sizes considerably larger than are standard, without negatively affecting sample complexity or final performance. We leverage these facts to build a unified framework for parallelization that dramatically hastens experiments in both classes of algorithm. All neural network computations use GPUs, accelerating both data collection and training. Our results include using an entire DGX-1 to learn successful strategies in Atari games in mere minutes, using both synchronous and asynchronous algorithms.

Existing multi-agent reinforcement learning methods are limited typically to a small number of agents. When the agent number increases largely, the learning becomes intractable due to the curse of the dimensionality and the exponential growth of agent interactions. In this paper, we present Mean Field Reinforcement Learning where the interactions within the population of agents are approximated by those between a single agent and the average effect from the overall population or neighboring agents; the interplay between the two entities is mutually reinforced: the learning of the individual agent's optimal policy depends on the dynamics of the population, while the dynamics of the population change according to the collective patterns of the individual policies. We develop practical mean field Q-learning and mean field Actor-Critic algorithms and analyze the convergence of the solution to Nash equilibrium. Experiments on Gaussian squeeze, Ising model, and battle games justify the learning effectiveness of our mean field approaches. In addition, we report the first result to solve the Ising model via model-free reinforcement learning methods.

In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem, whose states relate to the features. The optimal cost function of the aggregate problem, a nonlinear function of the features, serves as an architecture for approximation in value space of the optimal cost function or the cost functions of policies of the original problem. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with reinforcement learning based on deep neural networks, which is used to obtain the needed features. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation, than by the linear function of the features provided by deep reinforcement learning, thereby potentially leading to more effective policy improvement.

Recent years have witnessed significant progresses in deep Reinforcement Learning (RL). Empowered with large scale neural networks, carefully designed architectures, novel training algorithms and massively parallel computing devices, researchers are able to attack many challenging RL problems. However, in machine learning, more training power comes with a potential risk of more overfitting. As deep RL techniques are being applied to critical problems such as healthcare and finance, it is important to understand the generalization behaviors of the trained agents. In this paper, we conduct a systematic study of standard RL agents and find that they could overfit in various ways. Moreover, overfitting could happen "robustly": commonly used techniques in RL that add stochasticity do not necessarily prevent or detect overfitting. In particular, the same agents and learning algorithms could have drastically different test performance, even when all of them achieve optimal rewards during training. The observations call for more principled and careful evaluation protocols in RL. We conclude with a general discussion on overfitting in RL and a study of the generalization behaviors from the perspective of inductive bias.

Sejin Kim,Innyoung Kim,Donghyun You
0+阅读 · 10月10日
Petru Soviany,Radu Tudor Ionescu,Paolo Rota,Nicu Sebe
16+阅读 · 1月25日
Zhuangdi Zhu,Kaixiang Lin,Jiayu Zhou
16+阅读 · 2020年9月16日
Yaqing Wang,Quanming Yao
325+阅读 · 2019年4月10日
Fréderic Godin,Anjishnu Kumar,Arpit Mittal
5+阅读 · 2019年4月3日
Alejandro Moreo Fernández,Andrea Esuli,Fabrizio Sebastiani
8+阅读 · 2019年3月28日
5+阅读 · 2019年1月10日
Yaodong Yang,Rui Luo,Minne Li,Ming Zhou,Weinan Zhang,Jun Wang
3+阅读 · 2018年6月12日
Chiyuan Zhang,Oriol Vinyals,Remi Munos,Samy Bengio
7+阅读 · 2018年4月20日

CreateAMind
11+阅读 · 2019年5月22日
CreateAMind
8+阅读 · 2019年5月18日
CreateAMind
6+阅读 · 2019年1月18日
CreateAMind
7+阅读 · 2019年1月7日
CreateAMind
29+阅读 · 2019年1月3日

23+阅读 · 2018年11月1日

21+阅读 · 2018年5月27日
CreateAMind
16+阅读 · 2018年5月25日
CreateAMind
9+阅读 · 2018年4月27日
CreateAMind
9+阅读 · 2017年7月21日
Top