We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.
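
To make the centralized-critic idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: each agent's critic conditions on every agent's observation and action, while each actor sees only its own observation. All class names, dimensions, and network sizes are illustrative.

```python
# Minimal sketch (illustrative, not the authors' code) of a centralized critic
# with decentralized actors: critics see all observations and actions, actors
# see only their own observation.
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, ACT_DIM = 3, 8, 2

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACT_DIM), nn.Tanh())
    def forward(self, obs):                       # obs: (batch, OBS_DIM)
        return self.net(obs)                      # continuous action

class CentralizedCritic(nn.Module):
    """Q_i(o_1..o_N, a_1..a_N): conditions on all observations and actions."""
    def __init__(self):
        super().__init__()
        in_dim = N_AGENTS * (OBS_DIM + ACT_DIM)
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, all_obs, all_acts):         # lists of (batch, dim) tensors
        x = torch.cat(all_obs + all_acts, dim=-1)
        return self.net(x)

actors  = [Actor() for _ in range(N_AGENTS)]
critics = [CentralizedCritic() for _ in range(N_AGENTS)]

obs = [torch.randn(32, OBS_DIM) for _ in range(N_AGENTS)]
acts = [actor(o) for actor, o in zip(actors, obs)]
q_values = [critic(obs, acts) for critic in critics]   # one Q-value per agent
print([q.shape for q in q_values])
```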

Related content

The INFORMS Journal on Computing publishes high-quality papers that expand the scope of operations research and computing. It seeks original research papers on theory, methodology, experiments, systems, and applications, as well as novel survey and tutorial papers and papers describing new and useful software tools. Official website: https://pubsonline.informs.org/journal/ijoc

Recently, deep multiagent reinforcement learning (MARL) has become a highly active research area, as many real-world problems can be inherently viewed as multiagent systems. A particularly interesting and widely applicable class of problems is the partially observable cooperative multiagent setting, in which a team of agents learns to coordinate their behaviors conditioned on their private observations and a commonly shared global reward signal. One natural solution is the centralized training and decentralized execution paradigm. During centralized training, a key challenge is multiagent credit assignment: how to allocate the global reward among individual agents' policies so that coordination improves system-level performance. In this paper, we propose a new method called Q-value Path Decomposition (QPD) to decompose the system's global Q-values into individual agents' Q-values. Unlike previous works that restrict the representational relationship between the individual Q-values and the global one, we bring the integrated gradients attribution technique into deep MARL to decompose global Q-values directly along trajectory paths and assign credits to agents. We evaluate QPD on the challenging StarCraft II micromanagement tasks and show that it achieves state-of-the-art performance in both homogeneous and heterogeneous multiagent scenarios compared with existing cooperative MARL algorithms.
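
The core attribution step can be sketched with integrated gradients. The PyTorch snippet below is purely illustrative (the network, dimensions, and baseline are stand-ins, not the QPD architecture): a global Q-value is attributed to each agent's input features by integrating gradients along a straight path from a baseline, and each agent's credit is the sum of its features' attributions.

```python
# Illustrative sketch (not the authors' code) of credit assignment via
# integrated gradients over a global Q-function.
import torch

N_AGENTS, FEAT_DIM, STEPS = 3, 4, 50
q_net = torch.nn.Sequential(torch.nn.Linear(N_AGENTS * FEAT_DIM, 64),
                            torch.nn.ReLU(), torch.nn.Linear(64, 1))

x = torch.randn(N_AGENTS * FEAT_DIM)        # joint features at this state
baseline = torch.zeros_like(x)              # reference input

# Riemann approximation of the path integral of dQ/dx from baseline to x.
grads = torch.zeros_like(x)
for k in range(1, STEPS + 1):
    point = (baseline + (k / STEPS) * (x - baseline)).requires_grad_(True)
    q = q_net(point)[0]
    grads += torch.autograd.grad(q, point)[0]
attributions = (x - baseline) * grads / STEPS          # per-feature credit

# Sum each agent's feature attributions to obtain its share of the global Q.
per_agent_credit = attributions.view(N_AGENTS, FEAT_DIM).sum(dim=1)
print(per_agent_credit)
print(per_agent_credit.sum().item(),
      (q_net(x) - q_net(baseline)).item())   # approximately equal (completeness)
```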

We propose a method to efficiently learn diverse strategies in reinforcement learning for query reformulation in the tasks of document retrieval and question answering. In the proposed framework an agent consists of multiple specialized sub-agents and a meta-agent that learns to aggregate the answers from sub-agents to produce a final answer. Sub-agents are trained on disjoint partitions of the training data, while the meta-agent is trained on the full training set. Our method makes learning faster, because it is highly parallelizable, and has better generalization performance than strong baselines, such as an ensemble of agents trained on the full data. We show that the improved performance is due to the increased diversity of reformulation strategies.
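
A toy Python sketch of the ensemble structure, with stand-in data and "training" (nothing here comes from the paper): sub-agents are fit on disjoint shards in parallel, and a meta-agent aggregates their answers into a final one.

```python
# Schematic sketch of sub-agents trained on disjoint partitions plus a
# meta-agent that aggregates their answers. Data and "training" are fake.
import numpy as np

rng = np.random.default_rng(0)
full_data = rng.standard_normal((1000, 16))            # stand-in for queries
partitions = np.array_split(full_data, 4)              # disjoint shards

def train_sub_agent(shard):
    # Stand-in "training": each sub-agent ends up with its own projection.
    return shard.mean(axis=0) + 0.1 * rng.standard_normal(16)

sub_agents = [train_sub_agent(shard) for shard in partitions]  # parallelizable

def sub_agent_answer(weights, query):
    return float(weights @ query)                       # scalar answer score

def meta_agent(query):
    # The meta-agent sees all sub-agent answers; a learned aggregator is
    # replaced here by a plain mean for brevity.
    answers = np.array([sub_agent_answer(w, query) for w in sub_agents])
    return answers.mean()

print(meta_agent(full_data[0]))
```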

Fuzzing is a commonly used technique designed to test software by automatically crafting program inputs. Currently, the most successful fuzzing algorithms emphasize simple, low-overhead strategies with the ability to efficiently monitor program state during execution. Through compile-time instrumentation, these approaches have access to numerous aspects of program state including coverage, data flow, and heterogeneous fault detection and classification. However, existing approaches utilize blind random mutation strategies when generating test inputs. We present a different approach that uses this state information to optimize mutation operators using reinforcement learning (RL). By integrating OpenAI Gym with libFuzzer we are able to simultaneously leverage advancements in reinforcement learning as well as fuzzing to achieve deeper coverage across several varied benchmarks. Our technique connects the rich, efficient program monitors provided by LLVM Sanitizers with a deep neural net to learn mutation selection strategies directly from the input data. The cross-language, asynchronous architecture we developed enables us to apply any OpenAI Gym compatible deep reinforcement learning algorithm to any fuzzing problem with minimal slowdown.
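
To illustrate the framing, here is a hypothetical Gym-style environment in which the action chooses a mutation operator and the reward is coverage gain. The fuzzer and coverage map are faked with random data; in FuzzerGym this state would come from libFuzzer and the LLVM sanitizer instrumentation.

```python
# Toy sketch (not the authors' system): an environment following the Gym
# reset/step interface where actions select mutation operators and the reward
# is the number of newly covered edges.
import numpy as np

class ToyFuzzEnv:
    N_MUTATORS = 8          # action space: which mutation operator to apply
    MAP_SIZE = 256          # size of the (fake) coverage bitmap

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.coverage = np.zeros(self.MAP_SIZE, dtype=bool)
        return self._observation()

    def step(self, action):
        # Pretend the chosen mutator exercises a few random edges.
        hits = self.rng.integers(0, self.MAP_SIZE, size=4 + action)
        newly_covered = (~self.coverage[hits]).sum()
        self.coverage[hits] = True
        reward = float(newly_covered)            # reward = coverage gain
        done = self.coverage.all()
        return self._observation(), reward, done, {}

    def _observation(self):
        return self.coverage.astype(np.float32)

env = ToyFuzzEnv()
obs = env.reset()
obs, reward, done, info = env.step(action=3)
print(reward)
```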

This paper presents a new multi-objective deep reinforcement learning (MODRL) framework based on deep Q-networks. We propose the use of linear and non-linear methods to develop the MODRL framework, which includes both single-policy and multi-policy strategies. Experimental results on two benchmark problems, the two-objective deep sea treasure environment and the three-objective mountain car problem, indicate that the proposed framework converges effectively to the optimal Pareto solutions. The framework is generic, allowing different deep reinforcement learning algorithms to be implemented in different complex environments, and therefore overcomes many of the difficulties of standard multi-objective reinforcement learning (MORL) methods in the current literature. It also serves as a testbed for developing methods that address open problems in MORL. Details of the framework implementation are available at http://www.deakin.edu.au/~thanhthi/drl.htm.
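
As a concrete example of the single-policy, linear-scalarization strategy, the NumPy sketch below keeps a vector-valued Q-table (one entry per objective) and selects actions by a preference-weighted sum. It is a tabular stand-in, not the deep framework itself, and all sizes and weights are invented.

```python
# Minimal illustrative sketch of linearly scalarized multi-objective Q-learning.
import numpy as np

N_STATES, N_ACTIONS, N_OBJECTIVES = 10, 4, 2
weights = np.array([0.7, 0.3])                 # preference over objectives
q_table = np.zeros((N_STATES, N_ACTIONS, N_OBJECTIVES))
alpha, gamma = 0.1, 0.95

def select_action(state):
    scalarized = q_table[state] @ weights      # (N_ACTIONS,) scalarized values
    return int(np.argmax(scalarized))

def update(state, action, reward_vec, next_state):
    # TD update applied component-wise to the vector of objective returns.
    best_next = select_action(next_state)
    target = reward_vec + gamma * q_table[next_state, best_next]
    q_table[state, action] += alpha * (target - q_table[state, action])

update(state=0, action=select_action(0),
       reward_vec=np.array([1.0, -0.2]), next_state=1)
print(q_table[0])
```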

Existing multi-agent reinforcement learning methods are typically limited to a small number of agents. When the number of agents grows large, learning becomes intractable due to the curse of dimensionality and the exponential growth of agent interactions. In this paper, we present Mean Field Reinforcement Learning, in which the interactions within the population of agents are approximated by those between a single agent and the average effect of the overall population or of neighboring agents. The interplay between the two entities is mutually reinforcing: the learning of an individual agent's optimal policy depends on the dynamics of the population, while the dynamics of the population change according to the collective patterns of the individual policies. We develop practical mean field Q-learning and mean field Actor-Critic algorithms and analyze the convergence of the solution to a Nash equilibrium. Experiments on Gaussian squeeze, the Ising model, and battle games demonstrate the learning effectiveness of our mean field approaches. In addition, we report the first result of solving the Ising model with model-free reinforcement learning methods.
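
The mean-field approximation itself is simple to sketch. In the NumPy snippet below (a schematic stand-in, not the authors' code), an agent summarizes its neighbors by the mean of their one-hot actions, evaluates a Q-function of its own action and that mean action, and acts with a Boltzmann policy.

```python
# Schematic sketch of the mean-field step: Q conditions on the agent's own
# action and on the neighbors' mean action. The Q-function is a fixed random
# linear map used only for illustration.
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, N_ACTIONS, STATE_DIM, TEMP = 20, 5, 3, 0.5

W_s = rng.standard_normal((N_ACTIONS, STATE_DIM))
W_m = rng.standard_normal((N_ACTIONS, N_ACTIONS))

def q_values(state, mean_action):
    return W_s @ state + W_m @ mean_action     # one value per own action

def boltzmann(q):
    z = np.exp((q - q.max()) / TEMP)
    return z / z.sum()

state = rng.standard_normal(STATE_DIM)
actions = rng.integers(0, N_ACTIONS, size=N_AGENTS)     # current joint action

# Mean-field step: agent 0 summarizes its neighbors by their mean one-hot action.
neighbors = np.arange(1, N_AGENTS)
mean_action = np.eye(N_ACTIONS)[actions[neighbors]].mean(axis=0)

policy = boltzmann(q_values(state, mean_action))
new_action = rng.choice(N_ACTIONS, p=policy)
print(mean_action, policy, new_action)
```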

Policy gradient methods are often applied to reinforcement learning in continuous multiagent games. These methods perform local search in the joint-action space and, as we show, are susceptible to a game-theoretic pathology known as relative overgeneralization. To resolve this issue, we propose Multiagent Soft Q-learning, which can be seen as the analogue of applying Q-learning to continuous control. We compare our method to MADDPG, a state-of-the-art approach, and show that our method achieves better coordination in multiagent cooperative tasks, converging to better local optima in the joint action space.
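
The key quantity in soft Q-learning is the soft (log-sum-exp) value of a state. The toy NumPy example below, with an invented two-mode joint-action landscape, estimates that value by sampling joint actions uniformly; it is meant only to show how the softmax over joint actions keeps weight on a narrow, jointly good mode rather than averaging it away.

```python
# Illustrative estimate of the soft value V(s) = alpha * log E_a[exp(Q(s,a)/alpha)]
# for a toy two-agent Q with a narrow high peak and a wide lower one.
import numpy as np

rng = np.random.default_rng(0)
alpha, n_samples = 0.2, 4096

def q(joint_action):
    a1, a2 = joint_action[..., 0], joint_action[..., 1]
    narrow = 5.0 * np.exp(-((a1 - 0.8) ** 2 + (a2 - 0.8) ** 2) / 0.01)
    wide = 2.0 * np.exp(-((a1 + 0.5) ** 2 + (a2 + 0.5) ** 2) / 0.5)
    return narrow + wide

samples = rng.uniform(-1.0, 1.0, size=(n_samples, 2))   # sampled joint actions
q_vals = q(samples)

# Soft (log-sum-exp) value vs. the hard max and the plain average:
soft_v = alpha * np.log(np.mean(np.exp(q_vals / alpha)))
print(soft_v, q_vals.max(), q_vals.mean())
```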

We propose a fully distributed actor-critic algorithm approximated by deep neural networks, named Diff-DAC, with application to single-task and average multitask reinforcement learning (MRL). Each agent has access to data from its local task only, but it aims to learn a policy that performs well on average over the whole set of tasks. During the learning process, agents communicate their value-policy parameters to their neighbors, diffusing the information across the network, so that they converge to a common policy with no need for a central node. The method is scalable, since the computational and communication costs per agent grow only with its number of neighbors. We derive Diff-DAC from duality theory and provide novel insights into the standard actor-critic framework, showing that it is actually an instance of the dual ascent method that approximates the solution of a linear program. Experiments suggest that Diff-DAC can outperform the only previous distributed MRL approach (i.e., Dist-MTLPS) and even the centralized architecture.
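
The diffusion pattern can be sketched in a few lines of NumPy (a schematic example with quadratic stand-in objectives, not Diff-DAC itself): each agent takes a local gradient step ("adapt") and then averages its parameters with its graph neighbors ("combine"), so the network drifts toward a common policy without a central node.

```python
# Schematic adapt-then-combine diffusion over a ring of agents.
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, PARAM_DIM, LR = 5, 10, 0.05

# Ring topology with uniform combination weights (rows sum to 1).
C = np.zeros((N_AGENTS, N_AGENTS))
for i in range(N_AGENTS):
    for j in ((i - 1) % N_AGENTS, (i + 1) % N_AGENTS):
        C[i, j] = 1.0 / 3.0
    C[i, i] = 1.0 - C[i].sum()

params = rng.standard_normal((N_AGENTS, PARAM_DIM))     # one vector per agent
targets = rng.standard_normal((N_AGENTS, PARAM_DIM))    # stand-in local tasks

for step in range(200):
    grads = params - targets          # adapt: local quadratic objective gradient
    adapted = params - LR * grads
    params = C @ adapted              # combine: diffuse parameters to neighbors

print("cross-agent spread:", float(params.std(axis=0).mean()))
print("distance to average target:",
      float(np.linalg.norm(params.mean(axis=0) - targets.mean(axis=0))))
```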

We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects. IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1. For sample questions and results, please view our video: https://youtu.be/pXd3C-1jr98.
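
The factorized-controller idea can be illustrated with a toy dispatch loop (purely hypothetical, not HIMN): a high-level planner picks a sub-task, and the corresponding low-level controller emits several primitive actions before control returns to the planner.

```python
# Toy sketch of hierarchical control with temporal abstraction; all controllers
# and the environment state are invented stand-ins.
import random

random.seed(0)

def navigate(env_state):   return ["move_forward", "turn_left"]
def manipulate(env_state): return ["open_fridge"]
def answer(env_state):     return ["say_yes" if env_state["apples_seen"] else "say_no"]

CONTROLLERS = {"nav": navigate, "manip": manipulate, "ans": answer}

def planner(env_state):
    # Stand-in high-level policy: explore, then interact, then answer.
    if env_state["steps"] < 4:       return "nav"
    if not env_state["fridge_open"]: return "manip"
    return "ans"

state = {"steps": 0, "fridge_open": False, "apples_seen": True}
trace = []
while True:
    subtask = planner(state)
    primitive_actions = CONTROLLERS[subtask](state)   # runs for several steps
    trace.extend(primitive_actions)
    state["steps"] += len(primitive_actions)
    if subtask == "manip": state["fridge_open"] = True
    if subtask == "ans":   break
print(trace)
```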

The best summary of a long video differs from person to person because of the task's highly subjective nature. Even for the same person, the best summary may change with time or mood. In this paper, we introduce the task of generating customized video summaries through simple text. First, we train a deep architecture to learn semantic embeddings of video frames by leveraging the abundance of image-caption data in a progressive and residual manner. Given a user-specific text description, our algorithm selects semantically relevant video segments and produces a temporally aligned video summary. To evaluate our textually customized video summaries, we conduct experimental comparisons with baseline methods that utilize ground-truth information. Despite these challenging baselines, our method achieves comparable or even better performance. We also show that our method can generate semantically diverse video summaries using only the learned visual embeddings.
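
Once segments and text share an embedding space, the selection step reduces to similarity scoring. The NumPy sketch below uses random stand-in embeddings (not the learned ones from the paper) to show the idea: score segments against the text embedding and keep the top-scoring ones in temporal order.

```python
# Illustrative text-conditioned segment selection with stand-in embeddings.
import numpy as np

rng = np.random.default_rng(0)
n_segments, dim, k = 40, 128, 5

segment_emb = rng.standard_normal((n_segments, dim))   # visual embeddings
text_emb = rng.standard_normal(dim)                    # user text embedding

def cosine(a, b):
    return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

scores = cosine(segment_emb, text_emb)                 # relevance per segment
chosen = np.sort(np.argsort(scores)[-k:])              # top-k, kept in time order
print(chosen, np.round(scores[chosen], 3))
```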

Deep reinforcement learning (RL) methods generally engage in exploratory behavior through noise injection in the action space. An alternative is to add noise directly to the agent's parameters, which can lead to more consistent exploration and a richer set of behaviors. Methods such as evolution strategies use parameter perturbations, but discard all temporal structure in the process and require significantly more samples. Combining parameter noise with traditional RL methods lets us get the best of both worlds. We demonstrate that both off- and on-policy methods benefit from this approach through an experimental comparison of DQN, DDPG, and TRPO on high-dimensional discrete-action environments as well as continuous control tasks. Our results show that RL with parameter noise learns more efficiently than both traditional RL with action-space noise and evolution strategies.
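
A minimal NumPy sketch of parameter-space exploration, assuming a linear policy and a simple multiplicative rule for adapting the noise scale (the paper's exact adaptation scheme may differ): weights are perturbed once per episode, and the noise scale is adjusted according to the induced change in actions.

```python
# Illustrative parameter-space exploration with adaptive noise scale.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM = 6, 2
sigma, target_action_change = 0.1, 0.2

theta = rng.standard_normal((ACT_DIM, OBS_DIM)) * 0.1   # linear policy weights

def act(weights, obs):
    return np.tanh(weights @ obs)

for episode in range(5):
    perturbed = theta + sigma * rng.standard_normal(theta.shape)  # once per episode
    obs = rng.standard_normal(OBS_DIM)

    # Measure how far the perturbed policy's actions drift from the unperturbed
    # ones, and adapt sigma toward a target drift (multiplicative rule).
    drift = np.linalg.norm(act(perturbed, obs) - act(theta, obs))
    sigma *= 1.01 if drift < target_action_change else 1.0 / 1.01
    print(f"episode {episode}: drift={drift:.3f}, sigma={sigma:.3f}")
```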

Related papers

Q-value Path Decomposition for Deep Multiagent Reinforcement Learning
Yaodong Yang, Jianye Hao, Guangyong Chen, Hongyao Tang, Yingfeng Chen, Yujing Hu, Changjie Fan, Zhongyu Wei
10 February 2020

Learning to Coordinate Multiple Reinforcement Learning Agents for Diverse Query Reformulation
Rodrigo Nogueira, Jannis Bulian, Massimiliano Ciaramita
27 September 2018

FuzzerGym: A Competitive Framework for Fuzzing and Learning
William Drozd, Michael D. Wagner
19 July 2018

A Multi-Objective Deep Reinforcement Learning Framework
Thanh Thi Nguyen
27 June 2018

Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, Jun Wang
12 June 2018

Ermo Wei, Drew Wicke, David Freelan, Sean Luke
25 April 2018

Sergio Valcarcel Macua, Aleksi Tukiainen, Daniel García-Ocaña Hernández, David Baldazo, Enrique Munoz de Cote, Santiago Zazo
22 April 2018

Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, Ali Farhadi
5 April 2018

Jinsoo Choi, Tae-Hyun Oh, In So Kweon
1 March 2018

Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, Marcin Andrychowicz
31 January 2018