IJCAI2021人工智能顶会强化学习论文中英对照简单整理(30篇)

主要内容(Main Track):

Agent-based and Multi-agent Systems

1.Mean Field Games Flock! The Reinforcement Learning Way

We present a method enabling a large number of agents to learn how to flock. This problem has drawn a lot of interest but requires many structural assumptions and is tractable only in small dimensions. We phrase this problem as a Mean Field Game (MFG), where each individual chooses its own acceleration depending on the population behavior. Combining Deep Reinforcement Learning (RL) and Normalizing Flows (NF), we obtain a tractable solution requiring only very weak assumptions. Our algorithm finds a Nash Equilibrium and the agents adapt their velocity to match the neighboring flock’s average one. We use Fictitious Play and alternate: (1) computing an approximate best response with Deep RL, and (2) estimating the next population distribution with NF. We show numerically that our algorithm can learn multi-group or high-dimensional flocking with obstacles.

我们提出了一种方法，使大量智能体学会如何集群（flocking）。这个问题备受关注，但通常需要许多结构性假设，且只在低维情形下可解。我们将该问题表述为平均场博弈（MFG），其中每个个体根据群体行为选择自己的加速度。结合深度强化学习（RL）与归一化流（NF），我们在非常弱的假设下得到了可求解的方案。我们的算法能找到纳什均衡，智能体会调整自身速度以匹配邻近群体的平均速度。我们采用虚拟博弈（Fictitious Play）并交替执行：(1) 用深度RL计算近似最佳响应；(2) 用NF估计下一步的群体分布。数值实验表明，该算法可以学习多群体或带障碍物的高维集群行为。
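下面给出虚拟博弈交替流程的一个极简示意（基于对摘要的理解写成的玩具代码，并非论文实现）：深度RL最佳响应与归一化流密度估计分别用简单的占位函数代替，只保留“(1) 计算最佳响应、(2) 更新群体分布”的交替结构。

```python
# 虚拟博弈交替流程的玩具示意（非论文代码）
import numpy as np

rng = np.random.default_rng(0)

def approximate_best_response(population_mean_velocity):
    # 占位：代替“用深度RL计算近似最佳响应”，这里直接让速度向群体平均速度靠拢
    return lambda v: v + 0.5 * (population_mean_velocity - v)

def estimate_population(velocities):
    # 占位：代替“用归一化流(NF)估计群体分布”，这里只保留经验均值和协方差
    return velocities.mean(axis=0), np.cov(velocities.T)

velocities = rng.normal(size=(256, 2))                        # 256个智能体的二维速度
for _ in range(20):                                           # 虚拟博弈迭代
    mean_v, _ = estimate_population(velocities)               # (2) 当前群体分布
    policy = approximate_best_response(mean_v)                # (1) 对该分布的近似最佳响应
    velocities = np.array([policy(v) for v in velocities])    # 按新策略推进群体

print("迭代后的速度标准差:", velocities.std(axis=0))
```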

2.Reducing Bus Bunching with Asynchronous Multi-Agent Reinforcement Learning

The bus system is a critical component of sustainable urban transportation. However, due to the significant uncertainties in passenger demand and traffic conditions, bus operation is unstable in nature and bus bunching has become a common phenomenon that undermines the reliability and efficiency of bus services. Despite recent advances in multi-agent reinforcement learning (MARL) on traffic control, little research has focused on bus fleet control due to the tricky asynchronous characteristic---control actions only happen when a bus arrives at a bus stop and thus agents do not act simultaneously. In this study, we formulate route-level bus fleet control as an asynchronous multi-agent reinforcement learning (ASMR) problem and extend the classical actor-critic architecture to handle the asynchronous issue. Specifically, we design a novel critic network to effectively approximate the marginal contribution for other agents, in which graph attention neural network is used to conduct inductive learning for policy evaluation. The critic structure also helps the ego agent optimize its policy more efficiently. We evaluate the proposed framework on real-world bus services and actual passenger demand derived from smart card data. Our results show that the proposed model outperforms both traditional headway-based control methods and existing MARL methods.

公交系统是可持续城市交通的重要组成部分。然而，由于客流需求和交通状况存在显著不确定性，公交运营本质上并不稳定，公交串车（bus bunching）已成为损害公交服务可靠性和效率的普遍现象。尽管多智能体强化学习（MARL）近来在交通控制方面取得了进展，但由于其棘手的异步特性（控制动作只在公交车到站时发生，因此智能体并不同时行动），针对公交车队控制的研究仍然很少。在本研究中，我们将线路级公交车队控制表述为异步多智能体强化学习（ASMR）问题，并扩展经典的演员-评论家（actor-critic）架构来处理异步问题。具体来说，我们设计了一个新颖的评论家网络来有效近似其他智能体的边际贡献，其中使用图注意力神经网络进行归纳学习以完成策略评估。该评论家结构也有助于自身智能体更高效地优化其策略。我们在真实公交线路和由智能卡数据推得的实际客流需求上评估了所提框架。结果表明，所提模型优于传统的基于车头时距（headway）的控制方法以及现有的MARL方法。
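下面用一个事件驱动的小例子来说明摘要所述的“异步”设定（纯示意代码，与论文的系统无关）：每辆公交车只有在到站时才产生一次决策，因此各智能体的动作并不同时发生。

```python
# 异步决策的事件驱动示意：动作只在“公交车到站”事件发生时产生（玩具示例）
import heapq
import random

random.seed(0)
NUM_BUSES, HORIZON = 4, 50.0

def policy(bus_id, state):
    # 占位：代替演员(actor)网络，这里随机给出一个站点驻留(holding)时间
    return random.uniform(0.0, 2.0)

# 事件队列存放 (到站时间, 车辆编号)，每弹出一个事件就是一个异步决策点
events = [(random.uniform(0, 5), b) for b in range(NUM_BUSES)]
heapq.heapify(events)

while events:
    t, bus = heapq.heappop(events)
    if t > HORIZON:
        break
    holding = policy(bus, state={"time": t})            # 只有到站的这辆车需要决策
    next_arrival = t + holding + random.uniform(3, 7)   # 驶向下一站的行驶时间
    heapq.heappush(events, (next_arrival, bus))
```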

3.Data-Efficient Reinforcement Learning for Malaria Control

Sequential decision-making under cost-sensitive tasks is prohibitively daunting, especially for the problem that has a significant impact on people's daily lives, such as malaria control, treatment recommendation. The main challenge faced by policymakers is to learn a policy from scratch by interacting with a complex environment in a few trials. This work introduces a practical, data-efficient policy learning method, named Variance-Bonus Monte Carlo Tree Search (VB-MCTS), which can cope with very little data and facilitate learning from scratch in only a few trials. Specifically, the solution is a model-based reinforcement learning method. To avoid model bias, we apply Gaussian Process (GP) regression to estimate the transitions explicitly. With the GP world model, we propose a variance-bonus reward to measure the uncertainty about the world. Adding the reward to the planning with MCTS can result in more efficient and effective exploration. Furthermore, the derived polynomial sample complexity indicates that VB-MCTS is sample efficient. Finally, outstanding performance on a competitive world-level RL competition and extensive experimental results verify its advantage over the state-of-the-art on the challenging malaria control task.

在成本敏感的任务中进行序贯决策极其困难，尤其是对疟疾控制、治疗推荐等对人们日常生活有重大影响的问题。决策者面临的主要挑战是：只通过少数几次试验与复杂环境交互，就要从零开始学到一个策略。本文提出了一种实用、数据高效的策略学习方法，名为方差奖励蒙特卡洛树搜索（VB-MCTS），它能够在数据极少的情况下工作，只需少量试验即可从零开始学习。具体而言，该方案是一种基于模型的强化学习方法。为避免模型偏差，我们使用高斯过程（GP）回归来显式估计状态转移。基于GP世界模型，我们提出了方差奖励（variance-bonus reward）来度量对环境的不确定性；将该奖励加入基于MCTS的规划中，可以带来更高效且更有效的探索。此外，推导出的多项式样本复杂度表明VB-MCTS具有样本效率。最后，在一项竞争激烈的世界级RL竞赛中的出色表现以及大量实验结果验证了该方法在具有挑战性的疟疾控制任务上优于现有最佳方法。
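下面是“GP世界模型 + 方差奖励”这一思想的一个极简草图（基于对摘要的理解，参数与细节均为假设，并非VB-MCTS的实现），用 sklearn 的高斯过程回归给出的预测标准差充当探索奖励：

```python
# 方差奖励示意：环境奖励 + beta * GP预测标准差（假设性示例，非论文代码）
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 2))                 # 已观测到的 (状态, 动作) 对
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)      # 已观测到的下一状态的某一维

gp = GaussianProcessRegressor().fit(X, y)            # 显式估计转移的GP世界模型

def variance_bonus_reward(state_action, env_reward, beta=1.0):
    # 预测标准差越大，说明模型对该处越不确定，探索奖励越高
    _, std = gp.predict(state_action.reshape(1, -1), return_std=True)
    return env_reward + beta * float(std[0])

print(variance_bonus_reward(np.array([0.0, 0.0]), env_reward=1.0))
```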

AI Ethics, Trust, Fairness

4.Multi-Objective Reinforcement Learning for Designing Ethical Environments

AI research is being challenged with ensuring that autonomous agents learn to behave ethically, namely in alignment with moral values. A common approach, founded on the exploitation of Reinforcement Learning techniques, is to design environments that incentivise agents to behave ethically. However, to the best of our knowledge, current approaches do not theoretically guarantee that an agent will learn to behave ethically. Here, we make headway along this direction by proposing a novel way of designing environments wherein it is formally guaranteed that an agent learns to behave ethically while pursuing its individual objectives. Our theoretical results develop within the formal framework of Multi-Objective Reinforcement Learning to ease the handling of an agent's individual and ethical objectives. As a further contribution, we leverage on our theoretical results to introduce an algorithm that automates the design of ethical environments.

人工智能研究面临的一项挑战是确保自主智能体学会合乎伦理的行为，即与道德价值观保持一致。一种常见做法是利用强化学习技术来设计能激励智能体做出合乎伦理行为的环境。然而，据我们所知，现有方法在理论上并不能保证智能体一定会学会合乎伦理的行为。在本文中，我们沿这一方向取得进展：提出了一种新的环境设计方式，可以形式化地保证智能体在追求自身目标的同时学会合乎伦理的行为。我们的理论结果建立在多目标强化学习的形式化框架之内，以便于处理智能体的个体目标与伦理目标。作为进一步的贡献，我们利用这些理论结果提出了一种自动化设计伦理环境的算法。

Knowledge Representation and Reasoning

5.Efficient PAC Reinforcement Learning in Regular Decision Processes

Recently regular decision processes have been proposed as a well-behaved form of non-Markov decision process. Regular decision processes are characterised by a transition function and a reward function that depend on the whole history, though regularly (as in regular languages). In practice both the transition and the reward functions can be seen as finite transducers. We study reinforcement learning in regular decision processes. Our main contribution is to show that a near-optimal policy can be PAC-learned in polynomial time in a set of parameters that describe the underlying decision process. We argue that the identified set of parameters is minimal and it reasonably captures the difficulty of a regular decision process.

最近，正则决策过程（regular decision process）被提出，作为一类性质良好的非马尔可夫决策过程。正则决策过程的特点是其转移函数和奖励函数依赖于完整历史，但这种依赖是正则的（如同正则语言）。在实践中，转移函数和奖励函数都可以看作有限状态转换器（finite transducer）。我们研究正则决策过程中的强化学习。我们的主要贡献是证明：可以在关于一组刻画底层决策过程的参数的多项式时间内，以PAC方式学到近似最优策略。我们论证了所识别的这组参数是极小的，并且它合理地刻画了正则决策过程的难度。

Machine Learning

6.Deep Reinforcement Learning for Navigation in AAA Video Games

In video games, non-player characters (NPCs) are used to enhance the players' experience in a variety of ways, e.g., as enemies, allies, or innocent bystanders. A crucial component of NPCs is navigation, which allows them to move from one point to another on the map. The most popular approach for NPC navigation in the video game industry is to use a navigation mesh (NavMesh), which is a graph representation of the map, with nodes and edges indicating traversable areas. Unfortunately, complex navigation abilities that extend the character's capacity for movement, e.g., grappling hooks, jetpacks, teleportation, or double-jumps, increase the complexity of the NavMesh, making it intractable in many practical scenarios. Game designers are thus constrained to only add abilities that can be handled by a NavMesh. As an alternative to the NavMesh, we propose to use Deep Reinforcement Learning (Deep RL) to learn how to navigate 3D maps in video games using any navigation ability. We test our approach on complex 3D environments that are notably an order of magnitude larger than maps typically used in the Deep RL literature. One of these environments is from a recently released AAA video game called Hyper Scape. We find that our approach performs surprisingly well, achieving at least 90% success rate in a variety of scenarios using complex navigation abilities.

在电子游戏中，非玩家角色（NPC）以多种方式增强玩家体验，例如充当敌人、盟友或无辜的旁观者。NPC的一个关键组成部分是导航，使其能够在地图上从一点移动到另一点。游戏行业中最流行的NPC导航方法是使用导航网格（NavMesh），即地图的图表示，其节点和边表示可通行区域。不幸的是，抓钩、喷气背包、传送、二段跳等扩展角色移动能力的复杂导航能力会增加NavMesh的复杂度，使其在许多实际场景中难以处理。游戏设计师因此只能添加NavMesh能够处理的能力。作为NavMesh的替代方案，我们提出使用深度强化学习（Deep RL）来学习如何在视频游戏的3D地图中使用任意导航能力进行导航。我们在复杂的3D环境中测试了该方法，这些环境比深度RL文献中常用的地图大了约一个数量级。其中一个环境来自最近发布的AAA游戏《Hyper Scape》。我们发现该方法表现出奇地好，在使用复杂导航能力的多种场景中取得了至少90%的成功率。
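作为背景补充，下面用 networkx 展示摘要里提到的 NavMesh 思想：把可通行区域表示为图的节点与带权边，把导航归结为最短路查询（仅为概念示意，节点名称均为虚构，与游戏引擎中的实现无关）。

```python
# NavMesh 概念示意：可通行区域 = 节点，连接关系 = 带权边，导航 = 最短路
import networkx as nx

navmesh = nx.Graph()
navmesh.add_weighted_edges_from([
    ("spawn", "corridor", 5.0),
    ("corridor", "bridge", 3.0),
    ("bridge", "roof", 4.0),
    ("corridor", "roof", 9.0),   # 直接相连但代价更高的可通行区域
])

path = nx.shortest_path(navmesh, "spawn", "roof", weight="weight")
print(path)   # ['spawn', 'corridor', 'bridge', 'roof']
```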

7.Verifying Reinforcement Learning up to Infinity

Formally verifying that reinforcement learning systems act safely is increasingly important, but existing methods only verify over finite time. This is of limited use for dynamical systems that run indefinitely. We introduce the first method for verifying the time-unbounded safety of neural networks controlling dynamical systems. We develop a novel abstract interpretation method which, by constructing adaptable template-based polyhedra using MILP and interval arithmetic, yields sound---safe and invariant---overapproximations of the reach set. This provides stronger safety guarantees than previous time-bounded methods and shows whether the agent has generalised beyond the length of its training episodes. Our method supports ReLU activation functions and systems with linear, piecewise linear and non-linear dynamics defined with polynomial and transcendental functions. We demonstrate its efficacy on a range of benchmark control problems.

形式化地验证强化学习系统能够安全运行变得越来越重要，但现有方法只能在有限时间范围内进行验证，这对于无限期运行的动态系统用处有限。我们提出了第一种验证神经网络控制的动态系统在无限时间范围内安全性的方法。我们开发了一种新的抽象解释方法：通过使用MILP和区间算术构造可自适应的基于模板的多面体，得到可达集的可靠（即安全且不变的）过近似。这比以往限定时间范围的方法提供了更强的安全保证，并能表明智能体是否已泛化到超出其训练回合长度的范围。我们的方法支持ReLU激活函数，以及具有线性、分段线性和由多项式与超越函数定义的非线性动力学的系统。我们在一系列基准控制问题上展示了其有效性。

8.Robustly Learning Composable Options in Deep Reinforcement Learning

Hierarchical reinforcement learning (HRL) is only effective for long-horizon problems when high-level skills can be reliably sequentially executed. Unfortunately, learning reliably composable skills is difficult, because all the components of every skill are constantly changing during learning. We propose three methods for improving the composability of learned skills: representing skill initiation regions using a combination of pessimistic and optimistic classifiers; learning re-targetable policies that are robust to non-stationary subgoal regions; and learning robust option policies using model-based RL. We test these improvements on four sparse-reward maze navigation tasks involving a simulated quadrupedal robot. Each method successively improves the robustness of a baseline skill discovery method, substantially outperforming state-of-the-art flat and hierarchical methods.

分层强化学习（HRL）只有在高层技能能够可靠地按顺序执行时，才能有效解决长视界问题。不幸的是，学习可靠可组合的技能很困难，因为在学习过程中每个技能的所有组成部分都在不断变化。我们提出了三种提高所学技能可组合性的方法：用悲观与乐观分类器的组合来表示技能的启动区域；学习对非平稳子目标区域鲁棒的可重定向策略；以及使用基于模型的RL学习鲁棒的选项（option）策略。我们在涉及模拟四足机器人的四个稀疏奖励迷宫导航任务上测试了这些改进。每种方法都依次提升了基线技能发现方法的鲁棒性，显著优于最先进的扁平方法和分层方法。
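下面是“用悲观分类器与乐观分类器共同表示技能启动区域”这一思想的一个高度简化示意（训练数据的划分方式和组合规则都是为说明而假设的，与论文的具体做法可能不同）：

```python
# 悲观/乐观启动分类器的玩具示意（组合规则为假设，仅作说明）
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
states = rng.uniform(-1, 1, size=(200, 2))
succeeded = states[:, 0] > 0.2        # 玩具设定：技能在这些状态上执行成功
attempted = states[:, 0] > -0.2       # 更宽松的区域：技能曾在这些状态上被尝试

pessimistic = SVC().fit(states, succeeded)   # 悲观：只信任确实成功过的区域
optimistic = SVC().fit(states, attempted)    # 乐观：把尝试过的区域也算进来

def can_initiate(state, eps=0.2):
    # 悲观分类器同意则直接启动；否则以小概率信任乐观分类器，以便继续发现新的启动状态
    s = state.reshape(1, -1)
    if pessimistic.predict(s)[0]:
        return True
    return bool(optimistic.predict(s)[0]) and rng.random() < eps

print(can_initiate(np.array([0.5, 0.0])), can_initiate(np.array([-0.5, 0.0])))
```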

9.Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a First-person Simulated 3D Environment

Learning how to execute complex tasks involving multiple objects in a 3D world is challenging when there is no ground-truth information about the objects or any demonstration to learn from. When an agent only receives a signal from task-completion, this makes it challenging to learn the object-representations which support learning the correct object-interactions needed to complete the task. In this work, we formulate learning an attentive object dynamics model as a classification problem, using random object-images to define incorrect labels for our object-dynamics model. We show empirically that this enables object-representation learning that captures an object's category (is it a toaster?), its properties (is it on?), and object-relations (is something inside of it?). With this, our core learner (a relational RL agent) receives the dense training signal it needs to rapidly learn object-interaction tasks. We demonstrate results in the 3D AI2Thor simulated kitchen environment with a range of challenging food preparation tasks. We compare our method's performance to several related approaches and against the performance of an oracle: an agent that is supplied with ground-truth information about objects in the scene. We find that our agent achieves performance closest to the oracle in terms of both learning speed and maximum success rate.

当既没有关于物体的真实标注信息（ground truth）、也没有可供学习的示范时，学习如何在3D世界中执行涉及多个物体的复杂任务非常具有挑战性。当智能体只能从任务完成中获得信号时，学习能够支撑正确物体交互（完成任务所必需）的物体表示就变得困难。在这项工作中，我们将学习一个带注意力的物体动力学模型表述为一个分类问题，利用随机物体图像为该物体动力学模型定义错误标签。我们的实验表明，这使得物体表示学习能够捕获物体的类别（它是烤面包机吗？）、属性（它开着吗？）以及物体间关系（它里面有东西吗？）。借助这一点，我们的核心学习器（一个关系型RL智能体）获得了快速学习物体交互任务所需的密集训练信号。我们在3D AI2Thor模拟厨房环境中的一系列具有挑战性的备餐任务上展示了结果。我们将该方法与若干相关方法进行比较，并与一个oracle（可获得场景中物体真实标注信息的智能体）的表现进行对比。我们发现，无论在学习速度还是最高成功率上，我们的智能体都取得了最接近oracle的性能。

10.Deep Reinforcement Learning for Multi-contact Motion Planning of Hexapod Robots

Legged locomotion in a complex environment requires careful planning of the footholds of legged robots. In this paper, a novel Deep Reinforcement Learning (DRL) method is proposed to implement multi-contact motion planning for hexapod robots moving on uneven plum-blossom piles. First, the motion of hexapod robots is formulated as a Markov Decision Process (MDP) with a specified reward function. Second, a transition feasibility model is proposed for hexapod robots, which describes the feasibility of the state transition under the condition of satisfying kinematics and dynamics, and in turn determines the rewards. Third, the footholds and Center-of-Mass (CoM) sequences are sampled from a diagonal Gaussian distribution and the sequences are optimized through learning the optimal policies using the designed DRL algorithm. Both of the simulation and experimental results on physical systems demonstrate the feasibility and efficiency of the proposed method. Videos are shown at videoviewpage.wixsite.com.

足式机器人在复杂环境中运动需要仔细规划落脚点。本文提出了一种新的深度强化学习（DRL）方法，用于实现六足机器人在不平整梅花桩上移动时的多接触运动规划。首先，将六足机器人的运动表述为带有特定奖励函数的马尔可夫决策过程（MDP）。其次，提出了六足机器人的转移可行性模型，描述在满足运动学与动力学约束条件下状态转移的可行性，并据此确定奖励。第三，从对角高斯分布中采样落脚点和质心（CoM）序列，并通过所设计的DRL算法学习最优策略来优化这些序列。物理系统上的仿真与实验结果均证明了所提方法的可行性和有效性。视频见 videoviewpage.wixsite.com。

11.Model-Based Reinforcement Learning for Infinite-Horizon Discounted Constrained Markov Decision Processes

In many real-world reinforcement learning (RL) problems, in addition to maximizing the objective, the learning agent has to maintain some necessary safety constraints. We formulate the problem of learning a safe policy as an infinite-horizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two model-based constrained reinforcement learning (CRL) algorithms for learning a safe policy, namely, (i) GM-CRL algorithm, where the algorithm has access to a generative model, and (ii) UC-CRL algorithm, where the algorithm learns the model using an upper confidence style online exploration method. We characterize the sample complexity of these algorithms, i.e., the number of samples needed to ensure a desired level of accuracy with high probability, both with respect to objective maximization and constraint satisfaction.

在许多现实世界的强化学习（RL）问题中，学习智能体除了最大化目标外，还必须满足一些必要的安全约束。我们将学习安全策略的问题表述为转移概率矩阵未知的无限时域折扣约束马尔可夫决策过程（CMDP），其中安全要求被建模为对期望累积成本的约束。我们提出了两种用于学习安全策略的基于模型的约束强化学习（CRL）算法：(i) GM-CRL算法，假设算法可以访问一个生成模型；(ii) UC-CRL算法，算法通过一种上置信界风格的在线探索方法学习模型。我们刻画了这些算法在目标最大化和约束满足两方面的样本复杂度，即以高概率达到期望精度所需的样本数量。

12.Reinforcement Learning for Route Optimization with Robustness Guarantees

Application of deep learning to NP-hard combinatorial optimization problems is an emerging research trend, and a number of interesting approaches have been published over the last few years. In this work we address robust optimization, which is a more complex variant where a max-min problem is to be solved. We obtain robust solutions by solving the inner minimization problem exactly and apply Reinforcement Learning to learn a heuristic for the outer problem. The minimization term in the inner objective represents an obstacle to existing RL-based approaches, as its value depends on the full solution in a non-linear manner and cannot be evaluated for partial solutions constructed by the agent over the course of each episode. We overcome this obstacle by defining the reward in terms of the one-step advantage over a baseline policy whose role can be played by any fast heuristic for the given problem. The agent is trained to maximize the total advantage, which, as we show, is equivalent to the original objective. We validate our approach by solving min-max versions of standard benchmarks for the Capacitated Vehicle Routing and the Traveling Salesperson Problem, where our agents obtain near-optimal solutions and improve upon the baselines.

将深度学习应用于NP难的组合优化问题是一个新兴的研究趋势，过去几年已发表了许多有趣的方法。在这项工作中，我们研究鲁棒优化，这是一个更复杂的变体，需要求解一个max-min问题。我们通过精确求解内层最小化问题来获得鲁棒解，并应用强化学习为外层问题学习一个启发式方法。内层目标中的最小化项是现有基于RL方法的一个障碍：其值以非线性方式依赖于完整解，无法对智能体在每个回合中逐步构造的部分解进行评估。我们通过将奖励定义为相对于某个基线策略的单步优势来克服这一障碍，而任何针对该问题的快速启发式算法都可以充当该基线策略。智能体被训练以最大化总优势，正如我们所证明的，这等价于原始目标。我们在带容量约束的车辆路径问题（CVRP）和旅行商问题（TSP）标准基准的min-max版本上验证了我们的方法，智能体获得了接近最优的解并改进了基线。
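摘要中“以相对基线策略的单步优势作为奖励”的思路，可以用下面这个玩具例子来说明（按我们对摘要的理解，用一个简化的TSP构造过程和最近邻启发式作基线；并非论文代码，也未包含内层min问题）：

```python
# 单步优势奖励示意：基线补全代价在动作前后的差值（玩具TSP例子，非论文实现）
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(size=(10, 2))            # 10个城市的坐标

def tour_length(order):
    pts = coords[list(order)]
    return float(np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1).sum())

def baseline_complete(partial):
    # 快速基线启发式：用最近邻规则把部分解补全成完整回路
    tour = list(partial)
    remaining = [c for c in range(len(coords)) if c not in partial]
    while remaining:
        last = coords[tour[-1]]
        nxt = min(remaining, key=lambda c: np.linalg.norm(coords[c] - last))
        tour.append(nxt)
        remaining.remove(nxt)
    return tour

def one_step_advantage_reward(partial, action):
    # 奖励 = 动作执行前的基线补全代价 - 动作执行后的基线补全代价
    before = tour_length(baseline_complete(partial))
    after = tour_length(baseline_complete(partial + [action]))
    return before - after

print(one_step_advantage_reward([0], action=3))
```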

13.Average-Reward Reinforcement Learning with Trust Region Methods


Most of reinforcement learning algorithms optimize the discounted criterion which is beneficial to accelerate the convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as financial related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. Firstly, we develop a unified trust region theory with discounted and average criteria. With the average criterion, a novel performance bound within the trust region is derived with the Perturbation Analysis (PA) theory. Secondly, we propose a practical algorithm named Average Policy Optimization (APO), which improves the value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first one to study the trust region approach with the average criterion and it complements the framework of reinforcement learning beyond the discounted criterion. Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach.

大多数强化学习算法优化折扣准则，这有利于加速收敛并降低估计的方差。虽然折扣准则适用于某些任务（如金融相关问题），但许多工程问题对未来奖励一视同仁，更偏好长期平均准则。本文研究长期平均准则下的强化学习问题。首先，我们建立了同时涵盖折扣准则和平均准则的统一信赖域理论；在平均准则下，借助扰动分析（PA）理论推导出信赖域内的一个新的性能界。其次，我们提出了一种名为平均策略优化（APO）的实用算法，它通过一种名为平均值约束（Average Value Constraint）的新技术改进价值估计。据我们所知，我们的工作是第一个在平均准则下研究信赖域方法的工作，它将强化学习的框架拓展到折扣准则之外。最后，我们在连续控制环境MuJoCo中进行了实验。在大多数任务中，APO的表现优于折扣PPO，证明了该方法的有效性。
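作为对照，下面给出论文所讨论的两种准则的标准教科书定义（通用写法，并非APO特有的公式）：左边是折扣准则，右边是长期平均准则。

```latex
% 折扣准则（左）与长期平均准则（右）的标准定义
\rho_\gamma(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
\rho(\pi) = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{N-1} r(s_t, a_t)\right].
```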

14.Meta-Reinforcement Learning by Tracking Task Non-stationarity

Many real-world domains are subject to a structured non-stationarity which affects the agent's goals and the environmental dynamics. Meta-reinforcement learning (RL) has been shown successful for training agents that quickly adapt to related tasks. However, most of the existing meta-RL algorithms for non-stationary domains either make strong assumptions on the task generation process or require sampling from it at training time. In this paper, we propose a novel algorithm (TRIO) that optimizes for the future by explicitly tracking the task evolution through time. At training time, TRIO learns a variational module to quickly identify latent parameters from experience samples. This module is learned jointly with an optimal exploration policy that takes task uncertainty into account. At test time, TRIO tracks the evolution of the latent parameters online, hence reducing the uncertainty over future tasks and obtaining fast adaptation through the meta-learned policy. Unlike most existing methods, TRIO does not assume Markovian task-evolution processes, it does not require information about the non-stationarity at training time, and it captures complex changes undergoing in the environment. We evaluate our algorithm on different simulated problems and show it outperforms competitive baselines.

许多现实领域存在结构化的非平稳性，它会影响智能体的目标和环境动力学。元强化学习（meta-RL）已被证明能成功训练出可快速适应相关任务的智能体。然而，现有的大多数面向非平稳领域的元RL算法，要么对任务生成过程做出很强的假设，要么需要在训练时从该过程中采样。在本文中，我们提出了一种新算法（TRIO），通过显式跟踪任务随时间的演变来为未来进行优化。在训练时，TRIO学习一个变分模块，以便从经验样本中快速识别潜在参数；该模块与一个考虑任务不确定性的最优探索策略联合学习。在测试时，TRIO在线跟踪潜在参数的演变，从而降低对未来任务的不确定性，并通过元学习到的策略实现快速适应。与大多数现有方法不同，TRIO不假设任务演变过程是马尔可夫的，不需要在训练时获得关于非平稳性的信息，并且能够捕捉环境中发生的复杂变化。我们在不同的模拟问题上评估了该算法，结果表明其优于有竞争力的基线方法。

15.Multi-Agent Reinforcement Learning for Automated Peer-to-Peer Energy Trading in Double-Side Auction Market

With increasing prosumers employed with distributed energy resources (DER), advanced energy management has become increasingly important. To this end, integrating demand-side DER into electricity market is a trend for future smart grids. The double-side auction (DA) market is viewed as a promising peer-to-peer (P2P) energy trading mechanism that enables interactions among prosumers in a distributed manner. To achieve the maximum profit in a dynamic electricity market, prosumers act as price makers to simultaneously optimize their operations and trading strategies. However, the traditional DA market is difficult to be explicitly modelled due to its complex clearing algorithm and the stochastic bidding behaviors of the participants. For this reason, in this paper we model this task as a multi-agent reinforcement learning (MARL) problem and propose an algorithm called DA-MADDPG that is modified based on MADDPG by abstracting the other agents’ observations and actions through the DA market public information for each agent’s critic. The experiments show that 1) prosumers obtain more economic benefits in P2P energy trading w.r.t. the conventional electricity market independently trading with the utility company; and 2) DA-MADDPG performs better than the traditional Zero Intelligence (ZI) strategy and the other MARL algorithms, e.g., IQL, IDDPG, IPPO and MADDPG.

随着越来越多配备分布式能源资源（DER）的产消者（prosumer）出现，先进的能源管理变得日益重要。为此，将需求侧DER整合进电力市场是未来智能电网的趋势。双边拍卖（DA）市场被视为一种有前景的点对点（P2P）能源交易机制，能够以分布式方式实现产消者之间的互动。为了在动态电力市场中获取最大利润，产消者作为价格制定者，需要同时优化其运行与交易策略。然而，由于出清算法复杂且参与者的报价行为具有随机性，传统DA市场难以被显式建模。因此，本文将该任务建模为多智能体强化学习（MARL）问题，并提出一种称为DA-MADDPG的算法：它在MADDPG的基础上加以修改，利用DA市场的公开信息为每个智能体的评论家网络抽象其他智能体的观测与动作。实验表明：1) 相比在传统电力市场中与电力公司独立交易，产消者在P2P能源交易中获得了更多经济收益；2) DA-MADDPG的表现优于传统的零智能（ZI）策略以及IQL、IDDPG、IPPO和MADDPG等其他MARL算法。

16.Reinforcement Learning Based Sparse Black-box Adversarial Attack on Video Recognition Models

We explore the black-box adversarial attack on video recognition models. Attacks are only performed on selected key regions and key frames to reduce the high computation cost of searching adversarial perturbations on a video due to its high dimensionality. To select key frames, one way is to use heuristic algorithms to evaluate the importance of each frame and choose the essential ones. However, it is time inefficient on sorting and searching. In order to speed up the attack process, we propose a reinforcement learning based frame selection strategy. Specifically, the agent explores the difference between the original class and the target class of videos to make selection decisions. It receives rewards from threat models which indicate the quality of the decisions. Besides, we also use saliency detection to select key regions and only estimate the sign of gradient instead of the gradient itself in zeroth order optimization to further boost the attack process. We can use the trained model directly in the untargeted attack or with little fine-tune in the targeted attack, which saves computation time. A range of empirical results on real datasets demonstrate the effectiveness and efficiency of the proposed method.

我们研究针对视频识别模型的黑盒对抗攻击。由于视频维度很高，在整个视频上搜索对抗扰动的计算代价很大，因此攻击只在选定的关键区域和关键帧上进行。选择关键帧的一种方式是使用启发式算法评估每一帧的重要性并挑选关键帧，但其排序与搜索过程耗时且低效。为了加速攻击过程，我们提出了一种基于强化学习的帧选择策略。具体来说，智能体通过探索视频原始类别与目标类别之间的差异来做出选择决策，并从威胁模型获得反映决策质量的奖励。此外，我们还使用显著性检测来选择关键区域，并在零阶优化中只估计梯度的符号而非梯度本身，以进一步加速攻击过程。训练好的模型可以直接用于非目标攻击，或经少量微调后用于目标攻击，从而节省计算时间。在真实数据集上的一系列实证结果证明了所提方法的有效性和效率。
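摘要提到“在零阶优化中只估计梯度的符号”，下面是这一思想的一个通用示意（随机方向有限差分的朴素写法，步长与查询次数均为假设，论文使用的估计器细节可能不同）：

```python
# 零阶“梯度符号”估计示意：随机方向的有限差分取平均后只保留符号（通用写法，非论文实现）
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    # 占位：代替黑盒威胁模型给出的损失
    return float(np.sum((x - 1.0) ** 2))

def estimate_gradient_sign(x, delta=1e-2, queries=20):
    grad = np.zeros_like(x)
    for _ in range(queries):
        u = rng.choice([-1.0, 1.0], size=x.shape)   # 随机±1方向
        grad += (loss(x + delta * u) - loss(x - delta * u)) / (2 * delta) * u
    return np.sign(grad)                            # 只保留符号，而非完整梯度

x = rng.normal(size=5)
x -= 0.1 * estimate_gradient_sign(x)                # 用符号信息做一步下降
print(loss(x))
```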

17.Deep Reinforcement Learning Boosted Partial Domain Adaptation

Domain adaptation is critical for learning transferable features that effectively reduce the distribution difference among domains. In the era of big data, the availability of large-scale labeled datasets motivates partial domain adaptation (PDA) which deals with adaptation from large source domains to small target domains with less number of classes. In the PDA setting, it is crucial to transfer relevant source samples and eliminate irrelevant ones to mitigate negative transfer. In this paper, we propose a deep reinforcement learning based source data selector for PDA, which is capable of eliminating less relevant source samples automatically to boost existing adaptation methods. It determines to either keep or discard the source instances based on their feature representations so that more effective knowledge transfer across domains can be achieved via filtering out irrelevant samples. As a general module, the proposed DRL-based data selector can be integrated into any existing domain adaptation or partial domain adaptation models. Extensive experiments on several benchmark datasets demonstrate the superiority of the proposed DRL-based data selector which leads to state-of-the-art performance for various PDA tasks.

域适应对于学习可迁移特征、有效缩小域之间的分布差异至关重要。在大数据时代，大规模有标签数据集的可得性催生了部分域适应（PDA），即从类别较多的大型源域向类别较少的小型目标域进行适应。在PDA设定下，迁移相关的源样本并剔除不相关样本以减轻负迁移至关重要。本文提出了一种基于深度强化学习的PDA源数据选择器，能够自动剔除相关性较低的源样本，从而增强现有的适应方法。它根据源样本的特征表示决定保留或丢弃这些实例，通过过滤掉不相关样本来实现更有效的跨域知识迁移。作为一个通用模块，所提出的基于DRL的数据选择器可以集成到任何现有的域适应或部分域适应模型中。在多个基准数据集上的大量实验证明了该数据选择器的优越性，使各类PDA任务都达到了最先进的性能。

18.Non-decreasing Quantile Function Network with Efficient Exploration for Distributional Reinforcement Learning

Although distributional reinforcement learning (DRL) has been widely examined in the past few years, there are two open questions people are still trying to address. One is how to ensure the validity of the learned quantile function, the other is how to efficiently utilize the distribution information. This paper attempts to provide some new perspectives to encourage the future in-depth studies in these two fields. We first propose a non-decreasing quantile function network (NDQFN) to guarantee the monotonicity of the obtained quantile estimates and then design a general exploration framework called distributional prediction error (DPE) for DRL which utilizes the entire distribution of the quantile function. In this paper, we not only discuss the theoretical necessity of our method but also show the performance gain it achieves in practice by comparing with some competitors on Atari 2600 Games especially in some hard-explored games.

尽管值分布强化学习（DRL）在过去几年中被广泛研究，但仍有两个开放问题有待解决：一是如何保证所学分位数函数的有效性，二是如何高效利用分布信息。本文试图为这两个方向的深入研究提供一些新的视角。我们首先提出一种非递减分位数函数网络（NDQFN），以保证所得分位数估计的单调性；随后为值分布强化学习设计了一个利用分位数函数整个分布的通用探索框架，称为分布预测误差（DPE）。本文不仅讨论了该方法在理论上的必要性，还通过在Atari 2600游戏（尤其是一些难探索的游戏）上与若干竞争方法的比较，展示了其在实践中带来的性能提升。
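“非递减的分位数估计”可以通过“预测非负增量再做累加”来构造，下面是一个极简示意（这是我们给出的一种常见构造方式，并非NDQFN的具体网络结构）：

```python
# 单调分位数输出的构造示意：基准分位数 + 非负增量的累加（非NDQFN原始结构）
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneQuantileHead(nn.Module):
    def __init__(self, state_dim, num_actions, num_quantiles):
        super().__init__()
        self.base = nn.Linear(state_dim, num_actions)                     # 最低分位点的估计
        self.increments = nn.Linear(state_dim, num_actions * num_quantiles)
        self.num_actions, self.num_quantiles = num_actions, num_quantiles

    def forward(self, state):
        base = self.base(state).unsqueeze(-1)                             # (B, A, 1)
        inc = F.softplus(self.increments(state))                          # 非负增量
        inc = inc.view(-1, self.num_actions, self.num_quantiles)
        return base + torch.cumsum(inc, dim=-1)       # 沿分位数维单调非递减

head = MonotoneQuantileHead(state_dim=4, num_actions=2, num_quantiles=8)
out = head(torch.randn(3, 4))
assert torch.all(out[..., 1:] >= out[..., :-1])       # 单调性由构造保证
```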

Machine Learning Applications

19.Ordering-Based Causal Discovery with Reinforcement Learning

It is a long-standing question to discover causal relations among a set of variables in many empirical sciences. Recently, Reinforcement Learning (RL) has achieved promising results in causal discovery from observational data. However, searching the space of directed graphs and enforcing acyclicity by implicit penalties tend to be inefficient and restrict the existing RL-based method to small scale problems. In this work, we propose a novel RL-based approach for causal discovery, by incorporating RL into the ordering-based paradigm. Specifically, we formulate the ordering search problem as a multi-step Markov decision process, implement the ordering generating process with an encoder-decoder architecture, and finally use RL to optimize the proposed model based on the reward mechanisms designed for each ordering. A generated ordering would then be processed using variable selection to obtain the final causal graph. We analyze the consistency and computational complexity of the proposed method, and empirically show that a pretrained model can be exploited to accelerate training. Experimental results on both synthetic and real data sets show that the proposed method achieves a much improved performance over existing RL-based method.

在许多经验科学中，发现一组变量之间的因果关系是一个长期存在的问题。最近，强化学习（RL）在基于观测数据的因果发现中取得了可喜的成果。然而，直接搜索有向图空间并通过隐式惩罚强制无环性往往效率低下，使现有基于RL的方法局限于小规模问题。在这项工作中，我们提出了一种新的基于RL的因果发现方法，将RL纳入基于排序（ordering-based）的范式中。具体来说，我们将排序搜索问题表述为多步马尔可夫决策过程，用编码器-解码器架构实现排序的生成过程，最后根据为每个排序设计的奖励机制，使用RL优化所提模型。随后对生成的排序进行变量选择，得到最终的因果图。我们分析了所提方法的一致性和计算复杂度，并通过实验表明可以利用预训练模型加速训练。在合成数据集和真实数据集上的实验结果表明，与现有基于RL的方法相比，该方法的性能有了很大提升。
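生成排序之后的“变量选择”步骤，常见做法是让每个变量对其排序中的前驱做稀疏回归来保留边。下面给出这一步骤的示意（通用做法，排序与系数阈值均为假设，论文的具体剪枝方式可能不同）：

```python
# 给定排序后的变量选择示意：每个节点对其前驱做Lasso回归，保留系数显著的边（通用做法）
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 500, 4
X = rng.normal(size=(n, d))
X[:, 2] += 2.0 * X[:, 0]          # 构造的真实因果边：0 -> 2
X[:, 3] += 1.5 * X[:, 2]          # 构造的真实因果边：2 -> 3

ordering = [0, 1, 2, 3]           # 占位：代替RL生成的排序
adjacency = np.zeros((d, d))
for i in range(1, d):
    node, parents = ordering[i], ordering[:i]
    coef = Lasso(alpha=0.1).fit(X[:, parents], X[:, node]).coef_
    for p, c in zip(parents, coef):
        if abs(c) > 1e-3:
            adjacency[p, node] = 1   # 系数未被压到零的前驱保留为父节点

print(adjacency)
```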

20.Boosting Offline Reinforcement Learning with Residual Generative Modeling

Offline reinforcement learning (RL) tries to learn the near-optimal policy with recorded offline experience without online exploration. Current offline RL research includes: 1) generative modeling, i.e., approximating a policy using fixed data; and 2) learning the state-action value function. While most research focuses on the state-action function part through reducing the bootstrapping error in value function approximation induced by the distribution shift of training data, the effects of error propagation in generative modeling have been neglected. In this paper, we analyze the error in generative modeling. We propose AQL (action-conditioned Q-learning), a residual generative model to reduce policy approximation error for offline RL. We show that our method can learn more accurate policy approximations in different benchmark datasets. In addition, we show that the proposed offline RL method can learn more competitive AI agents in complex control tasks under the multiplayer online battle arena (MOBA) game, Honor of Kings.

离线强化学习（RL）试图在不进行在线探索的情况下，仅利用已记录的离线经验学习近似最优的策略。当前的离线RL研究包括：1) 生成式建模，即利用固定数据近似一个策略；2) 学习状态-动作价值函数。大多数研究聚焦于状态-动作价值函数部分，通过减小由训练数据分布偏移引起的价值函数近似中的自举误差来改进算法，而生成式建模中误差传播的影响却被忽视了。本文分析了生成式建模中的误差，并提出AQL（动作条件Q学习），一种用于减小离线RL策略近似误差的残差生成模型。我们表明，该方法在不同基准数据集上都能学到更精确的策略近似。此外，在多人在线战术竞技（MOBA）游戏《王者荣耀》的复杂控制任务中，所提出的离线RL方法能学到更具竞争力的AI智能体。
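“残差式生成建模”的大致形态可以理解为：在贴近数据分布的基础动作之上叠加一个幅度受限的残差修正。下面的结构示意是我们的推测性简化（用确定性网络代替生成模型，尺度参数为假设），并非论文发布的AQL实现：

```python
# 残差式策略结构示意：克隆得到的基础动作 + 有界残差修正（推测性简化，非AQL实现）
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, residual_scale=0.2):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                  nn.Linear(64, action_dim), nn.Tanh())      # 贴近离线数据的基础策略
        self.residual = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                                      nn.Linear(64, action_dim), nn.Tanh())  # 残差修正网络
        self.residual_scale = residual_scale

    def forward(self, state):
        a_base = self.base(state)
        correction = self.residual(torch.cat([state, a_base], dim=-1))
        return a_base + self.residual_scale * correction  # 残差幅度受限，动作不会远离数据支撑

policy = ResidualPolicy(state_dim=8, action_dim=2)
print(policy(torch.randn(5, 8)).shape)
```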

Multidisciplinary Topics and Applications

21.Dynamic Lane Traffic Signal Control with Group Attention and Multi-Timescale Reinforcement Learning

Traffic signal control has achieved significant success with the development of reinforcement learning. However, existing works mainly focus on intersections with normal lanes with fixed outgoing directions. It is noticed that some intersections actually implement dynamic lanes, in addition to normal lanes, to adjust the outgoing directions dynamically. Existing methods fail to coordinate the control of traffic signal and that of dynamic lanes effectively. In addition, they lack proper structures and learning algorithms to make full use of traffic flow prediction, which is essential to set the proper directions for dynamic lanes. Motivated by the ineffectiveness of existing approaches when controlling the traffic signal and dynamic lanes simultaneously, we propose a new method, namely MT-GAD, in this paper. It uses a group attention structure to reduce the number of required parameters and to achieve a better generalizability, and uses multi-timescale model training to learn proper strategy that could best control both the traffic signal and the dynamic lanes. The experiments on real datasets demonstrate that MT-GAD outperforms existing approaches significantly.

随着强化学习的发展，交通信号控制已取得显著成功。然而，现有工作主要关注只包含流向固定的普通车道的交叉口。值得注意的是，一些交叉口除普通车道外实际上还设置了可变车道（动态车道），用于动态调整车道流向。现有方法无法有效协调交通信号与可变车道的控制；此外，它们缺乏合适的结构和学习算法来充分利用交通流预测，而这对于为可变车道设置正确流向至关重要。鉴于现有方法在同时控制交通信号与可变车道时的不足，本文提出了一种新方法MT-GAD。它使用分组注意力结构来减少所需参数数量并获得更好的泛化能力，并使用多时间尺度的模型训练来学习能同时最优控制交通信号与可变车道的策略。在真实数据集上的实验表明，MT-GAD显著优于现有方法。

22.BACKDOORL: Backdoor Attack against Competitive Reinforcement Learning

Recent research has confirmed the feasibility of backdoor attacks in deep reinforcement learning (RL) systems. However, the existing attacks require the ability to arbitrarily modify an agent's observation, constraining the application scope to simple RL systems such as Atari games. In this paper, we migrate backdoor attacks to more complex RL systems involving multiple agents and explore the possibility of triggering the backdoor without directly manipulating the agent's observation. As a proof of concept, we demonstrate that an adversary agent can trigger the backdoor of the victim agent with its own action in two-player competitive RL systems. We prototype and evaluate BackdooRL in four competitive environments. The results show that when the backdoor is activated, the winning rate of the victim drops by 17% to 37% compared to when not activated. The videos are hosted at github.com/wanglun1996/.

最近的研究证实了在深度强化学习（RL）系统中进行后门攻击的可行性。然而，现有攻击要求能够任意修改智能体的观测，这将其适用范围限制在Atari游戏等简单RL系统中。在本文中，我们将后门攻击迁移到涉及多个智能体的更复杂RL系统中，并探索在不直接操纵受害者智能体观测的情况下触发后门的可能性。作为概念验证，我们证明在双人竞争RL系统中，对手智能体可以通过其自身的动作触发受害者智能体的后门。我们在四个竞争性环境中实现并评估了BackdooRL原型。结果表明，后门被激活时，受害者的胜率比未激活时下降了17%到37%。视频托管于 github.com/wanglun1996/。

23.Objective-aware Traffic Simulation via Inverse Reinforcement Learning

Traffic simulators act as an essential component in the operating and planning of transportation systems. Conventional traffic simulators usually employ a calibrated physical car-following model to describe vehicles' behaviors and their interactions with traffic environment. However, there is no universal physical model that can accurately predict the pattern of vehicle's behaviors in different situations. A fixed physical model tends to be less effective in a complicated environment given the non-stationary nature of traffic dynamics. In this paper, we formulate traffic simulation as an inverse reinforcement learning problem, and propose a parameter sharing adversarial inverse reinforcement learning model for dynamics-robust simulation learning. Our proposed model is able to imitate a vehicle's trajectories in the real world while simultaneously recovering the reward function that reveals the vehicle's true objective which is invariant to different dynamics. Extensive experiments on synthetic and real-world datasets show the superior performance of our approach compared to state-of-the-art methods and its robustness to variant dynamics of traffic.

交通仿真器是交通系统运行与规划中的重要组成部分。传统交通仿真器通常采用经过标定的物理跟驰模型来描述车辆的行为及其与交通环境的交互。然而，不存在能在不同情形下都准确预测车辆行为模式的通用物理模型；鉴于交通动态的非平稳性，固定的物理模型在复杂环境中往往效果较差。本文将交通仿真表述为一个逆强化学习问题，并提出一种用于动态鲁棒仿真学习的参数共享对抗式逆强化学习模型。所提模型能够模仿车辆在现实世界中的轨迹，同时还原能够揭示车辆真实目标的奖励函数，而该目标对不同动态保持不变。在合成数据集和真实数据集上的大量实验表明，与最先进的方法相比，我们的方法性能更优，并且对交通动态的变化具有鲁棒性。

调研报告(Survey Track)

24.Policy Learning with Constraints in Model-free Reinforcement Learning: A Survey

Reinforcement Learning (RL) algorithms have had tremendous success in simulated domains. These algorithms, however, often cannot be directly applied to physical systems, especially in cases where there are constraints to satisfy (e.g. to ensure safety or limit resource consumption). In standard RL, the agent is incentivized to explore any policy with the sole goal of maximizing reward; in the real world, however, ensuring satisfaction of certain constraints in the process is also necessary and essential. In this article, we overview existing approaches addressing constraints in model-free reinforcement learning. We model the problem of learning with constraints as a Constrained Markov Decision Process and consider two main types of constraints: cumulative and instantaneous. We summarize existing approaches and discuss their pros and cons. To evaluate policy performance under constraints, we introduce a set of standard benchmarks and metrics. We also summarize limitations of current methods and present open questions for future research.

强化学习（RL）算法在仿真领域取得了巨大成功。然而，这些算法往往无法直接应用于物理系统，尤其是在存在必须满足的约束（例如保证安全或限制资源消耗）的情形下。在标准RL中，智能体被激励去探索任意策略，其唯一目标是最大化奖励；然而在现实世界中，确保过程中满足某些约束同样是必不可少的。在本文中，我们综述了无模型强化学习中处理约束的现有方法。我们将带约束的学习问题建模为约束马尔可夫决策过程（CMDP），并考虑两类主要约束：累积约束和瞬时约束。我们总结了现有方法并讨论其优缺点。为了在约束下评估策略性能，我们引入了一组标准基准和指标。我们还总结了当前方法的局限性，并提出了有待未来研究的开放问题。
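综述中区分的两类约束，可以用标准的CMDP形式写出（通用教科书写法，与具体算法无关）：

```latex
% 累积约束（左）与瞬时约束（右）下的约束强化学习目标
\max_{\pi}\; \mathbb{E}_\pi\!\left[\sum_{t} \gamma^{t} r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_\pi\!\left[\sum_{t} \gamma^{t} c(s_t, a_t)\right] \le d \;\;\text{(cumulative)}
\quad \text{or} \quad
c(s_t, a_t) \le d \;\;\forall t \;\;\text{(instantaneous)}.
```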

Sister Conferences Best Papers

25.Deep Residual Reinforcement Learning (Extended Abstract)

We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG in commonly used benchmarks. Moreover, we find the residual algorithm an effective approach to the distribution mismatch problem in model-based planning. Compared with the existing TD(k) method, our residual-based method makes weaker assumptions about the model and yields a greater performance boost.

我们在无模型和基于模型的强化学习设定下重新审视残差算法。我们提出双向目标网络技术来稳定残差算法，得到了残差版本的DDPG，其在常用基准上显著优于原始（vanilla）DDPG。此外，我们发现残差算法是解决基于模型的规划中分布失配问题的有效途径。与现有的TD(k)方法相比，我们基于残差的方法对模型的假设更弱，且带来了更大的性能提升。

Doctoral Consortium

26.Planning and Reinforcement Learning for General-Purpose Service Robots

Despite recent progress in AI and robotics research, especially learned robot skills, there remain significant challenges in building robust, scalable, and general-purpose systems for service robots. This Ph.D. research aims to combine symbolic planning and reinforcement learning to reason about high-level robot tasks and adapt to the real world. We will introduce task planning algorithms that adapt to the environment and other agents, as well as reinforcement learning methods that are practical for service robot systems. Taken together, this work will make a significant step towards creating general-purpose service robots.

尽管人工智能和机器人研究近来取得了进展，尤其是在机器人技能学习方面，但在为服务机器人构建鲁棒、可扩展且通用的系统方面仍存在重大挑战。本博士研究旨在结合符号规划与强化学习，以对高层机器人任务进行推理并适应现实世界。我们将提出能够适应环境和其他智能体的任务规划算法，以及适用于服务机器人系统的实用强化学习方法。综合来看，这项工作将朝着创建通用服务机器人迈出重要一步。

27.Deep Reinforcement Learning with Hierarchical Structures

Hierarchical reinforcement learning (HRL), which enables control at multiple time scales, is a promising paradigm to solve challenging and long-horizon tasks. In this paper, we briefly introduce our work in bottom-up and top-down HRL and outline the directions for future work.

分层强化学习（HRL）能够在多个时间尺度上进行控制，是解决具有挑战性的长视界任务的一种有前景的范式。本文简要介绍了我们在自下而上与自上而下HRL方面的工作，并概述了未来工作的方向。

28.Combining Reinforcement Learning and Causal Models for Robotics Applications

The relation between Reinforcement learning (RL) and Causal Modeling(CM) is an underexplored area with untapped potential for any learning task. In this extended abstract of our Ph.D. research proposal, we present a way to combine both areas to improve their respective learning processes, especially in the context of our application area (service robotics). The preliminary results obtained so far are a good starting point for thinking about the success of our research project.

强化学习（RL）与因果建模（CM）之间的关系是一个尚未被充分探索的领域，对各类学习任务都蕴含着未开发的潜力。在这份博士研究计划的扩展摘要中，我们提出一种将这两个领域结合起来的方式，以改进各自的学习过程，尤其是在我们的应用领域（服务机器人）中。迄今取得的初步结果为判断该研究项目的前景提供了良好起点。

29.Inter-Task Similarity for Lifelong Reinforcement Learning in Heterogeneous Tasks

Reinforcement learning (RL) is a learning paradigm in which an agent interacts with the environment it inhabits to learn in a trial-and-error way. By letting the agent acquire knowledge from its own experience, RL has been successfully applied to complex domains such as robotics. However, for non-trivial problems, training an RL agent can take very long periods of time. Lifelong machine learning (LML) is a learning setting in which the agent learns to solve tasks sequentially, by leveraging knowledge accumulated from previously solved tasks to learn better/faster in a new one. Most LML works heavily rely on the assumption that tasks are similar to each other. However, this may not be true for some domains with a high degree of task-diversity that could benefit from adopting a lifelong learning approach, e.g., service robotics. Therefore, in this research we will address the problem of learning to solve a sequence of RL heterogeneous tasks (i.e., tasks that differ in their state-action space).

强化学习（RL）是一种学习范式：智能体通过与其所处环境交互，以试错的方式进行学习。凭借让智能体从自身经验中获取知识，RL已被成功应用于机器人等复杂领域。然而，对于非平凡的问题，训练一个RL智能体可能需要非常长的时间。终身机器学习（LML）是一种让智能体按顺序学习求解一系列任务的设定：通过利用先前任务中积累的知识，在新任务中学得更好、更快。大多数LML工作严重依赖于任务彼此相似这一假设。然而，对于服务机器人等任务多样性很高、本可受益于终身学习方法的领域，这一假设可能并不成立。因此，在本研究中我们将解决学习求解一系列RL异构任务（即状态-动作空间各不相同的任务）的问题。

Early Career

30.Width-Based Algorithms for Common Problems in Control, Planning and Reinforcement Learning

Width-based algorithms search for solutions through a general definition of state novelty. These algorithms have been shown to result in state-of-the-art performance in classical planning, and have been successfully applied to model-based and model-free settings where the dynamics of the problem are given through simulation engines. Width-based algorithms' performance is understood theoretically through the notion of planning width, providing polynomial guarantees on their runtime and memory consumption. To facilitate synergies across research communities, this paper summarizes the area of width-based planning, and surveys current and future research directions.

基于宽度的算法通过状态新颖性（novelty）的一般性定义来搜索解。这类算法已被证明在经典规划中达到最先进的性能，并已成功应用于问题动力学由仿真引擎给出的基于模型和无模型设定。基于宽度的算法的性能可以通过规划宽度（planning width）的概念在理论上加以理解，从而为其运行时间和内存消耗提供多项式保证。为了促进不同研究社区之间的协同，本文总结了基于宽度的规划这一领域，并综述了当前与未来的研究方向。







