A key challenge for a reinforcement learning (RL) agent is to incorporate external/expert advice in its learning. The desired goals of an algorithm that can shape the learning of an RL agent with external advice include (a) maintaining policy invariance; (b) accelerating the learning of the agent; and (c) learning from arbitrary advice [3]. To address this challenge, this paper formulates the problem of incorporating external advice in RL as a multi-armed bandit called shaping-bandits. The reward of each arm of shaping-bandits corresponds to the return obtained by following the expert or by following a default RL algorithm learning on the true environment reward. We show that directly applying existing bandit and shaping algorithms that do not reason about the non-stationary nature of the underlying returns can lead to poor results. Thus we propose UCB-PIES (UPIES), Racing-PIES (RPIES), and Lazy PIES (LPIES), three different shaping algorithms built on different assumptions that reason about the long-term consequences of following the expert policy or the default RL algorithm. Our experiments in four different settings show that these proposed algorithms achieve the above-mentioned goals, whereas the other algorithms fail to do so.
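To make the shaping-bandit framing concrete, the sketch below shows a minimal two-armed bandit in which arm 0 stands for "follow the expert's shaping advice" and arm 1 for "follow the default RL algorithm on the true environment reward," with arms selected by a standard UCB1 rule. This is an illustrative assumption-laden sketch, not the paper's UCB-PIES/RPIES/LPIES algorithms, which additionally reason about the non-stationary, long-term consequences of each choice; the callable `run_episode` is hypothetical and is assumed to return the episodic return obtained under the chosen policy.

```python
import numpy as np

def ucb_shaping_bandit(run_episode, num_episodes=200, c=2.0):
    """Illustrative two-armed shaping bandit with a plain UCB1 selection rule.

    run_episode(arm) -> float is a hypothetical callable returning the
    episodic return when arm 0 (expert shaping) or arm 1 (default RL on
    the true reward) is followed for one episode.
    """
    counts = np.zeros(2)   # number of times each arm has been pulled
    means = np.zeros(2)    # running mean return of each arm
    for t in range(1, num_episodes + 1):
        if t <= 2:
            arm = t - 1    # pull each arm once to initialize estimates
        else:
            ucb = means + c * np.sqrt(np.log(t) / counts)
            arm = int(np.argmax(ucb))
        g = run_episode(arm)           # return from following the chosen policy
        counts[arm] += 1
        means[arm] += (g - means[arm]) / counts[arm]  # incremental mean update
    return means, counts
```

As the abstract notes, a stationary rule of this kind can perform poorly precisely because the returns of both arms drift as the underlying RL agent keeps learning; accounting for that non-stationarity is what the proposed PIES variants address.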