For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by evaluating policies optimized for the learned reward. However, this method cannot distinguish between the learned reward function failing to reflect user preferences and the policy optimization process failing to optimize the learned reward. Moreover, this method can only tell us about behavior in the evaluation environment, but the reward may incentivize very different behavior in even a slightly different deployment environment. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without a policy optimization step. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy. Furthermore, we find EPIC can be efficiently approximated and is more robust than baselines to the choice of coverage distribution. Finally, we show that EPIC distance bounds the regret of optimal policies even under different transition dynamics, and we confirm empirically that it predicts policy training success. Our source code is available at https://github.com/HumanCompatibleAI/evaluating-rewards.
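As a concrete illustration of the sample-based approximation mentioned above, the following is a minimal sketch of how the EPIC distance between two reward functions can be estimated: each reward is canonically shaped using Monte Carlo samples from the state and action coverage distributions, and the distance is the Pearson distance between the shaped rewards on a batch of sampled transitions. The function names (`canonicalize`, `epic_distance`) and the vectorized reward-function interface are illustrative assumptions, not the API of the linked repository.

```python
import numpy as np

def canonicalize(reward_fn, s, a, s_next, bg_s, bg_a, gamma):
    """Approximate the canonically shaped reward
        C(R)(s, a, s') = R(s, a, s')
                         + E[gamma * R(s', A, S') - R(s, A, S') - gamma * R(S, A, S')],
    where S, S' ~ D_S and A ~ D_A are sampled independently.

    s, a, s_next: arrays of shape (N,) -- the batch of transitions to evaluate.
    bg_s, bg_a:   arrays of shape (M,) -- independent samples from D_S and D_A.
    reward_fn:    vectorized callable R(s, a, s') -> array of shape (N,) (assumed interface).
    """
    M = len(bg_s)

    def mean_next(x):
        # Estimate E_{A ~ D_A, S' ~ D_S}[R(x, A, S')] for each x in the batch.
        # Pairing bg_a[j] with bg_s[j] is unbiased because the two sets of
        # samples were drawn independently.
        x_rep = np.repeat(x, M)
        a_rep = np.tile(bg_a, len(x))
        sp_rep = np.tile(bg_s, len(x))
        return reward_fn(x_rep, a_rep, sp_rep).reshape(len(x), M).mean(axis=1)

    const = mean_next(bg_s).mean()           # E[R(S, A, S')]
    return (reward_fn(s, a, s_next)
            + gamma * mean_next(s_next)      # E[gamma * R(s', A, S')]
            - mean_next(s)                   # E[R(s, A, S')]
            - gamma * const)                 # E[gamma * R(S, A, S')]

def epic_distance(reward_a, reward_b, s, a, s_next, bg_s, bg_a, gamma=0.99):
    """Pearson distance between the canonically shaped rewards on a batch of
    transitions (s, a, s') drawn from the coverage distribution."""
    ca = canonicalize(reward_a, s, a, s_next, bg_s, bg_a, gamma)
    cb = canonicalize(reward_b, s, a, s_next, bg_s, bg_a, gamma)
    rho = np.corrcoef(ca, cb)[0, 1]
    # max(...) guards against tiny negative values from floating-point error.
    return np.sqrt(max(0.0, (1.0 - rho) / 2.0))
```

Because the canonical shaping subtracts out potential-based shaping terms and the Pearson distance is invariant to positive rescaling, this estimate does not change across rewards that induce the same optimal policy, which is the invariance property the abstract refers to.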