Understanding an agent's priorities by observing its behavior is critical for transparency and accountability in decision processes, such as in healthcare. While conventional approaches to policy learning almost invariably assume stationarity in behavior, this is hardly true in practice: medical practice is constantly evolving, and clinical professionals continually fine-tune their priorities. We desire an approach to policy learning that (1) provides interpretable representations of decision-making, (2) accounts for non-stationarity in behavior, and (3) operates in an offline manner. First, we model the behavior of learning agents in terms of contextual bandits and formalize the problem of inverse contextual bandits (ICB). Second, we propose two algorithms to tackle ICB, each making different assumptions about the agent's learning strategy. Finally, through both real and simulated data for liver transplantation, we illustrate the applicability and explainability of our method and validate its accuracy.
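To make the setting concrete, here is a minimal sketch of likelihood-based preference recovery in a contextual bandit (the softmax choice model, the random-search fitting, and all names are our illustrative assumptions, not the paper's algorithms):

```python
import numpy as np

def softmax_choice_loglik(theta, contexts, actions):
    """Log-likelihood of observed actions under a softmax choice model
    where the agent scores each arm by x_a @ theta."""
    ll = 0.0
    for X, a in zip(contexts, actions):      # X: (n_arms, d) arm features
        scores = X @ theta
        scores -= scores.max()               # numerical stability
        ll += scores[a] - np.log(np.exp(scores).sum())
    return ll

def sliding_window_icb(contexts, actions, d, window=50, n_cands=256, seed=0):
    """Crudely track a drifting preference vector: refit on a sliding
    window, with random search standing in for a proper optimizer."""
    rng = np.random.default_rng(seed)
    thetas = []
    for t in range(window, len(actions) + 1):
        sl = slice(t - window, t)
        cands = rng.standard_normal((n_cands, d))
        lls = [softmax_choice_loglik(c, contexts[sl], actions[sl]) for c in cands]
        thetas.append(cands[int(np.argmax(lls))])
    return thetas
```

Refitting on a sliding window is the simplest way to expose non-stationarity: the sequence of recovered vectors traces how the agent's priorities drift over time.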


Related Content

Causal decomposition analysis provides a way to identify mediators that contribute to health disparities between marginalized and non-marginalized groups. In particular, the degree to which a disparity would be reduced or would remain after intervening on a mediator is of interest. Yet estimating the disparity reduction and the disparity remaining can be challenging for many researchers, possibly because of a lack of understanding of how each estimation method differs from the others. In addition, no appropriate estimation method is available for one important setting (a regression-based approach with a categorical mediator). Therefore, we review the merits and limitations of the three existing estimation methods (regression, weighting, and imputation) and provide two new extensions that are useful in practical settings. A flexible new method uses an extended imputation approach to address categorical and continuous mediators or outcomes while incorporating any nonlinear relationships. A new regression method provides a simple estimator that performs well in terms of bias and variance, but at the cost of assuming linearity except for exposure-mediator interactions. Recommendations are given for choosing methods based on our review and simulation studies. We demonstrate how to choose an optimal method by identifying mediators that reduce race and gender disparities in cardiovascular health, using data from the Midlife Development in the US study.
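To make the imputation idea concrete, here is a minimal sketch under strong simplifying assumptions (a linear outcome model and hypothetical variable names, not the paper's estimators): the remaining disparity is estimated by predicting outcomes for the marginalized group after resampling their mediator values from the comparison group.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def disparity_decomposition(G, M, X, Y, seed=0):
    """G: (n,) group indicator, M: (n,) mediator, X: (n, k) covariates,
    Y: (n,) outcome. Returns (disparity, reduction, remaining)."""
    rng = np.random.default_rng(seed)
    model = LinearRegression().fit(np.column_stack([G, M, X]), Y)
    g1 = G == 1
    disparity = Y[g1].mean() - Y[~g1].mean()
    # Impute group-1 outcomes under the group-0 mediator distribution.
    M_swapped = rng.choice(M[~g1], size=g1.sum(), replace=True)
    Y_cf = model.predict(np.column_stack([G[g1], M_swapped, X[g1]]))
    remaining = Y_cf.mean() - Y[~g1].mean()
    return disparity, disparity - remaining, remaining
```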


Inverse reinforcement learning is a paradigm motivated by the goal of learning general reward functions from demonstrated behaviours. Yet the generality of learnt costs is often evaluated only in terms of robustness to spatial perturbations, assuming deployment at a fixed speed of execution. This is impractical in the context of robotics, where building time-invariant solutions is of crucial importance. In this work, we propose a formulation that allows us to (1) vary the length of execution by learning time-invariant costs, and (2) relax the temporal alignment requirements for learning from demonstration. We apply our method to two different types of cost formulations and evaluate their performance in the context of learning reward functions for simulated placement and peg-in-hole tasks executed on a 7-DoF Kuka IIWA arm. Our results show that our approach enables learning temporally invariant rewards from misaligned demonstrations that can also generalise spatially to out-of-distribution tasks.
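One simple way to picture time-invariance (our illustration; the paper's model-based formulation is richer) is to index a learned cost by normalized phase rather than wall-clock time, so that executions of different lengths and speeds are scored on a common footing:

```python
import numpy as np

def phase_resample(traj, n_phase=50):
    """Linearly resample a (T, d) trajectory onto n_phase points of a
    normalized phase variable s in [0, 1]."""
    s_old = np.linspace(0.0, 1.0, len(traj))
    s_new = np.linspace(0.0, 1.0, n_phase)
    return np.stack([np.interp(s_new, s_old, traj[:, j])
                     for j in range(traj.shape[1])], axis=1)

def time_invariant_cost(traj, cost_fn, n_phase=50):
    """Average a learned per-state cost over phase, not over timesteps,
    so fast and slow executions of the same motion score alike."""
    return cost_fn(phase_resample(traj, n_phase)).mean()
```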


We present a novel method for safely navigating a robot in unknown and uneven outdoor terrains. Our approach trains a Deep Reinforcement Learning (DRL) network with channel and spatial attention modules, using a novel reward function, to compute an attention map of the environment. The attention map identifies regions of the environment's elevation map with high elevation gradients, where the robot could lose stability or even flip over. We transform this attention map into a 2D navigation cost-map that encodes the planarity (level of flatness) of the terrain. Using the cost-map, we formulate a method for computing local least-cost waypoints leading to the robot's goal, and integrate our approach with DWA-RL, a state-of-the-art navigation method. Our approach guarantees safe, locally least-cost paths and dynamically feasible robot velocities in highly uneven terrains. Our hybrid approach also narrows the sim-to-real gap that arises when deploying DRL networks trained in simulation. We observe improvements in terms of success rate, the cumulative elevation gradient of the robot's trajectory, and the safety of the robot's velocity. We evaluate our method on a real Husky robot in highly uneven real-world terrains and demonstrate its benefits.
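The costmap step can be pictured with a simplified sketch, in which raw elevation-gradient magnitudes stand in for the learned attention map and the waypoint scoring is a toy version of the paper's method:

```python
import numpy as np

def planarity_costmap(elevation):
    """Score each cell of a (H, W) elevation map by its local gradient
    magnitude: high cost = steep, potentially unstable terrain."""
    gy, gx = np.gradient(elevation)
    return np.hypot(gx, gy)

def next_waypoint(costmap, pos, goal, radius=5, w_goal=0.5):
    """Pick the cheapest nearby cell, trading terrain cost against
    straight-line progress toward the goal."""
    best, best_score = pos, np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = pos[0] + dy, pos[1] + dx
            if 0 <= y < costmap.shape[0] and 0 <= x < costmap.shape[1]:
                score = costmap[y, x] + w_goal * np.hypot(goal[0] - y, goal[1] - x)
                if score < best_score:
                    best, best_score = (y, x), score
    return best
```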


This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
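Schematically (our paraphrase, not a verbatim equation from the book), the central statement is that the connected four-point correlator of network outputs, the leading measure of non-Gaussianity, is suppressed by the aspect ratio:

```latex
% Schematic: deviations from the infinite-width Gaussian description are
% controlled by the depth-to-width ratio r = L/n (L = depth, n = width).
\mathbb{E}\big[\, z_{i_1} z_{i_2} z_{i_3} z_{i_4} \,\big]_{\text{connected}}
  \;=\; O\!\left(\frac{L}{n}\right),
\qquad r \equiv \frac{L}{n}.
```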


We address the issue of tuning hyperparameters (HPs) for imitation learning algorithms in the context of continuous control, when the underlying reward function of the demonstrating expert cannot be observed at any time. The vast literature on imitation learning mostly considers this reward function to be available for HP selection, but this is not a realistic setting. Indeed, were this reward function available, it could be used directly for policy training, and imitation would not be necessary. To tackle this mostly ignored problem, we propose a number of possible proxies for the external reward. We evaluate them in an extensive empirical study (more than 10,000 agents across 9 environments) and make practical recommendations for selecting HPs. Our results show that while imitation learning algorithms are sensitive to HP choices, it is often possible to select good enough HPs through a proxy for the reward function.
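As an example of what such a proxy can look like (our own illustration, not necessarily among the paper's candidates), one can rank HP configurations by how closely the imitator's visited-state distribution matches the expert's:

```python
import numpy as np

def state_distribution_proxy(expert_states, agent_states):
    """Lower is better: a crude mean-feature (MMD-style) distance between
    the expert's and the agent's (N, d) arrays of visited states."""
    return float(np.linalg.norm(expert_states.mean(0) - agent_states.mean(0)))

def select_hps(candidates, expert_states, rollout_fn):
    """candidates: list of HP dicts; rollout_fn(hp) -> (N, d) states visited
    by an agent trained with those HPs. Returns the proxy-best HP dict."""
    scores = [state_distribution_proxy(expert_states, rollout_fn(hp))
              for hp in candidates]
    return candidates[int(np.argmin(scores))]
```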


Promoting behavioural diversity is critical for solving games with non-transitive dynamics, where strategic cycles exist and there is no consistent winner (e.g., Rock-Paper-Scissors). Yet there is a lack of rigorous treatment for defining diversity and constructing diversity-aware learning dynamics. In this work, we offer a geometric interpretation of behavioural diversity in games and introduce a novel diversity metric based on determinantal point processes (DPPs). By incorporating the diversity metric into best-response dynamics, we develop diverse fictitious play and a diverse policy-space response oracle for solving normal-form games and open-ended games. We prove the uniqueness of the diverse best response and the convergence of our algorithms on two-player games. Importantly, we show that maximising the DPP-based diversity metric provably enlarges the gamescape -- the convex polytope spanned by agents' mixtures of strategies. To validate our diversity-aware solvers, we test them on tens of games that show strong non-transitivity. Results suggest that our methods achieve much lower exploitability than state-of-the-art solvers by finding effective and diverse strategies.
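The core of the metric can be sketched in a few lines: score a finite population by the determinant of the Gram kernel built from the rows of its empirical payoff matrix (a simplified reading of the DPP construction; variable names are ours):

```python
import numpy as np

def dpp_diversity(M):
    """M: (n_agents, n_opponents) empirical payoff matrix. Each agent's
    payoff row acts as its behavioural feature vector; det(M M^T) is large
    when the rows are close to orthogonal, i.e. behaviourally diverse."""
    L = M @ M.T
    return float(np.linalg.det(L + 1e-8 * np.eye(L.shape[0])))  # jitter for stability
```

Adding a near-duplicate strategy leaves the determinant almost unchanged, while adding a strategy with an orthogonal payoff profile increases it, which is exactly the behaviour a diversity bonus should reward.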


We study the offline meta-reinforcement learning (OMRL) problem, a paradigm that enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks without any interaction with the environment, making RL truly practical in many real-world applications. This problem is still not fully understood, and two major challenges need to be addressed. First, offline RL usually suffers from bootstrapping errors on out-of-distribution state-action pairs, which lead to divergence of value functions. Second, meta-RL requires efficient and robust task inference learned jointly with the control policy. In this work, we enforce behavior regularization on the learned policy as a general approach to offline RL, combined with a deterministic context encoder for efficient task inference. We propose a novel negative-power distance metric on a bounded context embedding space, whose gradient propagation is detached from the Bellman backup. We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches combining meta-RL and distance metric learning. To the best of our knowledge, our method is the first model-free, end-to-end OMRL algorithm; it is computationally efficient and demonstrated to outperform prior algorithms on several meta-RL benchmarks.
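A rough PyTorch sketch of the two ingredients as we read them from the abstract (the exponent, the epsilon, and the exact detachment point are our guesses, not the paper's specification):

```python
import torch

def negative_power_distance(z1, z2, p=2.0, eps=1e-6):
    """Negative-power distance between context embeddings: close to zero
    when z1 and z2 are far apart, strongly negative when they coincide,
    giving bounded gradients for distant pairs."""
    return -1.0 / (torch.norm(z1 - z2, dim=-1) ** p + eps)

def critic_input(state, context_encoder, recent_transitions):
    """Condition the critic on a detached context embedding so that
    Bellman-backup gradients do not flow into the encoder; the encoder is
    trained by the distance metric alone."""
    z = context_encoder(recent_transitions)
    return torch.cat([state, z.detach()], dim=-1)
```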


Humans and animals show remarkable flexibility in adjusting their behaviour when their goals, or the rewards in the environment, change. While such flexibility is a hallmark of intelligent behaviour, these multi-task scenarios remain an important challenge for machine learning algorithms and neurobiological models alike. Factored representations can enable flexible behaviour by abstracting away general aspects of a task from those prone to change, while nonparametric methods provide a principled way of using similarity to past experiences to guide current behaviour. Here we combine the successor representation (SR), which factors the value of actions into expected outcomes and corresponding rewards, with nonparametric inference that evaluates task similarity and clusters the space of rewards. The proposed algorithm improves the SR's transfer capabilities by inverting a generative model over tasks, while also explaining important neurobiological signatures of place-cell representation in the hippocampus. It dynamically samples from a flexible number of distinct SR maps while accumulating evidence about the current reward context, and it outperforms competing algorithms in settings with both known and unsignalled reward changes. It reproduces the "flickering" behaviour of hippocampal maps seen when rodents navigate to changing reward locations, and gives a quantitative account of trajectory-dependent hippocampal representations (so-called splitter cells) and their dynamics. We thus provide a novel algorithmic approach to multi-task learning, as well as a common normative framework that links together these different characteristics of the brain's spatial representation.
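The backbone of the algorithm can be sketched as follows (a hypothetical interface; the paper's nonparametric clustering over tasks is more involved): each candidate reward context keeps its own SR map and reward vector, and a map is sampled in proportion to accumulated evidence:

```python
import numpy as np

def sr_values(M, r):
    """M: (S, S) successor matrix of expected discounted state occupancies,
    r: (S,) reward vector. The SR factorization gives values V = M @ r."""
    return M @ r

def sample_map(log_evidence, rng):
    """log_evidence[k]: accumulated log-likelihood of recent rewards under
    map k. Sample a map index in proportion to its posterior evidence."""
    p = np.exp(log_evidence - np.max(log_evidence))
    p /= p.sum()
    return rng.choice(len(log_evidence), p=p)
```

Because the reward vector is factored out, a reward change only requires re-estimating r (or switching maps), not relearning the occupancy structure in M.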


Hashing, or learning binary embeddings of data, is frequently used in nearest neighbor retrieval. In this paper, we develop learning to rank formulations for hashing, aimed at directly optimizing ranking-based evaluation metrics such as Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG). We first observe that the integer-valued Hamming distance often leads to tied rankings, and propose to use tie-aware versions of AP and NDCG to evaluate hashing for retrieval. Then, to optimize tie-aware ranking metrics, we derive their continuous relaxations, and perform gradient-based optimization with deep neural networks. Our results establish the new state-of-the-art for image retrieval by Hamming ranking in common benchmarks.
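To see why tie-awareness matters, the sketch below estimates AP under random tie-breaking by Monte Carlo; the paper instead derives closed-form tie-aware expressions, which this simple estimator approximates:

```python
import numpy as np

def average_precision(ranked_rel):
    """Standard AP for a binary relevance vector in ranked order."""
    hits = np.cumsum(ranked_rel)
    if hits[-1] == 0:
        return 0.0
    ranks = np.arange(1, len(ranked_rel) + 1)
    return float((ranked_rel * hits / ranks).sum() / hits[-1])

def tie_aware_ap_mc(hamming_dist, relevance, n_samples=100, seed=0):
    """Integer Hamming distances leave many items tied at the same rank,
    so average AP over random tie-breakings rather than trusting one
    arbitrary sort order."""
    rng = np.random.default_rng(seed)
    aps = []
    for _ in range(n_samples):
        noise = rng.random(len(hamming_dist))       # random tie-breaking
        order = np.lexsort((noise, hamming_dist))   # distance first, noise second
        aps.append(average_precision(relevance[order]))
    return float(np.mean(aps))
```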


Recommender systems play a crucial role in mitigating information overload by suggesting personalized items or services to users. The vast majority of traditional recommender systems treat the recommendation procedure as a static process and make recommendations following a fixed strategy. In this paper, we propose a novel recommender system with the capability of continuously improving its strategies during interactions with users. We model the sequential interactions between users and the recommender system as a Markov Decision Process (MDP) and leverage Reinforcement Learning (RL) to automatically learn optimal strategies by recommending items through trial and error and receiving reinforcement signals from user feedback. In particular, we introduce an online user-agent interaction environment simulator, which can pre-train and evaluate model parameters offline before the model is applied online. Moreover, we validate the importance of list-wise recommendations during the interactions between users and agent, and develop a novel approach to incorporate them into the proposed framework, LIRD, for list-wise recommendations. Experimental results on a real-world e-commerce dataset demonstrate the effectiveness of the proposed framework.
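The interaction loop can be summarized in a short sketch (the agent and simulator interfaces here are hypothetical, not the paper's API):

```python
def run_episode(agent, simulator, list_size=5, max_steps=20):
    """One MDP episode: the simulator stands in for real users during
    offline pre-training; the agent recommends a list of items, receives
    feedback-derived reward, and updates its policy."""
    state = simulator.reset()                   # e.g. recent interaction history
    for _ in range(max_steps):
        item_list = agent.recommend(state, k=list_size)
        next_state, reward, done = simulator.step(item_list)  # clicks/purchases
        agent.update(state, item_list, reward, next_state)
        state = next_state
        if done:
            break
```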

Related Papers
Learning Time-Invariant Reward Functions through Model-Based Inverse Reinforcement Learning
Todor Davchev, Sarah Bechtle, Subramanian Ramamoorthy, Franziska Meier · 14 Sep 2021
Kasun Weerakoon, Adarsh Jagan Sathyamoorthy, Utsav Patel, Dinesh Manocha · 10 Sep 2021
Daniel A. Roberts, Sho Yaida, Boris Hanin · 18 Jun 2021
Leonard Hussenot, Marcin Andrychowicz, Damien Vincent, Robert Dadashi, Anton Raichuk, Lukasz Stafiniak, Sertan Girgin, Raphael Marinier, Nikola Momchev, Sabela Ramos, Manu Orsini, Olivier Bachem, Matthieu Geist, Olivier Pietquin · 25 May 2021
Nicolas Perez Nieves, Yaodong Yang, Oliver Slumbers, David Henry Mguni, Jun Wang · 14 Mar 2021
Inferred successor maps for better transfer learning
Tamas J. Madarasz · 2 Jul 2019
Kun He, Fatih Cakir, Sarah Adel Bargal, Stan Sclaroff · 28 Mar 2018
Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, Jiliang Tang · 5 Jan 2018