We study meta-learning in Markov decision processes (MDPs) with linear transition models in the undiscounted episodic setting. Under a task-sharedness metric based on model proximity, we study task families characterized by a distribution over models specified by a bias term and a variance component. We then propose BUC-MatrixRL, a version of the UC-MatrixRL algorithm, and show that it can meaningfully leverage a set of sampled training tasks to quickly solve a test task sampled from the same task distribution, by learning an estimator of the bias parameter of the task distribution. The analysis leverages and extends results from the learning-to-learn linear regression and linear bandit settings to the more general case of MDPs with linear transition models. We prove that, compared to learning the tasks in isolation, BUC-MatrixRL achieves significant improvements in transfer regret for high-bias, low-variance task distributions.
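For concreteness, a minimal sketch of the setting under the bilinear transition model used by UC-MatrixRL (the symbols $\phi$, $\psi$, $M_g$, $M_0$, and $\sigma^2$ below are illustrative notation rather than the exact definitions used in the analysis): each task $g$ is an MDP whose transition kernel factors through known feature maps and an unknown core matrix,
\[
P_{M_g}(s' \mid s, a) \;=\; \phi(s,a)^{\top} M_g \,\psi(s'),
\]
and the task distribution draws each core matrix as a shared bias term plus a zero-mean perturbation whose magnitude is the variance component,
\[
M_g \;=\; M_0 + \Delta_g, \qquad \mathbb{E}[\Delta_g] = 0, \qquad \mathbb{E}\,\|\Delta_g\|_F^2 = \sigma^2.
\]
Under this reading, the estimator of the bias parameter learned by BUC-MatrixRL targets $M_0$, consistent with the high-bias, low-variance regime in which the transfer-regret improvements are proved.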