In many computational science and engineering applications, the output of a system of interest corresponding to a given input can be queried at different levels of fidelity with different costs. Typically, low-fidelity data is cheap and abundant, while high-fidelity data is expensive and scarce. In this work we study the reinforcement learning (RL) problem in the presence of multiple environments with different levels of fidelity for a given control task. We focus on improving the RL agent's performance with multifidelity data. Specifically, a multifidelity estimator that exploits the cross-correlations between the low- and high-fidelity returns is proposed to reduce the variance in the estimation of the state-action value function. The proposed estimator, which is based on the method of control variates, is used to design a multifidelity Monte Carlo RL (MFMCRL) algorithm that improves the learning of the agent in the high-fidelity environment. The impacts of variance reduction on policy evaluation and policy improvement are theoretically analyzed by using probability bounds. Our theoretical analysis and numerical experiments demonstrate that for a finite budget of high-fidelity data samples, our proposed MFMCRL agent attains superior performance compared with that of a standard RL agent that uses only the high-fidelity environment data for learning the optimal policy.
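To make the role of the control-variate construction concrete, a minimal sketch of a multifidelity estimate of the state-action value follows; the notation and the particular choice of coefficient are illustrative assumptions, since the abstract does not specify the exact form of the estimator. Writing $G_{\mathrm{HF}}$ and $G_{\mathrm{LF}}$ for the high- and low-fidelity returns collected from a state-action pair $(s,a)$, a control-variate estimator of $Q(s,a)$ takes the form
\[
\hat{Q}^{\mathrm{MF}}(s,a) \;=\; \frac{1}{N}\sum_{i=1}^{N} G_{\mathrm{HF}}^{(i)} \;+\; \alpha\left(\hat{\mu}_{\mathrm{LF}} - \frac{1}{N}\sum_{i=1}^{N} G_{\mathrm{LF}}^{(i)}\right),
\]
where $N$ is the small number of costly high-fidelity samples and $\hat{\mu}_{\mathrm{LF}}$ is the low-fidelity mean estimated from abundant cheap samples. With the classical choice $\alpha = \rho\,\sigma_{\mathrm{HF}}/\sigma_{\mathrm{LF}}$, where $\rho$ is the correlation between the high- and low-fidelity returns, the variance of the plain high-fidelity sample mean is reduced by roughly a factor of $1-\rho^{2}$; this cross-correlation effect is what the proposed estimator exploits.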