Among the obstacles to applying reinforcement learning (RL) to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (the real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper addresses both issues simultaneously via distributionally robust offline RL, where we learn a distributionally robust policy from historical data collected in the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings: one where the dataset is well-explored, and one where the dataset has sufficient coverage of the optimal policy. We propose two algorithms~-- one for each setting~-- that achieve error bounds of $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ respectively, where $d$ is the dimension of the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, these are the first non-asymptotic sample-complexity results in this setting. Diverse experiments are conducted to corroborate our theoretical findings, demonstrating the superiority of our algorithms over their non-robust counterparts.
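For concreteness, the following is a minimal sketch of the distributionally robust objective referenced above, under assumptions not stated in the abstract: a finite-horizon episodic MDP with horizon $H$ and rewards $r_h$, and an uncertainty set $\mathcal{U}(P^0)$ centered at the nominal transition kernel $P^0$ of the source environment (the exact construction of $\mathcal{U}$ is an illustrative placeholder, not the paper's definition):
\begin{equation*}
    % Illustrative robust objective: maximize the worst-case value over kernels in U(P^0).
    \pi^{\mathrm{rob}} \in \operatorname*{arg\,max}_{\pi} \; \inf_{P \in \mathcal{U}(P^0)} \; \mathbb{E}_{\pi, P}\!\left[ \sum_{h=1}^{H} r_h(s_h, a_h) \right].
\end{equation*}
That is, the learned policy is evaluated under the worst-case transition kernel in the uncertainty set rather than under the nominal kernel alone, which is what guards against the training/testing mismatch described above.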