The performance of state-of-the-art baselines in the offline RL regime varies widely over the spectrum of dataset qualities, ranging from "far-from-optimal" random data to "close-to-optimal" expert demonstrations. We re-implement these baselines under a fair, unified, and highly factorized framework, and show that when a given baseline outperforms its competing counterparts on one end of the spectrum, it never does on the other end. This consistent trend prevents us from naming a victor that outperforms the rest across the board. We attribute the asymmetry in performance between the two ends of the quality spectrum to the amount of inductive bias injected into the agent to entice it to posit that the behavior underlying the offline dataset is optimal for the task. The more bias is injected, the better the agent performs, provided the dataset is close-to-optimal; otherwise, its effect is severely detrimental. Using an advantage-weighted regression template as a base, we conduct an investigation that corroborates this: injections of such optimality inductive bias, when not done parsimoniously, make the agent subpar on the datasets where it was dominant, as soon as the behavior policy underlying the data is sub-optimal. In an effort to design methods that perform well across the whole spectrum, we revisit the generalized policy iteration scheme for the offline regime, and study the impact of nine distinct newly-introduced proposal distributions over actions, involved in our proposed generalizations of the policy evaluation and policy improvement update rules. We show that certain orchestrations strike the right balance and can improve performance on one end of the spectrum without harming it on the other end.
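To make the notion of "optimality inductive bias" concrete, here is a minimal sketch of the advantage-weighted regression template mentioned above. All array values, the variable names, and the temperature setting are illustrative assumptions, not quantities from the paper; the core mechanism is that each dataset action's log-likelihood is weighted by an exponentiated advantage, which biases the learned policy toward treating high-advantage behavior in the dataset as near-optimal.

```python
import numpy as np

# Hypothetical minibatch drawn from an offline dataset:
# per-sample advantage estimates A(s, a) from a learned critic, and
# log-probabilities log pi(a|s) of the dataset actions under the current policy.
advantages = np.array([1.2, -0.5, 0.3, -2.0])
log_probs = np.array([-0.9, -1.1, -0.4, -1.6])

# Temperature beta controls how much optimality bias is injected:
# beta -> infinity recovers plain behavior cloning (all weights equal),
# small beta sharply favors actions the critic deems advantageous.
temperature = 1.0

# Advantage-weighted regression loss: a weighted negative log-likelihood,
# with weights exp(A / beta) concentrating the policy on high-advantage actions.
weights = np.exp(advantages / temperature)
loss = -np.mean(weights * log_probs)
```

When the dataset is close-to-optimal, the critic's advantages align with truly good actions and the bias helps; on far-from-optimal data, the same weighting can over-commit the policy to spurious high-advantage estimates, matching the asymmetry described above.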