Offline reinforcement learning (RL) learns exclusively from static datasets, without further interaction with the environment. In practice, such datasets vary widely in quality, often mixing expert, suboptimal, and even random trajectories. The choice of algorithm therefore depends on dataset quality: behavior cloning can suffice on high-quality data, whereas mixed- or low-quality data typically benefits from offline RL methods that stitch useful behavior across trajectories. Yet in the wild it is difficult to assess dataset quality a priori, because the data's provenance and skill composition are unknown. We address the problem of estimating offline dataset quality without training an agent. We study a spectrum of proxies, from simple cumulative rewards to learned value-based estimators, and introduce the Bellman Wasserstein distance (BWD), a value-aware optimal transport score that measures how dissimilar a dataset's behavioral policy is from a random reference policy. BWD is computed from a behavioral critic and a state-conditional OT formulation, requiring no environment interaction or full policy optimization. Across D4RL MuJoCo tasks, BWD strongly correlates with an oracle performance score that aggregates multiple offline RL algorithms, enabling efficient prediction of how well standard agents will perform on a given dataset. Beyond prediction, integrating BWD as a regularizer during policy optimization explicitly pushes the learned policy away from random behavior and improves returns. These results indicate that value-aware, distributional signals such as BWD are practical tools both for triaging offline RL datasets and for regularizing policy optimization.
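To make the idea concrete, below is a minimal sketch of a value-aware, state-conditional OT score in the spirit of BWD. It is not the paper's exact formulation: the critic is a placeholder linear map, the state and action dimensions are assumed, the reference policy is taken to be uniform over the action box, and the per-state distance is a 1D Wasserstein-1 distance between critic values of the dataset action and of random reference actions.

```python
import numpy as np

# Sketch of a BWD-like score: for each state, compare the critic value of the
# dataset (behavioral) action with critic values of actions drawn from a
# uniform random reference policy, via a 1D Wasserstein-1 distance, then
# average over states. All names and dimensions here are illustrative.

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 17, 6          # assumed MuJoCo-like dimensions

# Placeholder for a learned behavioral critic Q(s, a): a fixed linear map so
# the example runs end to end; in practice this would be a fitted Q-network.
W = rng.standard_normal(STATE_DIM + ACTION_DIM)

def critic(states, actions):
    return np.concatenate([states, actions], axis=1) @ W

def bwd_like_score(states, dataset_actions, action_low, action_high, n_ref=32):
    """Mean per-state W1 distance between Q-values of dataset actions and
    Q-values of uniformly random reference actions."""
    scores = []
    for s, a in zip(states, dataset_actions):
        a_ref = rng.uniform(action_low, action_high, size=(n_ref, ACTION_DIM))
        s_rep = np.repeat(s[None], n_ref, axis=0)
        q_beh = critic(s[None], a[None])[0]   # Q of the behavioral action
        q_ref = critic(s_rep, a_ref)          # Q of random reference actions
        # W1 between a point mass and an empirical distribution reduces to a
        # mean absolute difference of critic values.
        scores.append(np.mean(np.abs(q_beh - q_ref)))
    return float(np.mean(scores))

# Toy usage on synthetic data: larger scores suggest the behavioral policy is
# farther, in value space, from random behavior.
states = rng.standard_normal((64, STATE_DIM))
actions = rng.uniform(-1.0, 1.0, size=(64, ACTION_DIM))
print(bwd_like_score(states, actions, -1.0, 1.0))
```

For the regularization use mentioned above, such a score could in principle be subtracted as a penalty in an actor loss so that maximizing the objective pushes the learned policy away from random behavior; the exact weighting and formulation used in the paper are not shown here.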