The application of Reinforcement Learning (RL) in real-world environments can be expensive or risky due to sub-optimal policies during training. In Offline RL, this problem is avoided since interactions with an environment are prohibited. Policies are learned from a given dataset, which alone determines their performance. Despite this, how dataset characteristics influence Offline RL algorithms has hardly been investigated. The dataset characteristics are determined by the behavioral policy that samples the dataset. Therefore, we characterize behavioral policies as exploratory if they yield high expected information in their interaction with the Markov Decision Process (MDP), and as exploitative if they have high expected return. We implement two corresponding empirical measures for the datasets sampled by the behavioral policy in deterministic MDPs. The first measure, SACo, is defined by the normalized number of unique state-action pairs and captures exploration. The second measure, TQ, is defined by the normalized average trajectory return and captures exploitation. Empirical evaluations show the effectiveness of TQ and SACo. In large-scale experiments using our proposed measures, we show that the unconstrained off-policy Deep Q-Network family requires datasets with high SACo to find a good policy. Furthermore, the experiments show that policy-constrained algorithms perform well on datasets with high TQ and SACo. Finally, the experiments show that purely dataset-constrained Behavioral Cloning performs competitively with the best Offline RL algorithms on datasets with high TQ.
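To illustrate the two measures, below is a minimal sketch of how TQ and SACo could be computed from a logged dataset. The normalization references (`random_return`, `best_return`, `reference_count`) are hypothetical placeholders introduced here for illustration and may differ from the exact normalization used in the paper.

```python
import numpy as np

def trajectory_quality(trajectory_returns, random_return, best_return):
    """TQ sketch: average trajectory return of the dataset, normalized to a
    [0, 1]-like scale. `random_return` and `best_return` are hypothetical
    reference values (e.g. of a random and a well-trained online policy);
    the paper's exact normalization may differ."""
    avg_return = float(np.mean(trajectory_returns))
    return (avg_return - random_return) / (best_return - random_return)

def state_action_coverage(states, actions, reference_count):
    """SACo sketch: number of unique (state, action) pairs in the dataset,
    normalized by a hypothetical reference count (assumption)."""
    unique_pairs = {(tuple(s), a) for s, a in zip(states, actions)}
    return len(unique_pairs) / reference_count

# Hypothetical usage on a toy dataset:
returns = [12.0, 8.5, 10.0]                       # per-trajectory returns
states = [(0, 1), (0, 2), (1, 1), (0, 1)]          # logged states
actions = [0, 1, 0, 0]                             # logged actions
print(trajectory_quality(returns, random_return=0.0, best_return=20.0))  # ~0.51
print(state_action_coverage(states, actions, reference_count=10))        # 0.3
```

Under these assumptions, a dataset from a near-optimal but repetitive policy would score high TQ and low SACo, while a dataset from a random policy would tend toward the opposite.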