The In-Sample Softmax for Offline Reinforcement Learning

Chenjun Xiao,Han Wang,Yangchen Pan,Adam White,Martha White

Reinforcement learning (RL) agents can leverage batches of previously collected data to extract a reasonable control policy. An emerging issue in this offline RL setting, however, is that the bootstrapping update underlying many of our methods suffers from insufficient action coverage: the standard max operator may select a maximal action that has not been seen in the dataset. Bootstrapping from these inaccurate values can lead to overestimation and even divergence. A growing number of methods attempt to approximate an in-sample max, one that uses only actions well covered by the dataset. We highlight a simple fact: it is more straightforward to approximate an in-sample softmax using only actions in the dataset. We show that policy iteration based on the in-sample softmax converges, and that for decreasing temperatures it approaches the in-sample max. We derive an In-Sample Actor-Critic (AC), using this in-sample softmax, and show that it is consistently better than or comparable to existing offline RL methods, and is also well suited to fine-tuning.
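The contrast between the standard max, the in-sample max, and the in-sample softmax can be illustrated with a small tabular sketch. This is not the paper's In-Sample Actor-Critic; it assumes a single state with a handful of discrete actions, a boolean mask marking which actions appear in the dataset, and a log-sum-exp form of the softmax value scaled by a temperature. The names `in_sample_softmax`, `in_sample_max`, `q`, and `mask` are hypothetical and chosen only for illustration.

```python
import numpy as np


def in_sample_max(q_values, in_sample_mask):
    """Max over only those actions that appear in the dataset (assumed mask)."""
    q = np.asarray(q_values, dtype=float)
    mask = np.asarray(in_sample_mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one action must be in the dataset")
    return float(q[mask].max())


def in_sample_softmax(q_values, in_sample_mask, temperature):
    """Soft maximum tau * log(sum(exp(q / tau))) over in-dataset actions only.

    Assumption: the soft value takes this log-sum-exp form; the shift by the
    largest in-sample value is only for numerical stability.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    q = np.asarray(q_values, dtype=float)
    mask = np.asarray(in_sample_mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one action must be in the dataset")
    q_in = q[mask]
    shift = q_in.max()
    log_sum = np.log(np.sum(np.exp((q_in - shift) / temperature)))
    return float(shift + temperature * log_sum)


# Toy action values: the third action is never seen in the dataset and its
# value estimate is (hypothetically) badly overestimated.
q = np.array([1.0, 2.0, 10.0])
mask = np.array([True, True, False])

print("standard max:  ", q.max())                # picks the unseen action
print("in-sample max: ", in_sample_max(q, mask))  # ignores the unseen action
for tau in (1.0, 0.1, 0.01):
    print(f"in-sample softmax (tau={tau}):", in_sample_softmax(q, mask, tau))
```

In this toy setting the standard max bootstraps from the unseen action, while both in-sample operators ignore it, and the in-sample softmax moves toward the in-sample max as the temperature shrinks, in line with the limiting behavior stated in the abstract.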
