Bridging Imitation and Online Reinforcement Learning: An Optimistic Tale

Botao Hao, Rahul Jain, Dengwang Tang, Zheng Wen

From arXiv. Authors are listed in alphabetical order; correspondence to Rahul Jain.

In this paper, we address the following problem: given an offline demonstration dataset from an imperfect expert, what is the best way to leverage it to bootstrap online learning performance in MDPs? We first propose an Informed Posterior Sampling-based RL (iPSRL) algorithm that uses both the offline dataset and information about the expert's behavioral policy used to generate it. If the expert is competent enough, the algorithm's cumulative Bayesian regret goes to zero exponentially fast in N, the offline dataset size. Since this algorithm is computationally impractical, we then propose the iRLSVI algorithm, which can be seen as a combination of the RLSVI algorithm for online RL and imitation learning. Our empirical results show that the proposed iRLSVI algorithm achieves a significant reduction in regret compared to two baselines: one that uses no offline data, and one that uses the offline dataset without information about the generative policy. Our algorithm bridges online RL and imitation learning for the first time.
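The idea of combining offline rewards with knowledge of how the expert chose its actions can be illustrated on a toy problem. The sketch below is not the iPSRL or iRLSVI algorithm from the paper. It assumes a two-armed Bernoulli bandit (a one-state MDP), a finite set of candidate environments `HYPOTHESES`, and a softmax expert with a competence parameter `beta`; all of these names and modeling choices are hypothetical and chosen only for illustration.

```python
import math
import random

# Hypothetical finite set of candidate environments: each entry is a pair of
# Bernoulli reward means for a two-armed bandit (a one-state MDP).
HYPOTHESES = [(0.3, 0.7), (0.7, 0.3)]


def expert_action_prob(means, action, beta):
    """Assumed softmax expert: P(action) is proportional to exp(beta * mean)."""
    weights = [math.exp(beta * m) for m in means]
    return weights[action] / sum(weights)


def posterior(dataset, beta=None):
    """Posterior over HYPOTHESES, starting from a uniform prior.

    dataset: list of (action, reward) pairs with reward in {0, 1}.
    beta: if given, the likelihood of the expert's action choices is also
          used (the "informed" variant); if None, only rewards are used.
    """
    log_weights = []
    for means in HYPOTHESES:
        log_lik = 0.0
        for action, reward in dataset:
            p = means[action]
            log_lik += math.log(p if reward == 1 else 1.0 - p)
            if beta is not None:
                log_lik += math.log(expert_action_prob(means, action, beta))
        log_weights.append(log_lik)
    top = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_sample_action(post, rng):
    """Posterior sampling step: draw one hypothesis, then act greedily under it."""
    means = rng.choices(HYPOTHESES, weights=post, k=1)[0]
    return max(range(len(means)), key=lambda a: means[a])


# Toy offline dataset: the expert always pulled arm 1 and saw mostly successes.
offline = [(1, 1), (1, 0), (1, 1), (1, 1)]
uninformed = posterior(offline)           # rewards only
informed = posterior(offline, beta=5.0)   # rewards plus expert-policy likelihood
action = posterior_sample_action(informed, random.Random(0))
```

In this toy setting, the informed posterior puts more mass on the hypothesis where arm 1 is better, because the expert's repeated choice of arm 1 is itself evidence under the assumed softmax model. How quickly that mass concentrates depends on the assumed `beta` and on the dataset size.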
