Offline reinforcement learning is used to train policies in scenarios where real-time access to the environment is expensive or impossible. As a natural consequence of these harsh conditions, an agent may lack the resources to fully observe the online environment before taking an action. We dub this situation the resource-constrained setting. This leads to situations where the offline dataset (available for training) can contain fully processed features (using powerful language models, image models, complex sensors, etc.) which are not available when actions are actually taken online. This disconnect leads to an interesting and unexplored problem in offline RL: Is it possible to use a richly processed offline dataset to train a policy which has access to fewer features in the online environment? In this work, we introduce and formalize this novel resource-constrained problem setting. We highlight the performance gap between policies trained using the full offline dataset and policies trained using limited features. We address this performance gap with a policy transfer algorithm which first trains a teacher agent using the offline dataset where features are fully available, and then transfers this knowledge to a student agent that only uses the resource-constrained features. To better capture the challenge of this setting, we propose a data collection procedure: Resource-Constrained Datasets for RL (RC-D4RL). We evaluate our transfer algorithm on RC-D4RL and the popular D4RL benchmarks and observe consistent improvement over the baseline (TD3+BC without transfer). The code for the experiments is available at https://github.com/JayanthRR/RC-OfflineRL.
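The teacher-student transfer idea above can be sketched minimally as follows. This is not the paper's TD3+BC-based algorithm; it is a toy illustration, assuming a synthetic dataset, a linear stand-in for the teacher policy, and least-squares distillation of the teacher's actions onto the resource-constrained feature subset. All names (`states_full`, `d_limited`, `student_policy`, etc.) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical offline dataset: d_full features include richly processed
# ones (e.g. language/image-model embeddings) available only offline.
n, d_full = 1000, 8
d_limited = 3  # only the first 3 features are observable online
states_full = rng.normal(size=(n, d_full))

# Stand-in "teacher": a linear policy over the full features
# (in the paper the teacher is an offline-RL agent such as TD3+BC).
w_teacher = rng.normal(size=d_full)
teacher_actions = states_full @ w_teacher

# Student sees only the resource-constrained feature subset and is
# trained to imitate the teacher's actions (least-squares distillation).
states_limited = states_full[:, :d_limited]
w_student, *_ = np.linalg.lstsq(states_limited, teacher_actions, rcond=None)

def student_policy(obs_limited):
    # Online, the student acts from the limited features alone.
    return obs_limited @ w_student
```

The gap between `teacher_actions` and `student_policy(states_limited)` is an analogue of the performance gap the paper studies: information carried by the offline-only features cannot, in general, be fully recovered by the student.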