Offline RL With Resource Constrained Online Deployment

Offline reinforcement learning is used to train policies in scenarios where real-time access to the environment is expensive or impossible. As a natural consequence of these harsh conditions, an agent may lack the resources to fully observe the online environment before taking an action. We call this the resource-constrained setting. In it, the offline dataset available for training can contain fully processed features (produced by powerful language models, image models, complex sensors, etc.) that are not available when actions are actually taken online. This disconnect raises an interesting and unexplored problem in offline RL: is it possible to use a richly processed offline dataset to train a policy that has access to fewer features in the online environment? In this work, we introduce and formalize this novel resource-constrained problem setting. We highlight the performance gap between policies trained using the full offline dataset and policies trained using limited features. We address this gap with a policy transfer algorithm that first trains a teacher agent on the offline dataset, where features are fully available, and then transfers this knowledge to a student agent that uses only the resource-constrained features. To better capture the challenge of this setting, we propose a data collection procedure: Resource Constrained-Datasets for RL (RC-D4RL). We evaluate our transfer algorithm on RC-D4RL and the popular D4RL benchmarks and observe consistent improvement over the baseline (TD3+BC without transfer). The code for the experiments is available at https://github.com/JayanthRR/RC-OfflineRL.
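The teacher-to-student transfer described above can be illustrated with a deliberately simplified sketch. It is not the TD3+BC-based algorithm from the paper: it assumes linear policies fitted by ridge regression, synthetic data, and a hypothetical `mix` weight that blends dataset actions with teacher actions. The names `fit_linear_policy`, `train_teacher_student`, `visible_idx`, and `mix` are illustrative only.

```python
import numpy as np


def fit_linear_policy(features, targets, ridge=1e-3):
    """Ridge-regression fit of a linear policy: action = features @ W."""
    dim = features.shape[1]
    gram = features.T @ features + ridge * np.eye(dim)
    return np.linalg.solve(gram, features.T @ targets)


def train_teacher_student(states, actions, visible_idx, mix=0.5):
    """Teacher sees every feature; student sees only the visible_idx columns.

    The student regresses onto a blend of dataset actions and teacher
    actions. Up to an additive constant, this equals minimising
    (1 - mix) * ||a_s - a||^2 + mix * ||a_s - a_teacher||^2.
    `mix` is a hypothetical hyperparameter, not taken from the paper.
    """
    teacher_w = fit_linear_policy(states, actions)
    teacher_actions = states @ teacher_w
    student_targets = (1.0 - mix) * actions + mix * teacher_actions
    student_w = fit_linear_policy(states[:, visible_idx], student_targets)
    return teacher_w, student_w


# Synthetic offline dataset (assumption): 6 state features, 2 action dims.
rng = np.random.default_rng(0)
states = rng.normal(size=(512, 6))
true_w = rng.normal(size=(6, 2))
actions = states @ true_w + 0.1 * rng.normal(size=(512, 2))

# Only the first three features are observable at deployment time.
visible_idx = [0, 1, 2]
teacher_w, student_w = train_teacher_student(states, actions, visible_idx)

# Online, the student acts from the restricted features alone.
student_actions = states[:, visible_idx] @ student_w
teacher_mse = float(np.mean((states @ teacher_w - actions) ** 2))
student_mse = float(np.mean((student_actions - actions) ** 2))
```

In this toy setup the student's error stays above the teacher's because half of the features are hidden from it, which mirrors the performance gap the abstract refers to. The blended regression target is only one simple way to pass teacher knowledge to a student.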
