Deep reinforcement learning (DRL) is one of the most powerful tools for synthesizing complex robotic behaviors. But training DRL models is incredibly compute- and memory-intensive, requiring large training datasets and replay buffers to achieve performant results. This poses a challenge for the next generation of field robots that will need to learn on the edge to adapt to their environment. In this paper, we begin to address this issue through observation space quantization. We evaluate our approach using four simulated robot locomotion tasks and two state-of-the-art DRL algorithms, the on-policy Proximal Policy Optimization (PPO) and the off-policy Soft Actor-Critic (SAC), and find that observation space quantization reduces overall memory costs by as much as 4.2x without impacting learning performance.
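As a rough illustration of why quantizing observations shrinks replay-buffer memory, the sketch below stores each float32 observation as a uint8 code using a simple affine (min-max) quantizer and dequantizes it when sampled. This is a minimal example under assumed observation bounds (`obs_low`, `obs_high` are hypothetical placeholders), not the paper's exact quantization scheme.

```python
# Minimal sketch (not the paper's exact method): quantize float32
# observations to uint8 before storing them in a replay buffer,
# giving roughly a 4x reduction in per-observation storage.
import numpy as np

def quantize(obs, obs_low, obs_high, levels=256):
    """Map a float32 observation onto `levels` integer bins (uint8)."""
    scale = (obs_high - obs_low) / (levels - 1)
    q = np.round((obs - obs_low) / scale)
    return np.clip(q, 0, levels - 1).astype(np.uint8)

def dequantize(q, obs_low, obs_high, levels=256):
    """Recover an approximate float32 observation from its uint8 code."""
    scale = (obs_high - obs_low) / (levels - 1)
    return (q.astype(np.float32) * scale + obs_low).astype(np.float32)

# Example usage with assumed observation bounds.
obs_low, obs_high = -10.0, 10.0
obs = np.random.uniform(obs_low, obs_high, size=(17,)).astype(np.float32)
stored = quantize(obs, obs_low, obs_high)        # 17 bytes instead of 68
recovered = dequantize(stored, obs_low, obs_high)
print(np.max(np.abs(obs - recovered)))           # bounded quantization error
```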