Reinforcement learning in large-scale environments is challenging due to the many possible actions that can be taken in specific situations. We have previously developed a means of constraining, and hence speeding up, the search process through the use of motion primitives; motion primitives are sequences of pre-specified actions taken across a series of states. As a byproduct of this work, we have found that if the motion primitives' motions and actions are labeled, then the search can be sped up further. Since motion primitives may initially lack such details, we propose a theoretically viewpoint- and speed-insensitive means of automatically annotating the underlying motions and actions. We do this through a differential-geometric, spatio-temporal kinematics descriptor, which analyzes how the poses of entities in two motion sequences change over time. We use this descriptor in conjunction with a weighted-nearest-neighbor classifier to label the primitives using a limited set of training examples. In our experiments, we achieve high motion and action annotation rates for human-action-derived primitives with as few as one training sample. We also demonstrate that reinforcement learning using accurately labeled trajectories leads to high-performing policies more quickly than standard reinforcement learning techniques. This is partly because motion primitives encode prior domain knowledge and preempt the need to re-discover that knowledge during training. It is also because agents can leverage the labels to systematically ignore action classes that do not facilitate task objectives, thereby reducing the action space.
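The labeling step described above can be illustrated with a minimal sketch. The abstract does not specify the descriptor's form or the classifier's weighting scheme, so the sketch below assumes that each primitive's spatio-temporal descriptor has already been reduced to a fixed-length vector and uses a standard inverse-distance-weighted k-nearest-neighbor vote; the function name and parameters are illustrative, not taken from the paper.

```python
import numpy as np

def weighted_nn_label(query, train_descriptors, train_labels, k=3, eps=1e-8):
    """Label one unlabeled primitive by distance-weighted k-NN voting.

    query             : 1-D descriptor vector of the unlabeled primitive.
    train_descriptors : (n, d) array of descriptors for labeled primitives.
    train_labels      : length-n sequence of motion/action labels.
    """
    # Euclidean distance from the query to every labeled descriptor.
    dists = np.linalg.norm(train_descriptors - query, axis=1)
    k = min(k, len(train_labels))
    nearest = np.argsort(dists)[:k]
    # Each neighbor votes with weight 1/distance, so closer labeled
    # examples dominate; this degrades gracefully to 1-NN when only
    # one training sample per class is available.
    votes = {}
    for i in nearest:
        votes[train_labels[i]] = votes.get(train_labels[i], 0.0) + 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)
```

Because the vote is distance-weighted rather than a plain majority, a single close training example can outvote several distant ones, which is consistent with the few-shot setting (as few as one training sample) reported in the experiments.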