Autonomous robot-assisted feeding requires the ability to acquire a wide variety of food items. However, it is impossible to train such a system on every type of food in existence. A key challenge, therefore, is choosing a manipulation strategy for a previously unseen food item. Previous work showed that this problem can be represented as a linear bandit with visual context. However, food has a wide variety of multi-modal properties relevant to manipulation that can be hard to distinguish visually. Our key insight is that we can leverage the haptic context we collect during and after manipulation (i.e., "post hoc" context) to learn some of these properties and more quickly adapt our visual model to previously unseen food. More generally, we propose a modified linear contextual bandit framework augmented with post hoc context observed after action selection, which empirically increases learning speed and reduces cumulative regret. Experiments on synthetic data demonstrate that this effect is more pronounced when the dimensionality of the visual context is large relative to that of the post hoc context, or when the post hoc context model is particularly easy to learn. Finally, we apply this framework to the bite acquisition problem and demonstrate the acquisition of 8 previously unseen types of food with 21% fewer failures across 64 attempts.
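To make the framework concrete, the sketch below is a minimal, illustrative instantiation (not the paper's released implementation). It assumes the reward is linear in a low-dimensional post hoc context z, and that z is itself a linear function of the visual context c for each action; both maps are fit with ridge regression, and actions are chosen with a LinUCB-style exploration bonus. The class name `PostHocLinUCB` and all hyperparameters are our own labels for illustration.

```python
# Hypothetical sketch of a linear contextual bandit augmented with post hoc
# context. Assumed model: reward r = theta^T z + noise, where the post hoc
# (e.g., haptic) context z = W_a^T c + noise depends on the visual context c
# and the chosen action a. Not the authors' code.
import numpy as np

class PostHocLinUCB:
    def __init__(self, n_actions, ctx_dim, posthoc_dim, alpha=1.0, reg=1.0):
        self.n_actions = n_actions
        self.alpha = alpha  # exploration-bonus scale
        # Per-action ridge statistics for the map c -> z (ctx_dim x posthoc_dim).
        self.A = [reg * np.eye(ctx_dim) for _ in range(n_actions)]
        self.B = [np.zeros((ctx_dim, posthoc_dim)) for _ in range(n_actions)]
        # Shared ridge statistics for the low-dimensional map z -> r.
        self.Az = reg * np.eye(posthoc_dim)
        self.bz = np.zeros(posthoc_dim)

    def select(self, c):
        theta = np.linalg.solve(self.Az, self.bz)      # estimate of z -> r
        scores = []
        for a in range(self.n_actions):
            W = np.linalg.solve(self.A[a], self.B[a])  # estimate of c -> z
            z_hat = W.T @ c                            # predicted post hoc context
            bonus = self.alpha * np.sqrt(c @ np.linalg.solve(self.A[a], c))
            scores.append(theta @ z_hat + bonus)       # optimistic value
        return int(np.argmax(scores))

    def update(self, a, c, z, r):
        # z (haptic reading) and r (acquisition outcome) arrive after acting.
        self.A[a] += np.outer(c, c)
        self.B[a] += np.outer(c, z)
        self.Az += np.outer(z, z)
        self.bz += r * z
```

In this sketch the z -> r regression is shared across actions and lives in the low-dimensional post hoc space, so it converges quickly; only the per-action c -> z maps must be learned in the high-dimensional visual space. This mirrors the regime identified in the abstract, where the gains are largest when the visual context is high-dimensional relative to the post hoc context or the post hoc model is easy to learn.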