This paper presents a framework for learning visual representations from unlabeled video demonstrations captured from multiple viewpoints. We show that these representations are applicable to imitating several robotic tasks, including pick and place. We improve a recently proposed self-supervised learning algorithm by applying contrastive learning to enhance task-relevant information while suppressing task-irrelevant information in the feature embeddings. We validate the proposed method on the publicly available Multi-View Pouring dataset and a custom Pick and Place dataset, and compare it with the TCN triplet baseline. We evaluate the learned representations using three metrics: viewpoint alignment, stage classification, and reinforcement learning. In all cases the results improve over state-of-the-art approaches, with the added benefit of a reduced number of training iterations.
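For context, the TCN triplet baseline mentioned above trains an encoder so that frames captured at the same moment from different viewpoints embed close together, while temporally distant frames from the same viewpoint are pushed apart. The sketch below illustrates that objective only, not the paper's specific contrastive variant; the encoder architecture, embedding size, and tensor shapes are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def tcn_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet objective in the style of the TCN baseline.

    anchor / positive: embeddings of the same time step from two viewpoints.
    negative: embedding of a temporally distant frame from the anchor's view.
    """
    d_pos = F.pairwise_distance(anchor, positive)  # same moment, different view
    d_neg = F.pairwise_distance(anchor, negative)  # distant moment, same view
    return F.relu(d_pos - d_neg + margin).mean()

# Hypothetical usage with an arbitrary encoder producing 32-D embeddings.
if __name__ == "__main__":
    encoder = torch.nn.Sequential(torch.nn.Flatten(),
                                  torch.nn.Linear(3 * 64 * 64, 32))
    frames_view1 = torch.randn(8, 3, 64, 64)  # anchor frames, viewpoint 1
    frames_view2 = torch.randn(8, 3, 64, 64)  # same time steps, viewpoint 2
    frames_far = torch.randn(8, 3, 64, 64)    # distant time steps, viewpoint 1
    loss = tcn_triplet_loss(encoder(frames_view1),
                            encoder(frames_view2),
                            encoder(frames_far))
    loss.backward()
```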