We study how the choice of input point cloud coordinate frame impacts the learning of manipulation skills from 3D point clouds. Captured robot-object-interaction point clouds can be normalized in a variety of coordinate frames. We find that the choice of frame has a profound effect on agent learning performance, and that the trend is similar across 3D backbone networks. In particular, the end-effector frame and the target-part frame achieve higher training efficiency than the commonly used world frame and robot-base frame in many tasks, intuitively because they align point clouds across time steps and thus simplify visual module learning. Moreover, the well-performing frames vary across tasks, and some tasks may benefit from multiple candidate frames. We thus propose FrameMiners to adaptively select candidate frames and fuse their merits in a task-agnostic manner. Experimentally, FrameMiners achieves on-par or significantly higher performance than the best single-frame version on five fully physical manipulation tasks adapted from ManiSkill and OCRTOC. Because it requires neither changing existing camera placements nor adding extra cameras, point cloud frame mining can serve as a free lunch to improve 3D manipulation learning.
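To make the notion of a coordinate frame choice concrete, the following is a minimal sketch of re-expressing a world-frame point cloud in another frame, such as the end-effector frame. It is an illustration under stated assumptions, not the authors' implementation: the function name `to_frame` and the convention that a frame is given as a 4x4 homogeneous pose mapping frame coordinates to world coordinates are hypothetical choices made here.

```python
import numpy as np


def to_frame(points_world: np.ndarray, frame_pose_world: np.ndarray) -> np.ndarray:
    """Express world-frame points in a target frame (hypothetical helper).

    points_world:     (N, 3) array of points in world coordinates.
    frame_pose_world: (4, 4) homogeneous pose of the target frame, assumed to
                      map frame coordinates to world coordinates.
    """
    R = frame_pose_world[:3, :3]  # rotation of the frame in the world
    t = frame_pose_world[:3, 3]   # origin of the frame in the world
    # p_world = R @ p_frame + t, so p_frame = R.T @ (p_world - t).
    # For row-vector points this is (p_world - t) @ R.
    return (points_world - t) @ R


# Example: an assumed end-effector pose, rotated 90 degrees about z
# and translated by 1 along x.
ee_pose = np.array([
    [0.0, -1.0, 0.0, 1.0],
    [1.0,  0.0, 0.0, 0.0],
    [0.0,  0.0, 1.0, 0.0],
    [0.0,  0.0, 0.0, 1.0],
])
cloud_world = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
cloud_ee = to_frame(cloud_world, ee_pose)
```

In a frame attached to the end-effector, the gripper stays near the origin at every time step, which is the kind of cross-time-step alignment the abstract refers to.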