Accurate estimation of the relative pose between an object and a robot hand is critical for many manipulation tasks. However, most existing object-in-hand pose datasets use two-finger grippers and assume that the object remains fixed in the hand without any relative movement, which is not representative of real-world scenarios. To address this issue, we propose a 6D object-in-hand pose dataset collected through teleoperation of an anthropomorphic Shadow Dexterous Hand. Our dataset comprises RGB-D images, proprioception, and tactile data, covering diverse grasping poses, finger contact states, and object occlusions. To overcome the significant hand occlusion and limited tactile sensor contact encountered in real-world scenarios, we propose PoseFusion, a hybrid multi-modal fusion approach that integrates information from the visual and tactile perception channels. PoseFusion generates three candidate object poses from three estimators (tactile-only, visual-only, and visuo-tactile fusion), which are then filtered by a SelectLSTM network that selects the optimal pose, avoiding inferior fused poses caused by modality collapse. Extensive experiments demonstrate the robustness and advantages of our framework. All data and code are available on the project website: https://elevenjiang1.github.io/ObjectInHand-Dataset/
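To make the candidate-selection idea concrete, the sketch below illustrates one way a SelectLSTM-style classifier could pick among the three candidate poses. The pose encoding (translation plus quaternion), the use of a short temporal history, and all layer sizes are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch, assuming PyTorch: three pose estimates (tactile-only,
# visual-only, visuo-tactile fusion) are fed to an LSTM-based classifier
# that outputs the index of the best candidate for the current frame.
import torch
import torch.nn as nn

class SelectLSTMSketch(nn.Module):
    def __init__(self, pose_dim: int = 7, n_candidates: int = 3, hidden: int = 64):
        super().__init__()
        # Each time step concatenates the three candidate poses
        # (7-D translation + quaternion each, an assumed encoding).
        self.lstm = nn.LSTM(input_size=pose_dim * n_candidates,
                            hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_candidates)  # logits over the candidates

    def forward(self, candidates: torch.Tensor) -> torch.Tensor:
        # candidates: (batch, time, n_candidates * pose_dim)
        out, _ = self.lstm(candidates)
        return self.head(out[:, -1])  # classify from the last time step

# Usage: pick the candidate estimator with the highest logit.
model = SelectLSTMSketch()
seq = torch.randn(1, 10, 21)          # 10-frame history of 3 stacked 7-D poses
best_idx = model(seq).argmax(dim=-1)  # index of the selected pose estimate
```

The selection head acts as a gate over complete candidate poses rather than averaging them, which is consistent with the abstract's goal of avoiding inferior fused poses when one modality degrades.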