Human Pose Estimation (HPE) based on RGB images has experienced rapid development, benefiting from deep learning. However, event-based HPE has not been fully studied, although it holds great potential for applications in extreme scenes and efficiency-critical conditions. In this paper, we are the first to estimate 2D human pose directly from 3D event point clouds. We propose a novel representation of events, the rasterized event point cloud, which aggregates events at the same position within a small time slice. It maintains the 3D features from multiple statistical cues and significantly reduces memory consumption and computational complexity, which proves efficient in our work. We then feed the rasterized event point cloud as input to three different backbones, PointNet, DGCNN, and Point Transformer, each with a two-linear-layer decoder to predict the locations of human keypoints. We find that, with our method, PointNet achieves promising results at much faster speed, whereas Point Transformer reaches much higher accuracy, even close to previous event-frame-based methods. A comprehensive set of results demonstrates that our proposed method is consistently effective across these 3D backbone models for event-driven human pose estimation. Our method based on PointNet with 2048 input points achieves 82.46 mm in MPJPE3D on the DHP19 dataset, with a latency of only 12.29 ms on an NVIDIA Jetson Xavier NX edge computing platform, making it ideally suited for real-time detection with event cameras. Code is available at https://github.com/MasterHow/EventPointPose.
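To make the rasterization idea concrete, below is a minimal NumPy sketch of aggregating events that share the same pixel within a small time slice. It assumes events are given as (x, y, t, p) tuples and uses event count, mean timestamp, and polarity sum as the per-point statistical cues; these particular cues, the function name, and the feature layout are illustrative assumptions based on the abstract's description, not the paper's exact implementation.

```python
import numpy as np

def rasterize_event_point_cloud(events, num_slices=4):
    """Aggregate events at the same (x, y) pixel within each time slice.

    events: (N, 4) array of (x, y, t, p), polarity p in {-1, +1}.
    Returns an (M, 6) array of rasterized points:
    (x, y, slice_index, event_count, mean_timestamp, polarity_sum).
    Hypothetical feature layout; the paper may use different cues.
    """
    x, y, t, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3]
    t0, t1 = t.min(), t.max()
    # Assign each event to one of `num_slices` small time slices.
    s = ((t - t0) / max(t1 - t0, 1e-9) * num_slices).astype(int)
    s = np.clip(s, 0, num_slices - 1)
    # Group events by (x, y, slice): one rasterized point per occupied cell.
    keys = np.stack([x.astype(int), y.astype(int), s], axis=1)
    uniq, inv = np.unique(keys, axis=0, return_inverse=True)
    count = np.bincount(inv).astype(float)
    mean_t = np.bincount(inv, weights=t) / count
    pol_sum = np.bincount(inv, weights=p)
    feats = np.stack([count, mean_t, pol_sum], axis=1)
    return np.concatenate([uniq.astype(float), feats], axis=1)
```

Because many raw events collapse into one point per occupied (pixel, slice) cell, the resulting cloud is much smaller than the raw event stream, which is consistent with the reduced memory and computation the abstract reports.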