3D hand pose estimation from monocular videos is a long-standing and challenging problem that is currently seeing a strong upturn in interest. In this work, we address it for the first time using a single event camera, i.e., an asynchronous vision sensor reacting to brightness changes. Our EventHands approach has characteristics previously not demonstrated with a single RGB or depth camera, such as high temporal resolution at low data throughput and real-time performance at 1000 Hz. Because event cameras produce a different data modality than classical cameras, existing methods cannot be directly applied to or retrained on event streams. We thus design a new neural approach that operates on a novel event stream representation suitable for learning, is trained on newly generated synthetic event streams, and generalises to real data. Experiments show that EventHands outperforms recent monocular methods using a colour (or depth) camera in terms of accuracy and in its ability to capture hand motions of unprecedented speed. Our method, the event stream simulator and the dataset will be made publicly available.
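To make the data modality concrete: an event camera emits a sparse, asynchronous stream of tuples, conventionally (x, y, timestamp, polarity), rather than dense frames. The sketch below shows one common way such a stream can be accumulated into a fixed-size, 2-channel image for a neural network, with recent events weighted more strongly. This is an illustrative time-surface construction under standard assumptions about the event format, not the exact representation proposed in the paper; all names (`events_to_time_surface`, the `events` dict layout) are hypothetical.

```python
import numpy as np

def events_to_time_surface(events, height, width, window):
    """Accumulate a temporal window of events into a 2-channel image.

    One channel per polarity; each pixel stores a normalised timestamp
    of the most recent event there, so newer events dominate. Purely
    illustrative -- not the paper's actual representation.
    """
    surface = np.zeros((2, height, width), dtype=np.float32)
    t_end = events["t"].max()
    for x, y, t, p in zip(events["x"], events["y"], events["t"], events["p"]):
        if t_end - t <= window:
            channel = 1 if p > 0 else 0  # split by polarity (ON/OFF)
            surface[channel, y, x] = 1.0 - (t_end - t) / window
    return surface

# Usage: 1000 random synthetic events on a hypothetical 240x180 sensor.
rng = np.random.default_rng(0)
events = {
    "x": rng.integers(0, 240, 1000),
    "y": rng.integers(0, 180, 1000),
    "t": np.sort(rng.uniform(0.0, 0.1, 1000)),  # seconds
    "p": rng.choice([-1, 1], 1000),              # event polarity
}
img = events_to_time_surface(events, height=180, width=240, window=0.01)
```

Because the window length is a free parameter, such a representation can be rebuilt at an arbitrary rate from the asynchronous stream, which is what makes very high inference rates (e.g. 1000 Hz) plausible at low data throughput.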