PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound

Reconstructing the 3D pose of a person in metric scale from a single-view image is a geometrically ill-posed problem. For example, we cannot measure the exact distance from a person to the camera in a single-view image without additional scene assumptions (e.g., known height). Existing learning-based approaches circumvent this issue by reconstructing the 3D pose only up to scale. However, many applications, such as virtual telepresence, robotics, and augmented reality, require metric-scale reconstruction. In this paper, we show that audio signals recorded along with an image provide complementary information for reconstructing the metric 3D pose of the person. The key insight is that as audio signals traverse the 3D space, their interactions with the body provide metric information about the body's pose. Based on this insight, we introduce a time-invariant transfer function called the pose kernel: the impulse response of audio signals induced by the body pose. The main properties of the pose kernel are that (1) its envelope correlates highly with the 3D pose, (2) its time response corresponds to arrival time, indicating the metric distance to the microphone, and (3) it is invariant to changes in scene geometry. Therefore, it generalizes readily to unseen scenes. We design a multi-stage 3D CNN that fuses audio and visual signals and learns to reconstruct the 3D pose in metric scale. We show that our multi-modal method produces accurate metric reconstruction in real-world scenes, which is not possible with state-of-the-art lifting approaches, including parametric mesh regression and depth regression.
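Property (2) above can be illustrated with a minimal sketch on synthetic data. This is not the paper's pipeline: the regularized frequency-domain deconvolution, the constants (343 m/s speed of sound, 48 kHz sample rate), the single-reflection kernel, and every function name below are assumptions chosen for illustration only.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 degrees C
SAMPLE_RATE = 48_000    # Hz; assumed recording rate


def estimate_impulse_response(source, received, eps=1e-3):
    """Estimate an impulse response by regularized deconvolution.

    Hypothetical helper: divides the spectrum of `received` by that of
    `source`, with `eps` guarding against near-zero frequency bins.
    """
    n = len(source) + len(received) - 1
    S = np.fft.rfft(source, n)
    R = np.fft.rfft(received, n)
    H = R * np.conj(S) / (np.abs(S) ** 2 + eps)
    return np.fft.irfft(H, n)


def arrival_path_length(kernel, sample_rate=SAMPLE_RATE, c=SPEED_OF_SOUND):
    """Convert the strongest arrival in `kernel` to a path length in meters."""
    peak = int(np.argmax(np.abs(kernel)))
    return peak / sample_rate * c


# Synthetic example: a single reflection arriving 420 samples after emission.
rng = np.random.default_rng(0)
source = rng.standard_normal(4800)   # 0.1 s of white noise as the emitted signal
true_kernel = np.zeros(1024)
true_kernel[420] = 1.0               # toy stand-in for a body-induced response
received = np.convolve(source, true_kernel)

kernel = estimate_impulse_response(source, received)
path_m = arrival_path_length(kernel)
print(f"estimated propagation path: {path_m:.3f} m")
```

The recovered length is the total propagation path of the sound (emitter to reflector to microphone). Turning it into a body-to-microphone distance would require the speaker and microphone geometry, which the abstract does not specify.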
