Given an image with multiple people, our goal is to directly regress the pose and shape of all the people as well as their relative depth. Inferring the depth of a person in an image, however, is fundamentally ambiguous without knowing their height. This is particularly problematic when the scene contains people of very different sizes, e.g. from infants to adults. Solving this requires three things. First, we develop a novel method to infer the poses and depth of multiple people in a single image. While previous multi-person methods reason in the image plane, our method, called BEV, adds an imaginary Bird's-Eye-View representation to explicitly reason about depth. BEV reasons simultaneously about body centers in the image and in depth and, by combining these, estimates 3D body position. Unlike prior work, BEV is a single-shot method that is end-to-end differentiable. Second, height varies with age, making it impossible to resolve depth without also estimating the age of people in the image. To do so, we exploit a 3D body model space that lets BEV infer shapes from infants to adults. Third, to train BEV, we need a new dataset. Specifically, we create a "Relative Human" (RH) dataset that includes age labels and relative depth relationships between the people in the images. Extensive experiments on RH and AGORA demonstrate the effectiveness of the model and training scheme. BEV outperforms existing methods on depth reasoning, child shape estimation, and robustness to occlusion. The code and dataset are released for research purposes.
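The height and depth ambiguity mentioned above can be illustrated with a simple pinhole camera model, in which a person's height in pixels is proportional to their real height divided by their depth. The sketch below is only an illustration and not part of BEV: the function name `depth_from_height`, the focal length, and the body heights are all assumed values chosen for the example.

```python
import math


def depth_from_height(focal_px: float, body_height_m: float, pixel_height: float) -> float:
    """Depth implied by a pinhole camera: pixel_height = focal_px * body_height_m / depth."""
    return focal_px * body_height_m / pixel_height


# Assumed values for illustration only (not taken from the paper).
focal_px = 1000.0      # hypothetical focal length in pixels
pixel_height = 360.0   # observed height of one person in the image, in pixels

# The same observation is explained equally well by two different hypotheses.
adult_depth = depth_from_height(focal_px, 1.8, pixel_height)  # 1.8 m tall adult
child_depth = depth_from_height(focal_px, 0.9, pixel_height)  # 0.9 m tall child

print(f"adult hypothesis: {adult_depth:.2f} m away")
print(f"child hypothesis: {child_depth:.2f} m away")

# Halving the assumed height halves the inferred depth.
assert math.isclose(adult_depth, 2.0 * child_depth)
```

Under these assumptions, a single pixel height is consistent with an adult standing twice as far away as a child, which is why an estimate of age or body shape is needed before depth can be resolved.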