Large-scale capture of human motion in diverse, complex scenes, while immensely useful, is often considered prohibitively costly. Meanwhile, human motion alone contains rich information about the scenes in which humans reside and interact. For example, a sitting human suggests the existence of a chair, and the person's leg position further implies the chair's pose. In this paper, we propose to synthesize diverse, semantically reasonable, and physically plausible scenes based on human motion. Our framework, Scene Synthesis from HUMan MotiON (SUMMON), consists of two steps. It first uses ContactFormer, our newly introduced contact predictor, to obtain temporally consistent contact labels from human motion. Based on these predictions, SUMMON then selects interacting objects and optimizes physical plausibility losses; it further populates the scene with objects that do not interact with humans. Experimental results demonstrate that SUMMON synthesizes feasible, plausible, and diverse scenes and has the potential to generate extensive human-scene interaction data for the community.
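To make the two-step pipeline concrete, the sketch below illustrates the overall data flow under stated assumptions: a contact-prediction stage followed by object selection and placement. All names (`predict_contacts`, `fit_objects`, `SceneObject`) and the toy pelvis-height heuristic are hypothetical placeholders for illustration only; they are not the authors' released API or the actual ContactFormer model.

```python
# Hypothetical sketch of the two-stage SUMMON pipeline described above.
# Names and heuristics here are placeholders, not the authors' implementation.

from dataclasses import dataclass
from typing import List


@dataclass
class SceneObject:
    category: str          # e.g. "chair", "sofa"
    pose: List[float]      # assumed 6-DoF pose [x, y, z, rx, ry, rz]


def predict_contacts(motion_frames: List[List[float]]) -> List[str]:
    """Stage 1: contact labels per frame of human motion.

    Stands in for ContactFormer; here a frame is marked as 'chair' contact
    when the pelvis height drops below a threshold (a toy heuristic).
    """
    labels = []
    for frame in motion_frames:
        pelvis_height = frame[2]                  # assume frame = [x, y, z_pelvis]
        labels.append("chair" if pelvis_height < 0.6 else "none")
    return labels


def fit_objects(motion_frames: List[List[float]],
                contact_labels: List[str]) -> List[SceneObject]:
    """Stage 2: choose object categories from contact labels and place them.

    In the real system the object pose is optimized with physical
    plausibility losses; here we simply drop a chair under the first
    sitting frame as a placeholder for that optimization.
    """
    scene = []
    if "chair" in contact_labels:
        i = contact_labels.index("chair")
        x, y, _ = motion_frames[i][:3]
        scene.append(SceneObject("chair", [x, y, 0.0, 0.0, 0.0, 0.0]))
    return scene


if __name__ == "__main__":
    # Two toy frames: standing (pelvis ~0.9 m), then sitting (pelvis ~0.45 m).
    motion = [[0.0, 0.0, 0.9], [0.1, 0.0, 0.45]]
    labels = predict_contacts(motion)
    print(fit_objects(motion, labels))
```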