Compositional Human-Scene Interaction Synthesis with Semantic Control

Synthesizing natural interactions between virtual humans and their 3D environments is critical for numerous applications, such as computer games and AR/VR experiences. Our goal is to synthesize humans interacting with a given 3D scene, controlled by high-level semantic specifications given as pairs of action categories and object instances, e.g., "sit on the chair". The key challenge of incorporating interaction semantics into the generation framework is to learn a joint representation that effectively captures heterogeneous information, including human body articulation, 3D object geometry, and the intent of the interaction. To address this challenge, we design a novel transformer-based generative model, in which the articulated 3D human body surface points and 3D objects are jointly encoded in a unified latent space, and the semantics of the interaction between the human and objects are embedded via positional encoding. Furthermore, inspired by the compositional nature of interactions, in which humans can simultaneously interact with multiple objects, we define interaction semantics as the composition of varying numbers of atomic action-object pairs. Our proposed generative model can naturally incorporate varying numbers of atomic interactions, which enables synthesizing compositional human-scene interactions without requiring composite interaction data. We extend the PROX dataset with interaction semantic labels and scene instance segmentation to evaluate our method, and we demonstrate that it can generate realistic human-scene interactions with semantic control. Our perceptual study shows that our synthesized virtual humans can naturally interact with 3D scenes, considerably outperforming existing methods. We name our method COINS, for COmpositional INteraction Synthesis with Semantic Control. Code and data are available at https://github.com/zkf1997/COINS.
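As a rough illustration of what "a composition of atomic action-object pairs" could look like as a data structure, the following minimal sketch represents a semantic specification as a set of (action, object instance) pairs and attaches an action code to the points of each referenced object. It is not taken from the COINS code base: the action vocabulary `ACTIONS`, the class `AtomicInteraction`, and the helpers `compose`, `action_one_hot`, and `tag_object_tokens` are hypothetical names, and the one-hot code concatenated to each point is a simplified stand-in for the learned embedding injected via positional encoding that the abstract describes.

```python
from dataclasses import dataclass

# Hypothetical action vocabulary, for illustration only.
ACTIONS = ("sit on", "touch", "lie on", "stand on")


@dataclass(frozen=True)
class AtomicInteraction:
    action: str      # action category, e.g. "sit on"
    object_id: int   # index of an object instance in the scene segmentation


def compose(*atoms):
    """Combine atomic interactions into one order-independent specification."""
    for atom in atoms:
        if atom.action not in ACTIONS:
            raise ValueError(f"unknown action: {atom.action!r}")
    return frozenset(atoms)


def action_one_hot(action):
    """One-hot code for an action (assumed stand-in for a learned embedding)."""
    return [1.0 if a == action else 0.0 for a in ACTIONS]


def tag_object_tokens(spec, object_points):
    """Append the action code of each atomic interaction to its object's points.

    object_points maps an object id to a list of (x, y, z) tuples.
    Returns one token (a flat list of floats) per object point.
    """
    tokens = []
    # Sort only to make the output order deterministic.
    for atom in sorted(spec, key=lambda a: (a.object_id, a.action)):
        code = action_one_hot(atom.action)
        for point in object_points[atom.object_id]:
            tokens.append(list(point) + code)
    return tokens


# Composite specification: "sit on object 0" and "touch object 1".
spec = compose(AtomicInteraction("sit on", 0), AtomicInteraction("touch", 1))
points = {0: [(0.0, 0.0, 0.4)], 1: [(0.5, 0.0, 0.7), (0.6, 0.0, 0.7)]}
tokens = tag_object_tokens(spec, points)
```

Because the specification is just a set of atomic pairs, a single-object interaction and a multi-object interaction share the same representation and differ only in how many pairs they contain, which mirrors the variable-length input the abstract attributes to the model.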
