Recent successes in image synthesis are powered by large-scale diffusion models. However, most methods are currently limited to either text- or image-conditioned generation for synthesizing an entire image, texture transfer, or inserting objects into a user-specified region. In contrast, in this work we focus on synthesizing complex interactions (i.e., an articulated hand) with a given object. Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it. We propose a two-step generative approach: a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object given the predicted layout. Both are built on top of a large-scale pretrained diffusion model to make use of its latent representation. Compared to baselines, the proposed method is shown to generalize better to novel objects and to perform surprisingly well on out-of-distribution, in-the-wild scenes of portable-sized objects. The resulting system allows us to predict descriptive affordance information, such as hand articulation and approaching orientation. Project page: https://judyye.github.io/affordiffusion-www
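To make the two-step design concrete, below is a minimal sketch of the inference flow: LayoutNet first predicts a hand-object-interaction layout from the object image, and ContentNet then synthesizes the interaction image conditioned on that layout. The module bodies are illustrative stand-ins (simple convolutional and linear stubs), not the paper's diffusion-based implementation; the tensor shapes and the layout parameterization are assumptions made for the example.

```python
# Illustrative sketch of the two-step pipeline described in the abstract.
# LayoutNet/ContentNet here are hypothetical stand-ins; the actual method
# builds both stages on a large-scale pretrained diffusion model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutNet(nn.Module):
    """Step 1 (illustrative): predict an articulation-agnostic layout
    (a small parameter vector, e.g. hand box + approach direction)
    from an RGB object image."""
    def __init__(self, feat_dim: int = 256, layout_dim: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=8, stride=8),  # coarse patch features
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(feat_dim, layout_dim)

    def forward(self, object_image: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(object_image))  # (B, layout_dim)

class ContentNet(nn.Module):
    """Step 2 (illustrative): synthesize a hand-object-interaction image
    conditioned on the object image and the sampled layout."""
    def __init__(self, layout_dim: int = 5, out_size: int = 64):
        super().__init__()
        self.out_size = out_size
        self.layout_proj = nn.Linear(layout_dim, 3 * out_size * out_size)

    def forward(self, object_image: torch.Tensor, layout: torch.Tensor) -> torch.Tensor:
        b = object_image.shape[0]
        layout_map = self.layout_proj(layout).view(b, 3, self.out_size, self.out_size)
        obj_small = F.interpolate(object_image, size=(self.out_size, self.out_size))
        return torch.sigmoid(obj_small + layout_map)  # hallucinated HOI image

if __name__ == "__main__":
    obj = torch.rand(1, 3, 256, 256)   # RGB image of the object
    layout = LayoutNet()(obj)          # step 1: predict layout
    hoi = ContentNet()(obj, layout)    # step 2: synthesize content given layout
    print(layout.shape, hoi.shape)     # (1, 5) and (1, 3, 64, 64)
```

The key design choice this sketch reflects is the factorization: where the hand goes (layout) is decided before what the grasp looks like (content), which lets the second stage focus on appearance and articulation given a fixed placement.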