Recent successes in image synthesis are powered by large-scale diffusion models. However, most methods are currently limited to either text- or image-conditioned generation for synthesizing an entire image, texture transfer, or inserting objects into a user-specified region. In contrast, in this work we focus on synthesizing complex interactions (i.e., an articulated hand) with a given object. Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it. We propose a two-step generative approach: a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object given the predicted layout. Both are built on top of a large-scale pretrained diffusion model to make use of its latent representation. Compared to baselines, the proposed method is shown to generalize better to novel objects and to perform surprisingly well on out-of-distribution in-the-wild scenes of portable-sized objects. The resulting system allows us to predict descriptive affordance information, such as hand articulation and approach orientation. Project page: https://judyye.github.io/affordiffusion-www
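For readers who want a concrete picture of the two-step pipeline, below is a minimal PyTorch sketch of the inference flow. Only the names LayoutNet and ContentNet come from the abstract; the tensor shapes, the layout parameterization, and the simple feed-forward stand-ins for the diffusion-based modules are assumptions made purely for illustration, not the paper's actual architecture.

```python
# Minimal sketch of the two-stage inference described in the abstract.
# Shapes, the layout vector, and the module internals are illustrative assumptions;
# the actual method samples both stages with a pretrained latent diffusion model.
import torch
import torch.nn as nn


class LayoutNet(nn.Module):
    """Hypothetical stand-in: maps an object latent to an articulation-agnostic
    hand-object-interaction layout (e.g., hand location, size, approach orientation)."""

    def __init__(self, latent_dim: int = 64, layout_dim: int = 5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, layout_dim)
        )

    def forward(self, object_latent: torch.Tensor) -> torch.Tensor:
        # In the real model this would be a diffusion sampling loop over layouts.
        return self.head(object_latent)


class ContentNet(nn.Module):
    """Hypothetical stand-in: synthesizes a hand-grasping-object image
    conditioned on the object latent and the predicted layout."""

    def __init__(self, latent_dim: int = 64, layout_dim: int = 5):
        super().__init__()
        self.decode = nn.Linear(latent_dim + layout_dim, 3 * 64 * 64)

    def forward(self, object_latent: torch.Tensor, layout: torch.Tensor) -> torch.Tensor:
        x = torch.cat([object_latent, layout], dim=-1)
        return self.decode(x).view(-1, 3, 64, 64)


if __name__ == "__main__":
    object_latent = torch.randn(1, 64)               # latent of the input object image
    layout = LayoutNet()(object_latent)              # step 1: sample the HOI layout
    hoi_image = ContentNet()(object_latent, layout)  # step 2: synthesize the grasp image
    print(layout.shape, hoi_image.shape)             # (1, 5) and (1, 3, 64, 64)
```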