Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short under the high diversity of clothing shapes and textures. Moreover, the generation process should ideally be intuitively controllable for layman users. In this work, we present Text2Human, a text-driven controllable framework for high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose in two dedicated steps. 1) Given texts describing the shapes of clothes, the human pose is first translated into a human parsing map. 2) The final human image is then generated by providing the system with additional attributes describing the textures of clothes. Specifically, to model the diversity of clothing textures, we build a hierarchical texture-aware codebook that stores multi-scale neural representations for each type of texture. The codebook at the coarse level captures the structural representations of textures, while the codebook at the fine level focuses on texture details. To make use of the learned hierarchical codebook for synthesizing desired images, a diffusion-based transformer sampler with a mixture of experts is first employed to sample indices from the coarsest level of the codebook, which are then used to predict the indices of the codebooks at finer levels. The predicted indices at different levels are translated into human images by a decoder learned jointly with the hierarchical codebooks. The mixture of experts allows the generated image to be conditioned on fine-grained text inputs, and the prediction of finer-level indices refines the quality of clothing textures. Extensive quantitative and qualitative evaluations demonstrate that our framework can generate more diverse and realistic human images than state-of-the-art methods.
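The core idea of the hierarchical texture-aware codebook is that each image feature is quantized at multiple scales by nearest-neighbor lookup into separate codebooks, a coarse one for texture structure and a fine one for texture detail. The sketch below illustrates this two-level quantization in a minimal, hypothetical form; the shapes, codebook sizes, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance), as in standard vector quantization."""
    # features: (N, D) patch features; codebook: (K, D) learned entries
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (N,) integer indices into the codebook

rng = np.random.default_rng(0)
# Hypothetical sizes: a small coarse codebook for texture structure,
# and a larger fine codebook for texture detail.
coarse_codebook = rng.normal(size=(16, 8))
fine_codebook = rng.normal(size=(64, 8))

features = rng.normal(size=(32, 8))          # per-patch neural features
coarse_idx = quantize(features, coarse_codebook)
fine_idx = quantize(features, fine_codebook)
# A decoder trained jointly with the codebooks would then translate
# (coarse_idx, fine_idx) back into an image.
```

In the full framework, the coarse indices are not computed from a ground-truth image at test time but sampled by the transformer conditioned on the parsing map and text attributes, and the fine indices are predicted from the coarse ones.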