Natural language is leveraged in many computer vision tasks, such as image captioning, cross-modal retrieval, or visual question answering, to provide fine-grained semantic information. While human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. In this work, we introduce the PoseScript dataset, which pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. To increase the size of this dataset to a scale compatible with typical data-hungry learning algorithms, we propose an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information -- the posecodes -- using a set of simple but generic rules on the 3D keypoints. The posecodes are then combined into higher-level textual descriptions using syntactic rules. Automatic annotations substantially increase the amount of available data and make it possible to effectively pretrain deep models for finetuning on human captions. To demonstrate the potential of annotated poses, we show applications of the PoseScript dataset to retrieval of relevant poses from large-scale datasets and to synthetic pose generation, both based on a textual pose description.
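The pipeline above can be illustrated with a minimal sketch: a geometric rule extracts a categorical posecode from 3D keypoints (here, a joint-angle bin), and a syntactic rule turns it into a sentence fragment. The thresholds, labels, and function names below are hypothetical illustrations, not the paper's actual categories or rules.

```python
import math

def joint_angle_deg(a, b, c):
    """Angle at joint b, in degrees, formed by 3D keypoints a-b-c."""
    v1 = tuple(a[i] - b[i] for i in range(3))
    v2 = tuple(c[i] - b[i] for i in range(3))
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))

def angle_posecode(angle):
    """Map a joint angle to a coarse categorical posecode (hypothetical bins)."""
    if angle < 45:
        return "completely bent"
    if angle < 100:
        return "bent"
    if angle < 150:
        return "slightly bent"
    return "straight"

def describe_joint(side, joint, a, b, c):
    """Syntactic rule: turn one posecode into a sentence fragment."""
    return f"the {side} {joint} is {angle_posecode(joint_angle_deg(a, b, c))}"

# Hypothetical keypoints for a straight left leg (hip, knee, ankle).
print(describe_joint("left", "knee",
                     (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (0.0, -1.0, 0.0)))
# → the left knee is straight
```

A full captioner would extract many such posecodes per pose, drop uninformative ones, and aggregate the rest into fluent sentences via further syntactic rules.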