Generating coherent and useful image/video scenes from a free-form textual description is a technically difficult problem. Textual descriptions of the same scene can vary greatly from person to person, or even for the same person at different times. Because word choice and syntax vary across descriptions, it is challenging for a system to reliably produce a consistent, desirable output from different forms of language input. Prior work on scene generation has mostly been confined to rigid sentence structures in the text input, which restricts users' freedom to write descriptions. In our work, we study a new pipeline that aims to generate both static and animated 3D scenes from free-form textual scene descriptions of varying styles, without major restrictions. In particular, to keep our study practical and tractable, we focus on a small subspace of all possible 3D scenes, containing various combinations of cubes, cylinders, and spheres. We design a two-stage pipeline. In the first stage, we encode the free-form text using an encoder-decoder neural architecture. In the second stage, we generate a 3D scene based on the resulting encoding. Our neural architecture uses a state-of-the-art language model as the encoder to leverage rich contextual encoding, and a new multi-head decoder to predict multiple features of an object in the scene simultaneously. For our experiments, we generate a large synthetic dataset containing 1,300,000 and 1,400,000 samples of unique static and animated scene descriptions, respectively. We achieve 98.427% accuracy on the test set in detecting the features of the 3D objects. Our work shows a proof of concept of one approach to solving the problem, and we believe that, with enough training data, the same pipeline can be extended to an even broader set of 3D scene generation problems.
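To make the encoder plus multi-head decoder concrete, the sketch below shows one way such an architecture could be wired up in PyTorch: a contextual encoder produces a sentence representation, and parallel linear heads each predict one attribute of a scene object. The specific attribute heads (shape, color, size), layer dimensions, and mean pooling are illustrative assumptions, not details from this work; in particular, the actual pipeline uses a pretrained state-of-the-art language model as the encoder, whereas a small stand-in encoder is shown here for self-containment.

```python
import torch
import torch.nn as nn

class MultiHeadSceneDecoder(nn.Module):
    """Sketch of the two-stage idea: a contextual text encoder followed by
    parallel classification heads, one per object feature.
    Head names and sizes are illustrative assumptions, not the paper's."""

    def __init__(self, vocab_size=30000, d_model=256, n_layers=4,
                 n_shapes=3, n_colors=8, n_sizes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One head per predicted feature of a scene object.
        self.shape_head = nn.Linear(d_model, n_shapes)  # cube / cylinder / sphere
        self.color_head = nn.Linear(d_model, n_colors)
        self.size_head = nn.Linear(d_model, n_sizes)

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))  # (batch, seq, d_model)
        pooled = h.mean(dim=1)                   # simple pooling over tokens
        # All heads read the same encoding and predict in parallel.
        return {
            "shape": self.shape_head(pooled),
            "color": self.color_head(pooled),
            "size": self.size_head(pooled),
        }

# Toy usage: a batch of 2 tokenized descriptions, 16 tokens each.
model = MultiHeadSceneDecoder()
logits = model(torch.randint(0, 30000, (2, 16)))
print({k: v.shape for k, v in logits.items()})
```

Under this framing, each head would typically be trained with its own cross-entropy loss against the annotated object features, so that all attributes of an object are predicted simultaneously from a single shared encoding.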