Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often unavailable. To relax the dependence on labeled data for downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which handles multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared common latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder that learns to generate text by reconstructing the input text given its coordinate in the shared latent space. Consequently, during inference, following the data-to-text pipeline, ZeroNLG can generate target sentences in different languages given the coordinate of the input data in the common space. Within this unified framework, given visual (image or video) data as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation. We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and believable outputs and significantly outperforms existing zero-shot methods.
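The three training steps and the inference pipeline described above can be illustrated with a deliberately tiny toy. This is only a sketch under strong assumptions, not the ZeroNLG implementation: the linear `project` function, the mean-squared `alignment_loss`, the hand-written `TEXT_COORDS` table, and the nearest-neighbour `decode` are all hypothetical stand-ins for the neural encoders, the alignment objective, and the multilingual auto-encoder, whose actual forms the abstract does not specify.

```python
import math


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def project(features, weights):
    """Step (i): map domain-specific features to a coordinate in the shared space.

    A plain linear map is an assumption made for illustration only.
    """
    return normalize([sum(w * x for w, x in zip(row, features)) for row in weights])


def alignment_loss(z_a, z_b):
    """Step (ii): mean squared distance between two coordinates.

    Assumed objective; driving it to zero places both inputs at the same point.
    """
    return sum((a - b) ** 2 for a, b in zip(z_a, z_b)) / len(z_a)


# Hypothetical text coordinates, assumed to be already aligned across languages.
TEXT_COORDS = {
    "en": {"a dog": [1.0, 0.0], "a cat": [0.0, 1.0]},
    "de": {"ein Hund": [1.0, 0.0], "eine Katze": [0.0, 1.0]},
}


def decode(z, lang):
    """Step (iii) stand-in: return the sentence in `lang` nearest to coordinate z.

    A real decoder would generate text token by token; lookup keeps the toy small.
    """
    candidates = TEXT_COORDS[lang]
    return min(candidates, key=lambda s: alignment_loss(z, candidates[s]))


# Hypothetical image encoder: 3-dimensional features -> 2-dimensional latent space.
IMAGE_WEIGHTS = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

# Zero-shot captioning: image features -> shared coordinate -> German text.
z_img = project([0.9, 0.1, 0.0], IMAGE_WEIGHTS)
caption_de = decode(z_img, "de")

# Zero-shot translation: English text coordinate -> German text.
z_txt = TEXT_COORDS["en"]["a cat"]
translation_de = decode(z_txt, "de")

# Auto-encoding: decoding a sentence's own coordinate reconstructs the sentence.
reconstruction_en = decode(z_txt, "en")
```

The point of the toy is that the decoder only ever sees coordinates in the shared space, so once the domains are aligned, the same decoding step serves captioning and translation without any image-text or cross-lingual training pairs for the downstream task.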