Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. To date, most NLG-oriented PLMs have been pre-trained in an unsupervised manner on large-scale general corpora. Meanwhile, an increasing number of models pre-trained with labeled data (i.e., ``supervised pre-training'') have shown superior performance to their unsupervised counterparts. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training~(MVP) for natural language generation. We collect a large-scale natural language generation corpus, MVPCorpus, from $77$ datasets over $11$ diverse NLG tasks. We then unify these examples into a general text-to-text format to pre-train the text generation model MVP in a supervised manner. For each task, we further pre-train task-specific soft prompts to stimulate the model's capacity for that task. Extensive experiments demonstrate the effectiveness and generality of our MVP model across a wide range of NLG tasks, achieving state-of-the-art performance on $13$ out of $17$ datasets.
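As a minimal illustration of the unification step described above, the sketch below shows how labeled examples from heterogeneous NLG tasks could be serialized into a single text-to-text format with a task instruction prepended to the input. The instruction wordings, field names, and the helper function are illustrative assumptions, not the exact MVPCorpus format.

\begin{verbatim}
# Minimal sketch (assumed format, not the exact MVPCorpus serialization):
# every labeled example, regardless of task, becomes an (input_text,
# target_text) pair so a single encoder-decoder can be trained on all tasks.

from typing import Dict, Tuple

# Hypothetical task instructions; the real corpus wording may differ.
TASK_INSTRUCTIONS = {
    "summarization": "Summarize the following document:",
    "data_to_text": "Describe the following structured data:",
    "question_generation": "Generate a question for the passage and answer:",
}

def to_text_to_text(task: str, example: Dict[str, str]) -> Tuple[str, str]:
    """Serialize one labeled example into a unified (source, target) pair."""
    instruction = TASK_INSTRUCTIONS[task]
    source = f"{instruction} {example['input']}"
    target = example["output"]
    return source, target

# Usage: the resulting plain-text pairs can feed a standard
# sequence-to-sequence pre-training objective.
src, tgt = to_text_to_text(
    "summarization",
    {"input": "Pre-trained language models have achieved ...",
     "output": "PLMs succeed at NLG tasks."},
)
\end{verbatim}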