Recent breakthroughs in Natural Language Processing (NLP) have been driven by language models trained on massive amounts of plain text. While powerful, how best to derive supervision from textual resources remains an open question. For example, language model pretraining often neglects the rich, freely available structures in textual data. In this thesis, we describe three lines of work that seek to improve the training and evaluation of neural models using naturally occurring supervision. We first investigate self-supervised training losses that enhance the performance of pretrained language models on various NLP tasks. Specifically, we alter the sentence prediction loss to make it better suited to other pretraining losses and more challenging to solve, and we design an intermediate finetuning step that uses self-supervised training to improve models' cross-task generalization. Second, we describe methods that leverage the structures in Wikipedia and in paraphrases. In particular, we propose training losses that exploit hyperlinks, article structures, and article category graphs to acquire entity-, discourse-, and entailment-related knowledge, and we propose a framework that uses paraphrase pairs to disentangle semantics and syntax in sentence representations. We extend this framework to a novel generation task that controls the syntax of output text with a sentential exemplar. Lastly, we discuss our work on tailoring textual resources to establish challenging evaluation tasks. We introduce three datasets by defining novel tasks over various fan-contributed websites: a long-form data-to-text generation dataset, a screenplay summarization dataset, and a long-form story generation dataset. These datasets have unique characteristics that pose challenges for future work in their respective task settings.