Despite the well-developed, cutting-edge representation learning for language, most language representation models focus on a specific level of linguistic unit. This work introduces universal language representation learning, i.e., embedding different levels of linguistic units, or text spans of quite diverse lengths, in a uniform vector space. We propose the training objective MiSAD, which utilizes meaningful n-grams extracted from a large unlabeled corpus by a simple but effective algorithm for pre-trained language models. We then empirically verify that a well-designed pre-training scheme can effectively yield universal language representations, which bring great convenience when handling multiple layers of linguistic objects in a unified way. In particular, our model achieves the highest accuracy on analogy tasks at different language levels and significantly improves performance on downstream tasks in the GLUE benchmark and a question answering dataset.
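The abstract does not spell out how "meaningful" n-grams are mined from the unlabeled corpus. The sketch below is one plausible, minimal instantiation under an assumed pointwise mutual information (PMI) criterion; the function name, threshold, and normalization are illustrative assumptions, not the paper's actual algorithm.

```python
# Minimal sketch: mine candidate "meaningful" n-grams from an unlabeled corpus
# by scoring them with pointwise mutual information (PMI).
# ASSUMPTION: the PMI criterion, the length normalization, and all thresholds
# here are illustrative; the paper's extraction algorithm may differ.
from collections import Counter
from math import log
from typing import List, Tuple


def extract_meaningful_ngrams(sentences: List[List[str]],
                              max_n: int = 3,
                              min_count: int = 2,
                              pmi_threshold: float = 1.0) -> List[Tuple[str, ...]]:
    """Return n-grams (2 <= n <= max_n) whose length-normalized PMI exceeds a threshold."""
    unigram_counts: Counter = Counter()
    ngram_counts: Counter = Counter()
    total_tokens = 0

    for tokens in sentences:
        unigram_counts.update(tokens)
        total_tokens += len(tokens)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngram_counts[tuple(tokens[i:i + n])] += 1

    meaningful = []
    for ngram, count in ngram_counts.items():
        if count < min_count:
            continue
        # PMI of the whole n-gram against the product of its unigram probabilities.
        p_ngram = count / total_tokens
        p_indep = 1.0
        for tok in ngram:
            p_indep *= unigram_counts[tok] / total_tokens
        pmi = log(p_ngram / p_indep)
        # Normalize by (length - 1) so longer n-grams are not trivially favored.
        if pmi / (len(ngram) - 1) >= pmi_threshold:
            meaningful.append(ngram)
    return meaningful


if __name__ == "__main__":
    corpus = [
        "new york is a large city".split(),
        "she moved to new york last year".split(),
        "the city of new york never sleeps".split(),
    ]
    # On this toy corpus, the cohesive bigram ("new", "york") is extracted.
    print(extract_meaningful_ngrams(corpus))
```

In a pre-training pipeline, the extracted n-grams would then serve as the multi-granular units fed to the MiSAD objective, so that the model learns embeddings for words, phrases, and sentences in the same vector space.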