The notion of "in-domain data" in NLP is often overly simplistic and vague, as textual data varies along many nuanced linguistic dimensions such as topic, style, or level of formality. In addition, domain labels are often unavailable, making it challenging to build domain-specific systems. We show that massive pre-trained language models implicitly learn sentence representations that cluster by domain without supervision, suggesting a simple data-driven definition of domains in textual data. We harness this property and propose domain data selection methods based on such models, which require only a small set of in-domain monolingual data. We evaluate our data selection methods for neural machine translation across five diverse domains, where they outperform an established approach both as measured by BLEU and by precision and recall of sentence selection with respect to an oracle.
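As a rough illustration of the selection idea (a minimal sketch, not the authors' implementation), the snippet below embeds sentences with an off-the-shelf pretrained sentence encoder and ranks a general-domain pool by cosine similarity to the centroid of a small in-domain seed set. The encoder name, the seed sentences, and the top-k cutoff are all illustrative assumptions.

```python
# Hypothetical sketch: rank a general corpus by similarity to the mean
# embedding of a small in-domain seed set, then keep the top-k sentences.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_in_domain(seed_sentences, pool_sentences, top_k=1000,
                     model_name="all-MiniLM-L6-v2"):
    # Encode both sets with a pretrained encoder; unit-normalize so that
    # dot products below are cosine similarities.
    model = SentenceTransformer(model_name)
    seed_emb = model.encode(seed_sentences, normalize_embeddings=True)
    pool_emb = model.encode(pool_sentences, normalize_embeddings=True)

    # Represent the domain by the (re-normalized) centroid of the seed set.
    centroid = seed_emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)

    # Score each pool sentence by cosine similarity to the domain centroid
    # and return the top-k most in-domain candidates.
    scores = pool_emb @ centroid
    ranked = np.argsort(-scores)[:top_k]
    return [pool_sentences[i] for i in ranked]

if __name__ == "__main__":
    seed = ["Take two tablets daily with food."]  # tiny medical-domain seed
    pool = ["The patient was administered 5 mg of the drug.",
            "The stock market rallied on Friday."]
    print(select_in_domain(seed, pool, top_k=1))
```

The unsupervised clustering claim in the abstract could be probed in the same spirit by fitting, e.g., a Gaussian Mixture Model over such embeddings and inspecting cluster purity against known domain labels.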