Natural language understanding (NLU) is the task of semantic decoding of human languages by machines. NLU models rely heavily on large amounts of training data to ensure good performance. However, many languages and domains have very limited data resources and few domain experts. It is therefore necessary to overcome the data scarcity challenge when few or even zero training samples are available. In this thesis, we focus on developing cross-lingual and cross-domain methods to tackle these low-resource issues. First, we propose to improve the model's cross-lingual ability by focusing on task-related keywords, enhancing model robustness, and regularizing the representations. We find that the representations for low-resource languages can be easily and substantially improved by focusing on just the keywords. Second, we present Order-Reduced Modeling methods for cross-lingual adaptation, and find that modeling partial word orders instead of the whole sequence improves the model's robustness against word order differences between languages and the transfer of task knowledge to low-resource languages. Third, we propose to leverage different levels of domain-related corpora as well as additional masking of data during pre-training for cross-domain adaptation, and discover that more challenging pre-training better addresses the domain discrepancy issue in task knowledge transfer. Finally, we introduce a coarse-to-fine framework, Coach, and a cross-lingual and cross-domain parsing framework, X2Parser. Coach decomposes the representation learning process into coarse-grained and fine-grained feature learning, and X2Parser simplifies hierarchical task structures into flattened ones. We observe that simplifying task structures makes representation learning more effective for low-resource languages and domains.