Deep neural networks and huge language models are becoming omnipresent in natural language applications. As they are known for requiring large amounts of training data, there is a growing body of work to improve performance in low-resource settings. Motivated by the recent fundamental shift towards neural models and the popular pre-train and fine-tune paradigm, we survey promising approaches for low-resource natural language processing. After a discussion of the different dimensions of data availability, we give a structured overview of methods that enable learning when training data is sparse. This includes mechanisms to create additional labeled data, such as data augmentation and distant supervision, as well as transfer learning settings that reduce the need for target supervision. A goal of our survey is to explain how these methods differ in their requirements, as understanding them is essential for choosing a technique suited to a specific low-resource setting. Further key aspects of this work are to highlight open issues and to outline promising directions for future research.