Despite recent advances in Machine Learning, many tasks still involve working in low-data regimes, which can make solving natural language problems difficult. Recently, a number of text augmentation techniques have emerged in the field of Natural Language Processing (NLP) that can enrich the training data with new examples, though they are not without their caveats. For instance, simple rule-based heuristic methods are effective but lack variation in semantic content and syntactic structure with respect to the original text. On the other hand, more complex deep learning approaches can cause extreme shifts in the intrinsic meaning of the text and introduce unwanted noise into the training data. To more reliably control the quality of the augmented examples, we introduce a state-of-the-art approach for Self-Controlled Text Augmentation (STA). Our approach tightly controls the generation process with a self-checking procedure that ensures generated examples retain the semantic content of the original text. Experimental results on multiple benchmark datasets demonstrate that STA substantially outperforms existing state-of-the-art techniques, whilst qualitative analysis reveals that the generated examples are both lexically diverse and semantically reliable.
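To make the self-checking idea concrete, the sketch below shows one plausible reading of it: generate candidate augmentations with an off-the-shelf paraphrase model, then keep only candidates whose sentence embedding stays close to the original text. This is a minimal illustration under our own assumptions, not the authors' STA implementation; the model names, the candidate count, and the 0.8 similarity threshold are illustrative choices.

```python
# Minimal sketch of a semantic self-check for augmented examples.
# NOTE: this is an assumed illustration, not the STA paper's code;
# the paraphrase model, embedding model, and threshold are placeholders.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Candidate generator and semantic scorer (both hypothetical choices).
paraphraser = pipeline("text2text-generation",
                       model="humarin/chatgpt_paraphraser_on_T5_base")
scorer = SentenceTransformer("all-MiniLM-L6-v2")

def self_checked_augment(text, n_candidates=8, threshold=0.8):
    """Return augmented examples that pass a semantic-similarity self-check."""
    outputs = paraphraser(text,
                          num_beams=n_candidates,
                          num_return_sequences=n_candidates,
                          max_length=128)
    candidates = [o["generated_text"] for o in outputs]
    # Self-check: embed the original and all candidates, then keep only
    # candidates whose cosine similarity to the original clears the threshold.
    emb = scorer.encode([text] + candidates, convert_to_tensor=True)
    sims = util.cos_sim(emb[0:1], emb[1:])[0]
    return [c for c, s in zip(candidates, sims) if float(s) >= threshold]

print(self_checked_augment("The movie was surprisingly good."))
```

Filtering by embedding similarity is only one way to realise a self-check; the key design point is that the generator's outputs are vetted against the source text before they are allowed into the training data.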