Text data augmentation, i.e., the creation of new textual data from an existing text, is challenging. Indeed, augmentation transformations should take language complexity into account while remaining relevant to the target Natural Language Processing (NLP) task (e.g., Machine Translation, Text Classification). Initially motivated by an application to Business Email Compromise (BEC) detection, we propose a corpus- and task-agnostic augmentation framework, which is used as a service within our company to augment English texts. Our proposal combines different methods, utilizing the BERT language model, multi-step back-translation, and heuristics. We show that our augmentation framework improves performance on several text classification tasks using publicly available models and corpora, as well as on a BEC detection task. We also provide a comprehensive discussion of the limitations of our augmentation framework.