Data augmentation techniques are widely used for enhancing the performance of machine learning models by tackling class imbalance and data sparsity. State-of-the-art generative language models have been shown to provide significant gains across different NLP tasks. However, their applicability to data augmentation for text classification tasks in few-shot settings has not been fully explored, especially for specialised domains. In this paper, we leverage GPT-2 (Radford et al., 2019) to generate artificial training instances in order to improve classification performance. Our aim is to analyse the impact that the selection of seed training examples has on the quality of GPT-generated samples and, consequently, on classifier performance. We perform experiments with several seed selection strategies that, among others, exploit class hierarchical structures and domain expert selection. Our results show that fine-tuning GPT-2 on a handful of labelled instances leads to consistent classification improvements and outperforms competitive baselines. Finally, we show that guiding this process through domain expert selection can lead to further improvements, which opens up interesting research avenues for combining generative models and active learning.
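The sketch below is a minimal illustration (not the authors' exact pipeline) of the augmentation idea described above: GPT-2 is fine-tuned on a handful of labelled seed examples formatted as label-prefixed sequences, and synthetic training instances are then sampled by conditioning generation on each class label. It uses the HuggingFace transformers library; the separator token, toy seed examples, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: class-conditional augmentation with GPT-2 (assumptions noted above).
import torch
from torch.utils.data import Dataset
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

SEP = " || "  # assumed separator between class label and text

class SeedDataset(Dataset):
    """Wraps (label, text) seed pairs as causal language-modelling examples."""
    def __init__(self, pairs, tokenizer, max_len=128):
        self.examples = [
            tokenizer(f"{label}{SEP}{text}{tokenizer.eos_token}",
                      truncation=True, max_length=max_len)
            for label, text in pairs
        ]
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, i):
        return self.examples[i]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Toy seed instances; in the paper these would come from a seed selection strategy.
seed_pairs = [("sports", "The team clinched the title in overtime."),
              ("finance", "Shares fell sharply after the earnings call.")]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-aug", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=5e-5),
    train_dataset=SeedDataset(seed_pairs, tokenizer),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

def generate(label, n=5, max_new_tokens=60):
    """Sample n synthetic texts for a class by prompting with its label prefix."""
    prompt = tokenizer(f"{label}{SEP}", return_tensors="pt")
    outputs = model.generate(**prompt, do_sample=True, top_p=0.95,
                             num_return_sequences=n, max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
    return [tokenizer.decode(o, skip_special_tokens=True).split(SEP, 1)[-1].strip()
            for o in outputs]

print(generate("sports"))
```

The generated texts would then be appended to the original training set of the downstream classifier; how the seed pairs are chosen (e.g. via class hierarchies or domain expert selection) is the variable studied in the paper.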