The quality of artificially generated texts has improved considerably with the advent of transformers. The question of using these models to generate training data for supervised learning tasks naturally arises. In this article, this question is explored under three aspects: (i) are artificial data an efficient complement? (ii) can they replace the original data when those are not available or cannot be distributed for confidentiality reasons? (iii) can they improve the explainability of classifiers? Different experiments are carried out on Web-related classification tasks -- namely sentiment analysis on product reviews and Fake News detection -- using data artificially generated by fine-tuned GPT-2 models. The results show that such artificial data can be used to a certain extent but require pre-processing to significantly improve performance. We show that bag-of-words approaches benefit the most from such data augmentation.