Recent works have empirically shown the effectiveness of data augmentation (DA) in NLP tasks, especially those suffering from data scarcity. Intuitively, given the size of the generated data, its diversity and quality are crucial to the performance of the targeted tasks. However, to the best of our knowledge, most existing methods consider only either the diversity or the quality of the augmented data, and thus cannot fully exploit the potential of DA for NLP. In this paper, we present EPiDA, an easy plug-in data augmentation framework that supports effective text classification. EPiDA employs two mechanisms, relative entropy maximization (REM) and conditional entropy minimization (CEM), to control data generation: REM is designed to enhance the diversity of the augmented data, while CEM is exploited to ensure their semantic consistency. EPiDA supports efficient and continuous data generation for effective classifier training. Extensive experiments show that EPiDA outperforms existing SOTA methods in most cases, despite using no agent networks or pre-trained generation networks, and that it works well with various DA algorithms and classification models. Code is available at https://github.com/zhaominyiz/EPiDA.
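To make the two mechanisms concrete, below is a minimal sketch of entropy-based selection of augmented samples, assuming a classifier that outputs label probability distributions for both the original text and each candidate augmentation. The helper names (rem_score, cem_score, select_augmented) and the weighting scheme with alpha are illustrative assumptions, not the paper's released implementation; see the repository above for the actual code.

```python
import numpy as np
from scipy.stats import entropy  # KL divergence when given two distributions

def rem_score(p_orig, p_aug):
    # Relative entropy (KL divergence) between the classifier's predictive
    # distributions on the augmented and original samples; a larger value
    # indicates a more diverse augmentation.
    return entropy(p_aug, p_orig)

def cem_score(p_aug):
    # Conditional entropy of the label given the augmented sample; a smaller
    # value means the classifier is confident, i.e. the augmentation is more
    # likely to preserve the original label.
    return entropy(p_aug)

def select_augmented(p_orig, candidates, k=2, alpha=0.5):
    # Rank candidate augmentations by a weighted combination of diversity
    # (REM, maximized) and semantic consistency (CEM, minimized), then keep
    # the top-k for classifier training. alpha is a hypothetical trade-off
    # weight, not a value taken from the paper.
    scores = [alpha * rem_score(p_orig, p) - (1 - alpha) * cem_score(p)
              for p in candidates]
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order[:k]]

if __name__ == "__main__":
    # Toy binary-classification example with three candidate augmentations.
    p_orig = np.array([0.9, 0.1])
    candidates = [np.array([0.8, 0.2]),
                  np.array([0.5, 0.5]),
                  np.array([0.95, 0.05])]
    print(select_augmented(p_orig, candidates, k=1))
```

In this sketch, the REM term pushes selected samples away from the original's predictive distribution (diversity), while the CEM term keeps the classifier confident about their labels (semantic consistency); candidates scoring well on both are fed back into training.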