This paper proposes a simple yet effective interpolation-based data augmentation approach, termed DoubleMix, to improve the robustness of models in text classification. DoubleMix first leverages a few simple augmentation operations to generate several perturbed samples for each training sample, and then uses the perturbed data and the original data to carry out a two-step interpolation in the hidden space of neural models. Concretely, it first mixes up the perturbed data into a synthetic sample and then mixes up the original data and the synthetic perturbed data. DoubleMix enhances models' robustness by learning the "shifted" features in hidden space. On six text classification benchmark datasets, our approach outperforms several popular text augmentation methods, including token-level, sentence-level, and hidden-level data augmentation techniques. Experiments in low-resource settings also show that our approach consistently improves models' performance when training data is scarce. Extensive ablation studies and case studies confirm that each component of our approach contributes to the final performance, and show that our approach exhibits superior performance on challenging counterexamples. Additionally, visual analysis shows that the text features generated by our approach are highly interpretable. Our code for this paper can be found at https://github.com/declare-lab/DoubleMix.git.
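To make the two-step interpolation concrete, the following is a minimal PyTorch sketch of mixing in hidden space. It assumes the hidden states of the perturbed samples are first combined with weights sampled from a Dirichlet distribution, and the result is then mixed with the original hidden state using a Beta-sampled weight constrained so the original sample dominates; these sampling choices are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch

def doublemix_hidden(h_orig, h_perturbed, alpha=1.0, beta=1.0):
    """Illustrative two-step interpolation in hidden space.

    h_orig:      hidden states of the original sample, shape (batch, dim)
    h_perturbed: list of K hidden states of perturbed versions,
                 each of shape (batch, dim)
    alpha, beta: concentration parameters (assumed hyperparameters)
    """
    K = len(h_perturbed)

    # Step 1: mix the K perturbed samples into one synthetic sample,
    # with mixing weights drawn from a Dirichlet distribution (assumed).
    w = torch.distributions.Dirichlet(torch.full((K,), alpha)).sample()
    h_synth = sum(w[k] * h_perturbed[k] for k in range(K))

    # Step 2: mix the original sample with the synthetic one. The weight
    # on the original is kept >= 0.5 so the mixed feature stays close to
    # the original sample (an assumption for this sketch).
    lam = torch.distributions.Beta(beta, beta).sample()
    lam = torch.max(lam, 1.0 - lam)
    return lam * h_orig + (1.0 - lam) * h_synth
```

In use, `h_orig` and `h_perturbed` would come from an intermediate encoder layer, and training would proceed on the returned mixed representation with the original label.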