Text classification is a widely studied problem with broad applications. In many real-world settings, the number of texts available for training classification models is limited, which renders these models prone to overfitting. To address this problem, we propose SSL-Reg, a data-dependent regularization approach based on self-supervised learning (SSL). SSL is an unsupervised learning paradigm that defines auxiliary tasks on input data without using any human-provided labels and learns data representations by solving these auxiliary tasks. In SSL-Reg, a supervised classification task and an unsupervised SSL task are performed simultaneously. Because the SSL task is defined purely on input texts, training a model on it prevents the model from overfitting the limited number of class labels in the classification task. Experiments on 17 text classification datasets demonstrate the effectiveness of our proposed method.
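The joint training described above can be sketched as a weighted sum of the two losses. The following is a minimal illustration, not the paper's exact implementation: it assumes the SSL task is masked-token prediction, and the names (`encoder`, `cls_head`, `mlm_head`, `lambda_ssl`) are hypothetical placeholders for a shared text encoder, the two task heads, and the regularization weight.

```python
import torch
import torch.nn.functional as F

# Sketch of SSL-Reg-style joint training (hypothetical names).
# A shared encoder feeds two heads: a classification head trained on
# labeled texts, and an SSL head (here, masked-token prediction) trained
# on the same texts without labels. lambda_ssl weights the SSL loss,
# which acts as a data-dependent regularizer on the encoder.

def train_step(encoder, cls_head, mlm_head, optimizer, batch, lambda_ssl=0.1):
    # tokens: (B, T) input ids; labels: (B,) class labels;
    # masked_tokens: (B, T) ids with some positions masked;
    # mask_positions: (B, T) boolean mask of masked positions;
    # mask_targets: (N,) original ids at the N masked positions.
    tokens, labels, masked_tokens, mask_targets, mask_positions = batch

    # Supervised classification loss on human-provided labels.
    hidden = encoder(tokens)                 # (B, T, D)
    logits = cls_head(hidden[:, 0])          # classify from first token
    loss_cls = F.cross_entropy(logits, labels)

    # Self-supervised loss: predict original tokens at masked positions.
    hidden_mlm = encoder(masked_tokens)      # (B, T, D)
    mlm_logits = mlm_head(hidden_mlm)        # (B, T, V)
    loss_ssl = F.cross_entropy(mlm_logits[mask_positions], mask_targets)

    # Joint objective: classification loss regularized by the SSL loss.
    loss = loss_cls + lambda_ssl * loss_ssl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_cls.item(), loss_ssl.item()
```

Because both losses backpropagate through the shared encoder, the encoder cannot specialize solely to the limited class labels; it must also retain the general-purpose representations needed to solve the unsupervised task.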