In this paper, we investigate how addressing difficult samples in a given text dataset affects the downstream text classification task. We define difficult samples as non-obvious cases for text classification, identified by analysing them in the semantic embedding space: (i) semantically similar samples that belong to different classes, and (ii) semantically dissimilar samples that belong to the same class. We propose a penalty function that measures the overall difficulty score of every sample in the dataset. We conduct exhaustive experiments on 13 standard datasets, showing a consistent improvement of up to 9%, and discuss qualitative results that show the effectiveness of our approach in identifying difficult samples for a text classification model.
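To make the two notions of difficulty concrete, the following is a minimal sketch of one possible per-sample score. It is not the penalty function proposed in the paper, whose exact form is not given in this abstract. The function name `difficulty_scores`, the use of cosine similarity, and the unweighted sum of the two terms are all assumptions made for illustration, and the embeddings are assumed to be non-zero vectors from any sentence encoder.

```python
import numpy as np

def difficulty_scores(embeddings, labels):
    """Hypothetical per-sample difficulty score (illustrative, not the paper's penalty)."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)

    # Cosine similarity between all pairs, via L2-normalised rows.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T

    # Pairwise masks: same class (excluding self-pairs) and different class.
    same = y[:, None] == y[None, :]
    np.fill_diagonal(same, False)
    diff = y[:, None] != y[None, :]

    # (i) Similar samples in different classes:
    #     mean similarity to samples of other classes.
    inter = np.where(diff, sim, 0.0).sum(axis=1) / np.maximum(diff.sum(axis=1), 1)

    # (ii) Dissimilar samples in the same class:
    #      mean (1 - similarity) to samples of the same class.
    intra = np.where(same, 1.0 - sim, 0.0).sum(axis=1) / np.maximum(same.sum(axis=1), 1)

    # Assumption: both terms contribute equally to the overall score.
    return inter + intra

# Toy data: the last sample lies close to class 0 but is labelled class 1.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
toy_labels = [0, 0, 1, 1, 1]
toy_scores = difficulty_scores(toy_embeddings, toy_labels)
print(int(np.argmax(toy_scores)))
```

Under these assumptions, a sample scores high when it sits close to other classes, far from its own class, or both, which matches cases (i) and (ii) in the definition above.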