Classification tasks require a balanced data distribution to ensure that the learner is trained to generalize over all classes. In real-world datasets, however, the number of instances varies substantially across classes. This typically yields a learner biased toward the majority class because of its dominance. Methods for handling imbalanced datasets are therefore crucial for alleviating distributional skew and fully utilizing under-represented data, especially in text classification. Most methods that address imbalance in text data apply sampling to the numerical representation of the data, which ties their performance to the effectiveness of that representation. We propose a novel training method, Sequential Targeting (ST), independent of the effectiveness of the representation method, which enforces an incremental learning setting by splitting the data into mutually exclusive subsets and training the learner adaptively. To address the problems that arise within incremental learning, we apply elastic weight consolidation. We demonstrate the effectiveness of our method through experiments on simulated benchmark datasets (IMDB) and data collected from NAVER.
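To make the training scheme concrete, the sketch below shows stage-wise training over mutually exclusive subsets with an elastic weight consolidation (EWC) penalty applied from the second stage onward. It is a minimal illustration, assuming a PyTorch classifier and pre-split data loaders; the function names (`sequential_targeting`, `fisher_diagonal`, `ewc_penalty`), the penalty weight `lam`, and the training hyperparameters are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def fisher_diagonal(model, loader):
    """Estimate the diagonal Fisher information from squared gradients of the loss."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for x, y in loader:
        model.zero_grad()
        F.cross_entropy(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, anchor):
    """Quadratic EWC penalty anchoring weights to the previous stage's optimum."""
    total = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            total = total + (fisher[n] * (p - anchor[n]) ** 2).sum()
    return total

def sequential_targeting(model, subset_loaders, epochs=3, lam=100.0, lr=1e-3):
    """Train on mutually exclusive subsets in sequence, with EWC after the first stage."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    fisher, anchor = None, None
    for loader in subset_loaders:
        model.train()
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss = F.cross_entropy(model(x), y)
                if fisher is not None:  # regularize from the second subset onward
                    loss = loss + (lam / 2) * ewc_penalty(model, fisher, anchor)
                loss.backward()
                opt.step()
        # Consolidate: snapshot parameters and Fisher estimates for the next stage.
        anchor = {n: p.detach().clone() for n, p in model.named_parameters()}
        fisher = fisher_diagonal(model, loader)
    return model
```

The consolidation step after each subset is what lets later, minority-targeted stages reshape the decision boundary without erasing what earlier stages learned: the Fisher-weighted penalty keeps parameters important to previous subsets near their anchored values.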