Multi-label text classification is a challenging task because it requires capturing label dependencies. It becomes even more challenging when the class distribution is long-tailed. Resampling and re-weighting are common approaches for addressing class imbalance; however, they are not effective when label dependency exists alongside class imbalance, because they lead to oversampling of common labels. Here, we introduce the application of balancing loss functions to multi-label text classification. We perform experiments on a general-domain dataset with 90 labels (Reuters-21578) and a domain-specific dataset from PubMed with 18,211 labels. We find that a distribution-balanced loss function, which inherently addresses both the class imbalance and label linkage problems, outperforms commonly used loss functions. Distribution-balancing methods have been used successfully in image recognition. Here, we show their effectiveness in natural language processing. Source code is available at https://github.com/Roche/BalancedLossNLP.
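The linked repository contains the authors' implementation. As a rough illustration of the idea only, the following is a minimal PyTorch sketch of a distribution-balanced loss in the style introduced for long-tailed image recognition, combining re-balanced weighting with negative-tolerant regularization. The class names, constructor parameters (class_freq, train_num), and default hyperparameter values below are illustrative assumptions, not the paper's tuned settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistributionBalancedLoss(nn.Module):
    """Sketch of a distribution-balanced loss: per-instance re-balanced
    weighting plus negative-tolerant regularization (NTR). Hyperparameter
    names (alpha, beta, mu, lambda_, kappa) follow the original
    formulation; the defaults here are illustrative, not tuned."""

    def __init__(self, class_freq, train_num,
                 alpha=0.1, beta=10.0, mu=0.3, lambda_=5.0, kappa=0.05):
        super().__init__()
        class_freq = torch.as_tensor(class_freq, dtype=torch.float)
        self.register_buffer("class_freq", class_freq)  # n_k per label
        self.alpha, self.beta, self.mu = alpha, beta, mu
        self.lambda_ = lambda_
        # Class prior p_k = n_k / N defines the NTR logit shift v_k.
        prior = class_freq / train_num
        self.register_buffer("bias", -torch.log(1.0 / prior - 1.0) * kappa)

    def forward(self, logits, labels):
        # Re-balanced weighting r_ik = P^C_k / P^I(x_i):
        # class-level sampling probability over instance-level probability.
        pc = 1.0 / self.class_freq                            # (C,)
        pi = (labels / self.class_freq).sum(1, keepdim=True)  # (B, 1)
        r = pc.unsqueeze(0) / pi.clamp(min=1e-12)             # (B, C)
        # Smooth r into the range [alpha, alpha + 1].
        weight = self.alpha + torch.sigmoid(self.beta * (r - self.mu))

        # Negative-tolerant regularization: shift logits by v_k and
        # down-scale the gradient on negative labels by 1 / lambda.
        logits = logits - self.bias
        pos_loss = labels * F.softplus(-logits)
        neg_loss = (1 - labels) / self.lambda_ * F.softplus(self.lambda_ * logits)
        return (weight * (pos_loss + neg_loss)).mean()
```

In use, class_freq would be the per-label positive counts of the training set and train_num its size; logits and labels are (batch, num_labels) tensors, with labels in {0, 1}. The re-balancing term discounts head labels that co-occur with tail labels, which is what lets the loss handle label dependency and imbalance jointly.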