The current state-of-the-art model for hierarchical text classification, HiAGM, has two limitations. First, it correlates each text sample with all labels in the dataset, which introduces irrelevant label information. Second, it places no statistical constraint on the label representations learned by the structure encoder, although such constraints have been shown to benefit representation learning in previous work. In this paper, we propose HTCInfoMax, which addresses these issues by introducing information maximization with two modules: text-label mutual information maximization and label prior matching. The first module explicitly models the interaction between each text sample and its ground-truth labels, filtering out irrelevant information. The second encourages the structure encoder to learn better representations, with desired statistical characteristics, for all labels, which helps handle label imbalance in hierarchical text classification. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed HTCInfoMax.
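To make the two modules concrete, below is a minimal NumPy sketch of the two objectives the abstract names. It is illustrative only: the bilinear text-label discriminator, the Gaussian choice of label prior, and the JSD-style binary cross-entropy form of the mutual-information bound are assumptions in the spirit of Deep InfoMax, not details taken from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_score(text_emb, label_emb, W):
    # Bilinear discriminator (assumed form): high score = "matching" pair.
    return float(text_emb @ W @ label_emb)

def text_label_mi_loss(text_embs, pos_label_embs, neg_label_embs, W):
    # JSD-style lower bound on mutual information, written as a
    # binary cross-entropy: push scores of (text, ground-truth label)
    # pairs up and scores of (text, shuffled label) pairs down.
    loss = 0.0
    for t, lp, ln in zip(text_embs, pos_label_embs, neg_label_embs):
        s_pos = pair_score(t, lp, W)
        s_neg = pair_score(t, ln, W)
        loss += -np.log(sigmoid(s_pos)) - np.log(1.0 - sigmoid(s_neg))
    return loss / len(text_embs)

def label_prior_matching_loss(label_embs, prior_samples, v):
    # Adversarial prior matching (assumed): a linear discriminator v
    # tries to separate learned label representations from samples of
    # an assumed Gaussian prior; minimizing this pushes the structure
    # encoder's label representations toward the prior.
    d_prior = sigmoid(prior_samples @ v)
    d_label = sigmoid(label_embs @ v)
    return float(np.mean(-np.log(d_prior) - np.log(1.0 - d_label)))

d = 8
W = rng.standard_normal((d, d)) * 0.1
v = rng.standard_normal(d) * 0.1
texts = rng.standard_normal((4, d))
pos = texts + 0.05 * rng.standard_normal((4, d))  # toy ground-truth label reps
neg = rng.permutation(pos)                        # shuffled rows = negative pairs
prior = rng.standard_normal((4, d))               # assumed Gaussian prior samples

print(text_label_mi_loss(texts, pos, neg, W))
print(label_prior_matching_loss(pos, prior, v))
```

In training, both losses would be added to the classification loss and minimized jointly; here the weights are random, so the printed values only show that each term is a well-defined positive quantity.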