Hierarchical Text Classification (HTC) is a challenging task in which a document can be assigned to multiple hierarchically structured categories within a taxonomy. Most prior studies treat HTC as a flat multi-label classification problem, which inevitably leads to the "label inconsistency" problem. In this paper, we formulate HTC as a sequence generation task and introduce a sequence-to-tree framework (Seq2Tree) for modeling the hierarchical label structure. Moreover, we design a constrained decoding strategy with a dynamic vocabulary to ensure the label consistency of the results. Compared with previous work, the proposed approach achieves significant and consistent improvements on three benchmark datasets.
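To make the idea of constrained decoding with a dynamic vocabulary concrete, the following is a minimal sketch, not the Seq2Tree implementation itself. It assumes that the label tree is linearized depth-first, that a hypothetical `<pop>` token moves back up one level, and that a toy `score` function stands in for the decoder's logits. The taxonomy, the token names, and the greedy search are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Set

# Hypothetical special tokens, chosen only for this illustration.
ROOT, POP, EOS = "<root>", "<pop>", "<eos>"


def allowed_tokens(taxonomy: Dict[str, List[str]],
                   stack: List[str],
                   used: Set[str]) -> List[str]:
    """Dynamic vocabulary: unused children of the current node,
    plus one token that leaves the current node (<pop> or <eos>)."""
    current = stack[-1]
    children = [c for c in taxonomy.get(current, []) if c not in used]
    back = EOS if current == ROOT else POP
    return children + [back]


def constrained_decode(taxonomy: Dict[str, List[str]],
                       score: Callable[[List[str], str], float],
                       max_steps: int = 50) -> List[str]:
    """Greedy decoding restricted at every step to the allowed tokens."""
    stack: List[str] = [ROOT]   # path from the root to the current node
    used: Set[str] = set()      # labels already emitted
    output: List[str] = []
    for _ in range(max_steps):
        candidates = allowed_tokens(taxonomy, stack, used)
        token = max(candidates, key=lambda t: score(output, t))
        if token == EOS:
            break
        output.append(token)
        if token == POP:
            stack.pop()
        else:
            used.add(token)
            stack.append(token)
    return output


# Toy taxonomy and scores (assumed values, standing in for a trained model).
taxonomy = {
    ROOT: ["Science", "Sports"],
    "Science": ["Physics", "Biology"],
    "Sports": ["Tennis"],
}
weights = {"Science": 2.0, "Physics": 1.5, "Biology": -1.0,
           "Sports": -0.5, "Tennis": 3.0, POP: 0.0, EOS: 0.0}


def toy_score(prefix: List[str], token: str) -> float:
    return weights[token]


sequence = constrained_decode(taxonomy, toy_score)
labels = [t for t in sequence if t != POP]
print(sequence)  # ['Science', 'Physics', '<pop>', '<pop>']
print(labels)    # ['Science', 'Physics']
```

In this toy run, "Tennis" has the highest raw score, yet it is never emitted because its parent "Sports" is not selected. A flat multi-label classifier thresholding the same scores would output "Tennis" without "Sports", which is the kind of label inconsistency the constraint rules out.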