Hierarchical text classification (HTC) is essential for various real-world applications. However, HTC models are challenging to develop because they often must process a large volume of documents and labels organized in a hierarchical taxonomy. Recent deep-learning-based HTC models have attempted to incorporate hierarchy information into the model structure. Consequently, these models become difficult to implement for large-scale hierarchies because the number of model parameters grows with the hierarchy size. To solve this problem, we formulate HTC as sub-hierarchy sequence generation, incorporating hierarchy information into the target label sequence instead of the model structure. We then propose the Hierarchy DECoder (HiDEC), which decodes a text sequence into a sub-hierarchy sequence via recursive hierarchy decoding, classifying all parents at the same level into their children at once. In addition, HiDEC is trained to use hierarchical path information from the root to each leaf in a sub-hierarchy composed of the target document's labels, via an attention mechanism and hierarchy-aware masking. HiDEC achieves state-of-the-art performance with significantly fewer model parameters than existing models on benchmark datasets such as RCV1-v2, NYT, and EURLEX57K.
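The recursive hierarchy decoding described above can be illustrated with a minimal sketch: starting from the root, all parents at the current level are expanded into their children at once, and only predicted children are expanded further. The toy taxonomy and keyword-based scorer below are illustrative assumptions, not HiDEC's actual model or API.

```python
# Toy label taxonomy (parent -> children); label names follow RCV1-style codes
# but the structure here is an assumption for illustration only.
TAXONOMY = {
    "root": ["CCAT", "ECAT"],
    "CCAT": ["C11", "C12"],
    "ECAT": ["E11"],
}

def score(text, label):
    # Stand-in for the decoder's per-label score; here: simple keyword match.
    return 1.0 if label.lower() in text.lower() else 0.0

def decode(text, threshold=0.5):
    """Level-wise recursive decoding: classify all parents' children at once."""
    predicted, frontier = [], ["root"]
    while frontier:
        # Gather the children of every parent at the current level.
        children = [c for p in frontier for c in TAXONOMY.get(p, [])]
        kept = [c for c in children if score(text, c) >= threshold]
        predicted.extend(kept)
        frontier = kept  # only predicted labels are expanded at the next level
    return predicted

print(decode("document about ccat and c11"))  # -> ['CCAT', 'C11']
```

Because only predicted labels are expanded, the output forms a sub-hierarchy (root-to-label paths) rather than a flat label set, mirroring the sub-hierarchy sequence the abstract describes.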