Categorizing documents into a given label hierarchy is intuitively appealing due to the ubiquity of hierarchical topic structures in massive text corpora. Although related studies have achieved satisfactory performance in fully supervised hierarchical document classification, they usually require massive human-annotated training data and utilize text information only. However, in many domains, (1) annotations are quite expensive, so only very few training samples can be acquired; (2) documents are accompanied by metadata information. Hence, this paper studies how to integrate the label hierarchy, metadata, and text signals for document categorization under weak supervision. We develop HiMeCat, an embedding-based generative framework for our task. Specifically, we propose a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata information, and textual semantics, and we introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set. Our experiments demonstrate a consistent improvement of HiMeCat over competitive baselines and validate the contributions of our representation learning and data augmentation modules.
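To make the two components named above more concrete, the following is a minimal, purely illustrative Python sketch of how a pipeline of this shape could be wired together: a toy joint-embedding step over labels, metadata, and words, followed by a toy hierarchical augmentation step that synthesizes pseudo training documents. The label hierarchy, metadata tags, function names, and random-vector "embeddings" are all hypothetical assumptions for illustration, not the authors' HiMeCat implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Toy label hierarchy (parent -> children), seed documents, and metadata tags.
# These are invented placeholders, not data from the paper.
hierarchy = {"science": ["physics", "biology"]}
seed_docs = {
    "physics": ["quantum field theory lecture notes"],
    "biology": ["gene expression in model organisms"],
}
metadata = {"physics": ["arxiv"], "biology": ["pubmed"]}  # e.g., source tags

# --- Joint representation sketch --------------------------------------------
# Stand-in word vectors; a real module would *learn* embeddings so that each
# child label stays close to its parent, its metadata, and its seed words.
_vecs = {}
def word_vec(token):
    if token not in _vecs:
        _vecs[token] = rng.standard_normal(dim)
    return _vecs[token]

def embed(tokens):
    return np.mean([word_vec(t) for t in tokens], axis=0)

label_emb = {
    label: embed(docs[0].split() + metadata[label])
    for label, docs in seed_docs.items()
}

# Parent labels get an embedding averaged over their children, so pseudo
# documents can be synthesized at every level of the hierarchy.
for parent, children in hierarchy.items():
    label_emb[parent] = np.mean([label_emb[c] for c in children], axis=0)

# --- Hierarchical data augmentation sketch -----------------------------------
# Sample pseudo "documents" near each label embedding; a real generative
# module would decode plausible word sequences instead of noisy vectors.
def synthesize(label, n=5, noise=0.1):
    base = label_emb[label]
    return [base + noise * rng.standard_normal(dim) for _ in range(n)]

augmented = {label: synthesize(label) for label in label_emb}
print({label: len(docs) for label, docs in augmented.items()})
# -> {'physics': 5, 'biology': 5, 'science': 5}
```

The synthesized vectors stand in for the augmented training set that, per the abstract, complements the small original one before a hierarchical classifier is trained.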