Multi-label text classification is the problem of assigning each document its most relevant labels from a label set. In real-world applications, document metadata and a hierarchy over the labels are commonly available. However, most existing studies model only the text, and the few that use additional signals exploit either metadata or the hierarchy, but not both. In this paper, we bridge the gap by formalizing the problem of metadata-aware text classification in a large label hierarchy (e.g., with tens of thousands of labels). To address this problem, we present MATCH, an end-to-end framework that leverages both metadata and hierarchy information. To incorporate metadata, we pre-train the embeddings of text and metadata in the same space and use fully connected attention to capture the interrelations between them. To leverage the label hierarchy, we propose different ways to regularize the parameters and the output probability of each child label by those of its parents. Extensive experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH over state-of-the-art deep learning baselines.
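To make the idea of regularizing a child label by its parents concrete, the following is a minimal sketch, not the paper's implementation. It assumes two illustrative penalty forms that the abstract does not specify: a squared distance between the child's and parent's classifier weights, and a hinge penalty whenever a child's predicted probability exceeds its parent's. The function names, the `parents` mapping, and the toy labels are all hypothetical.

```python
from typing import Dict, List, Sequence


def parameter_penalty(weights: Dict[str, Sequence[float]],
                      parents: Dict[str, List[str]]) -> float:
    """Assumed form: 0.5 * ||w_child - w_parent||^2 summed over child-parent pairs."""
    total = 0.0
    for child, parent_list in parents.items():
        for parent in parent_list:
            total += 0.5 * sum((c - p) ** 2
                               for c, p in zip(weights[child], weights[parent]))
    return total


def output_penalty(probs: Dict[str, float],
                   parents: Dict[str, List[str]]) -> float:
    """Assumed form: max(0, p_child - p_parent) summed over child-parent pairs."""
    total = 0.0
    for child, parent_list in parents.items():
        for parent in parent_list:
            total += max(0.0, probs[child] - probs[parent])
    return total


# Hypothetical three-level hierarchy: each label maps to its parent labels.
parents = {
    "neural_networks": ["machine_learning"],
    "machine_learning": ["computer_science"],
}

# Toy per-label classifier weights and predicted probabilities for one document.
weights = {
    "computer_science": [1.0, 0.0],
    "machine_learning": [1.0, 1.0],
    "neural_networks": [0.0, 1.0],
}
probs = {
    "computer_science": 0.875,
    "machine_learning": 0.5,
    "neural_networks": 0.75,
}

param_reg = parameter_penalty(weights, parents)
output_reg = output_penalty(probs, parents)
print(param_reg, output_reg)
```

In this toy case only the pair (neural_networks, machine_learning) contributes to the output penalty, because the child is predicted as more likely than its parent. In training, penalties of this kind would typically be added to the classification loss with tunable coefficients.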