Extreme multi-label text classification (XMTC) is the problem of tagging a given text with the most relevant subset of labels from a large label set. Because the label space in XMTC is so large, the majority of labels have only a few training instances. To address this data sparsity, most existing XMTC methods rely on fixed label clusters obtained at an early stage to balance performance on tail labels and head labels. However, such label clusters provide a static and coarse-grained semantic scope for every text, which ignores the distinct characteristics of different texts and makes it difficult to model an accurate semantic scope for texts with tail labels. In this paper, we propose TReaderXML, a novel framework for XMTC that adopts a dynamic and fine-grained semantic scope, drawn from teacher knowledge, for each individual text in order to optimize the text-conditional prior category semantic range. TReaderXML dynamically obtains teacher knowledge for each text from similar texts and hierarchical label information in the training set, yielding a distinct, fine-grained, label-oriented semantic scope. TReaderXML then uses a novel dual cooperative network that first learns features of a text and of its corresponding label-oriented semantic scope with parallel Encoding and Reading Modules, then combines the two with an Interaction Module to regularize the text's representation by the dynamic and fine-grained label-oriented semantic scope, and finally finds the target labels with a Prediction Module. Experimental results on three XMTC benchmark datasets show that our method achieves new state-of-the-art results and performs especially well on severely imbalanced and sparse datasets.
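
To make the idea of a per-text, label-oriented semantic scope concrete, the following is a minimal sketch and not the TReaderXML implementation. It assumes bag-of-words cosine similarity as the retrieval measure and simply pools the labels of the k most similar training texts. The function names (`bow`, `cosine`, `label_scope`), the toy data, and the similarity-sum scoring are all hypothetical, and the hierarchical label information mentioned in the abstract is not modelled here.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts for a lower-cased, whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def label_scope(query, train_texts, train_labels, k=2):
    """Pool the labels of the k most similar training texts.

    Each candidate label is scored by the summed similarity of the
    neighbours that carry it (an illustrative choice, not the paper's).
    """
    q = bow(query)
    sims = [(cosine(q, bow(t)), i) for i, t in enumerate(train_texts)]
    top = sorted(sims, reverse=True)[:k]
    scope = Counter()
    for sim, i in top:
        if sim > 0:  # ignore neighbours with no lexical overlap
            for label in train_labels[i]:
                scope[label] += sim
    return dict(scope)


# Toy training set (hypothetical data).
train_texts = [
    "neural networks for image recognition",
    "convolutional neural networks for image segmentation",
    "recipes for baking sourdough bread",
]
train_labels = [
    ["deep-learning", "computer-vision"],
    ["deep-learning", "segmentation"],
    ["cooking"],
]

scope = label_scope("neural image classification", train_texts, train_labels, k=2)
print(scope)
```

In this toy run the scope differs per query text, unlike a fixed label cluster: the unrelated label `cooking` is excluded, and `deep-learning` scores highest because both retrieved neighbours carry it. A model along the lines described above would then read this candidate set alongside the text rather than use it as a final prediction.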