The extreme multi-label classification (XMC) task aims to tag content with a subset of labels from an extremely large label set. The label vocabulary is typically defined in advance by domain experts and assumed to capture all necessary tags. However, in real-world scenarios this label set, although large, is often incomplete, and experts frequently need to refine it. To develop systems that simplify this process, we introduce the task of open vocabulary XMC (OXMC): given a piece of content, predict a set of labels, some of which may be outside the known tag set. Hence, in addition to not having training data for some labels (as is the case in zero-shot classification), models need to invent some labels on the fly. We propose GROOV, a fine-tuned seq2seq model for OXMC that generates the set of labels as a flat sequence and is trained using a novel loss that is independent of the order of the predicted labels. We show the efficacy of the approach by experimenting with popular XMC datasets, for which GROOV is able to predict meaningful labels outside the given vocabulary while performing on par with state-of-the-art solutions for known labels.
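To make the task setup concrete, the following is a minimal sketch of how a flat generated sequence could be turned back into a label set and split into known and out-of-vocabulary labels. It is an illustration only, not GROOV's implementation or its training loss: the `;` separator, the lowercasing, and the helper names `parse_labels`, `split_known_novel`, and `set_f1` are all assumptions made for this example.

```python
from typing import Iterable, Set, Tuple


def parse_labels(generated: str, sep: str = ";") -> Set[str]:
    """Split a flat generated string into a set of normalized labels.

    The separator and the lowercasing are assumptions for this sketch.
    """
    return {part.strip().lower() for part in generated.split(sep) if part.strip()}


def split_known_novel(
    labels: Iterable[str], vocabulary: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Separate predicted labels into in-vocabulary and out-of-vocabulary ones."""
    predicted = set(labels)
    return predicted & vocabulary, predicted - vocabulary


def set_f1(predicted: Set[str], gold: Set[str]) -> float:
    """F1 computed over label sets, so the order of generation does not matter."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical vocabulary, model output, and gold labels.
vocabulary = {"machine learning", "nlp"}
output = "nlp; open vocabulary; machine learning"
gold = {"nlp", "machine learning", "open vocabulary"}

known, novel = split_known_novel(parse_labels(output), vocabulary)
score = set_f1(parse_labels(output), gold)
```

Because the comparison is made on sets, any permutation of the same labels in the generated sequence receives the same score, which mirrors at evaluation time the order independence that the abstract requires of the training loss.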