The goal of eXtreme Multi-label Learning (XML) is to automatically annotate a given data point with the most relevant subset of labels from an extremely large vocabulary of labels (e.g., a million labels). Lately, many attempts have been made to address this problem that achieve reasonable performance on benchmark datasets. In this paper, rather than coming up with an altogether new method, our objective is to present and validate a simple baseline for this task. Specifically, we investigate an on-the-fly, global and structure-preserving feature embedding technique based on random projections, whose learning phase is independent of both the training samples and the label vocabulary. Further, we show how an ensemble of multiple such learners can be used to achieve a further boost in prediction accuracy with only a linear increase in training and prediction time. Experiments on three public XML benchmarks show that the proposed approach obtains competitive accuracy compared with many existing methods. Additionally, on the largest publicly available dataset, it provides an approximately 6572x speed-up in training time and an approximately 14.7x reduction in model size compared to the closest competitors.
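The abstract does not include code, but the core idea of a data-independent, structure-preserving embedding can be illustrated with a standard Gaussian random projection (a Johnson-Lindenstrauss-style construction). The sketch below is illustrative only: the function names and dimensions are our own, and the paper's actual method may differ in how projections are drawn and how the ensemble is combined.

```python
import numpy as np

def random_projection_embed(X, out_dim, seed=0):
    """Project features into a lower-dimensional space with a random
    Gaussian matrix. The projection matrix is data-independent, so the
    "learning" phase is just drawing it: no pass over the training
    samples or the label vocabulary is required."""
    rng = np.random.default_rng(seed)
    in_dim = X.shape[1]
    # Scaling by 1/sqrt(out_dim) approximately preserves pairwise
    # distances in expectation (Johnson-Lindenstrauss lemma).
    R = rng.standard_normal((in_dim, out_dim)) / np.sqrt(out_dim)
    return X @ R

# An ensemble of such learners: each member draws an independent
# projection (different seed); downstream predictions from the members
# can then be aggregated, at a cost linear in the ensemble size.
X = np.random.default_rng(42).standard_normal((100, 500))
embeddings = [random_projection_embed(X, 64, seed=s) for s in range(3)]
```

Because each ensemble member only requires drawing one random matrix, both training time and model size grow linearly with the number of members, consistent with the linear-cost claim above.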