Extreme multi-label classification (XML) is becoming increasingly relevant in the era of big data. Yet there is no method for effectively generating stratified partitions of XML datasets. Instead, researchers typically rely on provided test-train splits that 1) aren't always representative of the entire dataset and 2) are missing many of the labels. This can lead to poor generalization ability and unreliable performance estimates, as has been established in the binary and multi-class settings. This paper therefore presents a new and simple algorithm that can efficiently generate stratified partitions of XML datasets with millions of unique labels. We also examine the label distributions of prevailing benchmark splits and investigate the issues that arise from using unrepresentative subsets of data for model development. The results highlight the difficulty of stratifying XML data and demonstrate the importance of using stratified partitions for training and evaluation.
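To make the two problems named in the abstract concrete (splits that miss labels, and splits that are unrepresentative), the following minimal sketch checks a given test split against the full label distribution. It is not the stratification algorithm presented in the paper. The function names `label_counts` and `split_report`, the `mean_share_gap` measure, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def label_counts(label_sets):
    """Count how many examples carry each label."""
    counts = Counter()
    for labels in label_sets:
        counts.update(set(labels))
    return counts


def split_report(all_labels, test_idx):
    """Summarize how well a test split reflects the full label distribution.

    all_labels: list of label collections, one per example (hypothetical format).
    test_idx:   indices of the examples assigned to the test split.
    """
    test_set = set(test_idx)
    full = label_counts(all_labels)
    test = label_counts(all_labels[i] for i in test_set)
    test_frac = len(test_set) / len(all_labels)

    # Labels that occur in the dataset but never in the test split.
    missing = [lab for lab in full if test[lab] == 0]

    # In a perfectly stratified split, each label would place roughly
    # `test_frac` of its examples in the test split. Average the absolute
    # deviation from that target over all labels (an illustrative measure).
    gap = sum(abs(test[lab] / full[lab] - test_frac) for lab in full) / len(full)

    return {
        "missing_labels": len(missing),
        "label_coverage": 1 - len(missing) / len(full),
        "mean_share_gap": gap,
    }


# Toy data: six examples, four labels, two examples held out for testing.
all_labels = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}, {"d"}, {"a", "b"}]
report = split_report(all_labels, test_idx=[0, 1])
print(report)
```

In this toy split, the labels `c` and `d` never reach the test set, so any model evaluated on it cannot be scored on them. With long-tailed label distributions, where many labels have only a handful of examples, this kind of gap is easy to produce with a random split.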