Automatic topic classification has been studied extensively as a way to help manage and index scientific documents in digital collections. As the number of available topics has grown in recent years, it has become necessary to arrange them in a hierarchy, so automatic classification systems need to classify documents hierarchically. In addition, a paper is often assigned to more than one relevant topic; for example, it can be assigned to several topics in a hierarchy tree. In this paper, we introduce SciHTC, a new dataset for hierarchical multi-label text classification (HMLTC) of scientific papers, which contains 186,160 papers and 1,233 categories from the ACM CCS tree. We establish strong baselines for HMLTC and propose a multi-task learning approach for topic classification with keyword labeling as an auxiliary task. Our best model achieves a Macro-F1 score of 34.57%, which shows that this dataset provides significant research opportunities in hierarchical scientific topic classification. We make our dataset and code available on GitHub.
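To make the evaluation setting concrete, the following is a minimal Python sketch of how a Macro-F1 score might be computed for hierarchical multi-label predictions. It is an illustration under stated assumptions, not the evaluation code released with the dataset: the toy hierarchy `PARENT`, its category names, and the helpers `with_ancestors` and `macro_f1` are all hypothetical, and expanding each label set with its ancestors before scoring is one common convention that the abstract does not confirm.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical toy hierarchy (child -> parent). These are NOT real ACM CCS categories.
PARENT: Dict[str, Optional[str]] = {
    "computing": None,
    "ml": "computing",
    "ir": "computing",
    "neural_nets": "ml",
}


def with_ancestors(labels: Iterable[str]) -> Set[str]:
    """Expand a label set so that every ancestor of each label is included."""
    expanded: Set[str] = set()
    for label in labels:
        node: Optional[str] = label
        # Stop early once we reach a node whose ancestors were already added.
        while node is not None and node not in expanded:
            expanded.add(node)
            node = PARENT[node]
    return expanded


def macro_f1(gold: List[Set[str]], pred: List[Set[str]],
             categories: Iterable[str]) -> float:
    """Unweighted mean of per-category F1 over all categories.

    Assumption: a category with no true positives, false positives, or
    false negatives contributes an F1 of 0.
    """
    scores = []
    for c in categories:
        tp = sum(1 for g, p in zip(gold, pred) if c in g and c in p)
        fp = sum(1 for g, p in zip(gold, pred) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gold, pred) if c in g and c not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two toy papers: the first prediction stops one level too high in the tree.
gold = [with_ancestors({"neural_nets"}), with_ancestors({"ir"})]
pred = [with_ancestors({"ml"}), with_ancestors({"ir"})]
score = macro_f1(gold, pred, PARENT.keys())
```

Because Macro-F1 weights every category equally, rare categories deep in the tree pull the average down as much as frequent ones do, which is one plausible reason scores on a dataset with over a thousand categories can remain low.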