Hierarchical text classification, which aims to classify text documents into a given hierarchy, is an important task in many real-world applications. Recently, deep neural models have gained popularity for text classification due to their expressive power and minimal requirements for feature engineering. However, applying deep neural networks to hierarchical text classification remains challenging, because they rely heavily on large amounts of training data and, moreover, cannot easily determine the appropriate level for a document in the hierarchy. In this paper, we propose a weakly-supervised neural method for hierarchical text classification. Our method does not require a large amount of training data; it requires only easy-to-provide weak supervision signals, such as a few class-related documents or keywords. Our method effectively leverages these weak supervision signals to generate pseudo documents for model pre-training, and then performs self-training on real unlabeled data to iteratively refine the model. During training, our model features a hierarchical neural structure that mirrors the given hierarchy and is capable of determining the proper level for each document via a blocking mechanism. Experiments on three datasets from different domains demonstrate the efficacy of our method compared with a comprehensive set of baselines.
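The blocking mechanism described above can be illustrated with a minimal sketch: a document descends the class hierarchy node by node, and is "blocked" at an internal class when no child classifier is confident enough to send it further down. All names, the toy hierarchy, and the confidence threshold below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical top-down inference with blocking (not the paper's code).
THRESHOLD = 0.8  # assumed confidence cutoff for descending further


class Node:
    """A class in the hierarchy; leaves have no children."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def classify_with_blocking(doc, root, predict_children):
    """Descend from `root`; stop at a leaf, or earlier when blocked."""
    node = root
    while node.children:
        # predict_children stands in for the per-node neural classifier
        # and returns {child_node: probability} for `doc`.
        probs = predict_children(doc, node)
        best = max(probs, key=probs.get)
        if probs[best] < THRESHOLD:
            break  # blocked: the document stays at this internal class
        node = best
    return node


# Toy hierarchy: root -> {sports, politics}; sports -> {soccer, tennis}.
soccer, tennis = Node("soccer"), Node("tennis")
sports = Node("sports", [soccer, tennis])
politics = Node("politics")
root = Node("root", [sports, politics])


def fake_predict(doc, node):
    # Stand-in scores: confident at the top level, but uncertain
    # between the two leaves, which triggers blocking at "sports".
    if node.name == "root":
        return {sports: 0.95, politics: 0.05}
    return {soccer: 0.55, tennis: 0.45}


print(classify_with_blocking("match report", root, fake_predict).name)
# -> sports (blocked before reaching a leaf)
```

In the toy run, the document is confidently routed into "sports" but neither leaf classifier clears the threshold, so it is assigned to the internal class "sports" rather than being forced to a leaf.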