Insufficient or even unavailable training data for emerging classes is a major challenge in many classification tasks, including text classification. Recognising text documents of classes that were never seen during the learning stage, so-called zero-shot text classification, is therefore difficult, and only a few previous works have tackled this problem. In this paper, we propose a two-phase framework together with data augmentation and feature augmentation to solve this problem. Four kinds of semantic knowledge (word embeddings, class descriptions, class hierarchy, and a general knowledge graph) are incorporated into the proposed framework to deal with instances of unseen classes effectively. Experimental results show that each of the two phases, as well as their combination, achieves the best overall accuracy compared with baselines and recent approaches in classifying real-world texts under the zero-shot scenario.
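To make the role of semantic knowledge concrete, the following is a minimal sketch of how word embeddings and class descriptions alone can score a document against classes that have no training examples. It is not the two-phase framework proposed in the paper, and it omits data augmentation, feature augmentation, the class hierarchy, and the knowledge graph. The vectors in `TOY_VECTORS`, the class descriptions, and the function names are all hypothetical and chosen only for illustration.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values, for illustration only).
# A real system would load pretrained word embeddings instead.
TOY_VECTORS = {
    "goal": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "team": [0.7, 0.1, 0.2],
    "election": [0.1, 0.9, 0.1],
    "vote": [0.0, 0.8, 0.2],
    "senate": [0.1, 0.9, 0.0],
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def zero_shot_classify(document, class_descriptions):
    """Return the label whose description is most similar to the document."""
    doc_vec = embed(document)
    scores = {
        label: cosine(doc_vec, embed(description))
        for label, description in class_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical class descriptions; no labelled documents are needed for either class.
CLASS_DESCRIPTIONS = {
    "sports": "team match goal",
    "politics": "election vote senate",
}

label, scores = zero_shot_classify("the senate vote", CLASS_DESCRIPTIONS)
print(label, scores)
```

Because the classifier compares a document only with a textual description of each class, a new class can be added by writing a description for it, without collecting labelled documents.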