Current text classification methods typically require a substantial number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans, by contrast, can classify documents without seeing any labeled examples, relying only on a small set of words describing the categories. In this paper, we explore the potential of using only the label name of each class to train classification models on unlabeled data, without any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training. We show that our model achieves around 90% accuracy on four benchmark datasets, covering both topic and sentiment classification, without using any labeled documents: the only supervision is at most 3 words (1 in most cases) per class as the label name.
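The category-understanding step (1) can be illustrated with a masked language model: given an unlabeled sentence containing a label name, the model's vocabulary distribution at that position ranks words that could replace the label name in context, i.e., semantically related category words. Below is a minimal sketch, assuming the Hugging Face transformers library, the bert-base-uncased checkpoint, and a single hand-picked example sentence; in practice such predictions would be aggregated over many occurrences of the label name across the corpus.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# An unlabeled sentence that happens to contain the label name "sports".
text = "the latest sports news and scores"
inputs = tokenizer(text, return_tensors="pt")

# Locate the label-name token in the tokenized input.
label_id = tokenizer.convert_tokens_to_ids("sports")
pos = (inputs["input_ids"][0] == label_id).nonzero(as_tuple=True)[0].item()

# The MLM's output distribution at that position suggests words that are
# interchangeable with the label name in this context.
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

top = torch.topk(logits[0, pos], k=10).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top))
```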
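For the self-training step (3), one standard soft formulation, which this abstract does not spell out and should therefore be read as an assumption, squares and renormalizes the model's current class predictions into a sharpened target distribution, then minimizes the KL divergence to that target on unlabeled batches. The helper names below are illustrative, and a BERT-style sequence classifier returning per-class logits is assumed.

```python
import torch
import torch.nn.functional as F

def target_distribution(p: torch.Tensor) -> torch.Tensor:
    """Sharpen predictions p (batch x classes): square each probability,
    normalize by soft class frequency, renormalize rows, and detach so
    the targets carry no gradient."""
    w = p ** 2 / p.sum(dim=0, keepdim=True)
    return (w / w.sum(dim=1, keepdim=True)).detach()

def self_train_step(model, batch, optimizer):
    """One self-training update on an unlabeled batch (a dict of tokenized
    inputs); the model's own sharpened predictions serve as soft targets."""
    log_p = F.log_softmax(model(**batch).logits, dim=-1)
    q = target_distribution(log_p.exp())
    loss = F.kl_div(log_p, q, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the targets come from the model's own sharpened predictions, iterating this step over all unlabeled documents gradually generalizes the classifier beyond documents that explicitly contain category-indicative words.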