Solving text classification in a weakly supervised manner is important for real-world applications where human annotations are scarce. In this paper, we propose to query a masked language model with cloze-style prompts to obtain supervision signals. We design a prompt that combines the document itself with "this article is talking about [MASK]." A masked language model can then generate words for the [MASK] token, and these generated words, which summarize the content of a document, can be used as supervision signals. We propose a latent variable model that simultaneously learns a word distribution learner, which associates the generated words with pre-defined categories, and a document classifier, without using any annotated data. Evaluation on three datasets (AGNews, 20Newsgroups, and UCINews) shows that our method outperforms baselines by 2%, 4%, and 3%, respectively.
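The pipeline described above maps a masked LM's [MASK] predictions to pre-defined categories and then labels documents. A minimal sketch of the second step is shown below; the generated words and the word-to-category associations are hypothetical placeholders (in the paper, these associations are learned by the latent variable model, not hand-specified):

```python
from collections import Counter

# Hypothetical words a masked LM might generate for the [MASK] token in
# "this article is talking about [MASK]." (all values are assumptions).
generated_words = {
    "doc1": ["sports", "football", "game"],
    "doc2": ["politics", "election", "government"],
}

# Assumed seed associations between generated words and pre-defined
# categories; the paper learns such a word distribution instead.
word_to_category = {
    "sports": "Sports", "football": "Sports", "game": "Sports",
    "politics": "World", "election": "World", "government": "World",
}

def classify(doc_id: str) -> str:
    """Label a document by majority vote over its generated words' categories."""
    votes = Counter(
        word_to_category[w]
        for w in generated_words[doc_id]
        if w in word_to_category
    )
    return votes.most_common(1)[0][0]

print(classify("doc1"))  # Sports
print(classify("doc2"))  # World
```

This hard-voting scheme stands in for the paper's soft word distribution learner; the key idea, turning cloze-generated words into category supervision without annotated data, is the same.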