In this paper, we explore text classification with extremely weak supervision, i.e., relying only on the surface text of class names. This setting is more challenging than seed-driven weak supervision, which allows a few seed words per class. We attack the problem from a representation learning perspective: with ideal document representations, clustering should yield nearly the same result as the desired classification. Because the same corpus can be classified in different ways (e.g., by topic or by location), document representations should adapt to the given class names. We propose a novel framework, X-Class, to realize such adaptive representations. Specifically, we first estimate class representations by incrementally adding the most similar word to each class until inconsistency arises. Using a tailored mixture of class-attention mechanisms, we then obtain each document representation as a weighted average of contextualized word representations. Taking the assignment of each document to its nearest class as a prior, we cluster the documents and align the clusters to classes. Finally, we pick the most confident documents from each cluster to train a text classifier. Extensive experiments on 7 benchmark datasets demonstrate that X-Class can rival and even outperform seed-driven weakly supervised methods. Our dataset and code are released at https://github.com/ZihanWangKi/XClass/ .
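To make the representation step concrete, the following is a minimal sketch, not the released X-Class implementation. It assumes that word and class vectors are already available as NumPy arrays, and it substitutes a single softmax over each token's best cosine similarity to any class for the tailored mixture of class-attention mechanisms described above. The function names `cosine`, `document_representation`, and `assign_nearest_class` are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d).

    Assumes no row is the zero vector.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def document_representation(token_vecs, class_vecs):
    """Class-oriented weighted average of one document's token vectors.

    Simplifying assumption: each token's weight is a softmax over its
    highest similarity to any class representation.
    """
    best_sim = cosine(token_vecs, class_vecs).max(axis=1)  # shape (n_tokens,)
    weights = np.exp(best_sim - best_sim.max())            # numerically stable softmax
    weights /= weights.sum()
    return weights @ token_vecs                            # shape (d,)


def assign_nearest_class(doc_vecs, class_vecs):
    """Index of the most similar class for each document (the prior assignment)."""
    return cosine(doc_vecs, class_vecs).argmax(axis=1)


# Toy data: two classes in a 2-dimensional embedding space.
class_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a_tokens = np.array([[1.0, 0.1], [0.9, 0.0], [0.1, 0.2]])
doc_b_tokens = np.array([[0.0, 1.0], [0.2, 0.9]])

doc_vecs = np.stack([
    document_representation(doc_a_tokens, class_vecs),
    document_representation(doc_b_tokens, class_vecs),
])
labels = assign_nearest_class(doc_vecs, class_vecs)
```

In the full pipeline described in the abstract, these nearest-class labels would only initialize a clustering step, after which the most confident documents in each cluster are used to train the final classifier.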