Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite their success in conventional supervised document classification, existing studies pay less attention to two real-world problems: (1) the presence of metadata: in many domains, text is accompanied by additional information such as authors and tags. Such metadata serve as compelling topic indicators and should be incorporated into the categorization framework; (2) label scarcity: labeled training samples are expensive to obtain in some cases, so categorization must be performed using only a small set of annotated data. To address these two challenges, we propose MetaCat, a minimally supervised framework for categorizing text with metadata. Specifically, we develop a generative process describing the relationships between words, documents, labels, and metadata. Guided by this generative model, we embed text and metadata into the same semantic space to encode heterogeneous signals. Then, based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity. We conduct a thorough evaluation on a wide range of datasets. Experimental results demonstrate the effectiveness of MetaCat over many competitive baselines.
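To make the idea of synthesizing training samples from a label-conditioned generative process more concrete, here is a minimal toy sketch. It is not MetaCat's actual model: the two-dimensional embeddings, the softmax over dot products, the temperature value, and all names (`LABEL_VECS`, `TOKEN_VECS`, `token_distribution`, `synthesize_training_set`) are assumptions invented for illustration. It only shows how words and metadata placed in one shared space could be sampled given a label to produce pseudo-labeled documents.

```python
import math
import random

# Hypothetical toy embeddings in a shared 2-D space (values invented for illustration).
LABEL_VECS = {"sports": (1.0, 0.0), "politics": (0.0, 1.0)}
TOKEN_VECS = {
    "goal": (0.9, 0.1),
    "match": (0.8, 0.2),
    "vote": (0.1, 0.9),
    "senate": (0.2, 0.8),
    # Metadata items (an author, a tag) are embedded alongside ordinary words.
    "author:alice": (0.95, 0.05),
    "tag:election": (0.05, 0.95),
}


def token_distribution(label, temperature=0.1):
    """Softmax over dot products between the label vector and every token vector."""
    lx, ly = LABEL_VECS[label]
    scores = {t: (lx * x + ly * y) / temperature for t, (x, y) in TOKEN_VECS.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - max_score) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: v / total for t, v in exps.items()}


def synthesize(label, length, rng):
    """Sample one pseudo-document (a list of tokens) conditioned on a label."""
    probs = token_distribution(label)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=length)


def synthesize_training_set(docs_per_label, length, seed=0):
    """Build (tokens, label) pairs for every label using a seeded generator."""
    rng = random.Random(seed)
    return [
        (synthesize(label, length, rng), label)
        for label in LABEL_VECS
        for _ in range(docs_per_label)
    ]


if __name__ == "__main__":
    for tokens, label in synthesize_training_set(docs_per_label=2, length=5):
        print(label, tokens)
```

In this sketch the synthesized pairs could be fed to any supervised classifier alongside the few real labeled documents; how the real framework learns the embeddings and which distributions it samples from is not specified in the abstract and is not modeled here.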