Existing text classification methods mainly assume a fixed label set, whereas many real-world applications require extending to new fine-grained classes as the number of samples per label grows. To accommodate such requirements, we introduce a new problem called coarse-to-fine grained classification, which aims to perform fine-grained classification on coarsely annotated data. Instead of asking for new fine-grained human annotations, we leverage label surface names as the only human guidance and weave rich pre-trained generative language models into an iterative weak-supervision strategy. Specifically, we first propose a label-conditioned finetuning formulation to adapt these generators to our task. Furthermore, we devise a regularization objective based on the coarse-fine label constraints derived from our problem setting, which yields further improvements over the prior formulation. Our framework uses the fine-tuned generative models to sample pseudo-training data for training the classifier, and bootstraps on real unlabeled data for model refinement. Extensive experiments and case studies on two real-world datasets demonstrate superior performance over SOTA zero-shot classification baselines.
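The coarse-fine label constraint can be made concrete with a minimal sketch: since each fine-grained label belongs to exactly one known coarse label, a classifier's fine-label probability mass for a document should stay within that document's annotated coarse class. The hierarchy, label names, and the negative-log form of the penalty below are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

# Hypothetical coarse-to-fine hierarchy (names are illustrative):
# each coarse label owns a disjoint set of fine-label indices.
COARSE_TO_FINE = {
    "sports": [0, 1],    # e.g., fine labels "tennis", "soccer"
    "science": [2, 3],   # e.g., fine labels "physics", "biology"
}

def coarse_constraint_loss(fine_probs: np.ndarray, coarse_label: str) -> float:
    """Penalize fine-label probability mass assigned outside the document's
    known coarse class, via the negative log of the in-class mass."""
    in_class_mass = fine_probs[COARSE_TO_FINE[coarse_label]].sum()
    return float(-np.log(in_class_mass + 1e-12))

# A prediction concentrated inside the correct coarse class incurs a
# near-zero penalty; one leaking mass to other coarse classes is penalized.
good = np.array([0.70, 0.25, 0.03, 0.02])  # 0.95 mass inside "sports"
bad = np.array([0.10, 0.10, 0.50, 0.30])   # 0.20 mass inside "sports"
assert coarse_constraint_loss(good, "sports") < coarse_constraint_loss(bad, "sports")
```

In training, such a term would be added to the classifier's (or generator's) loss so that coarse annotations act as weak supervision for the fine-grained task, even though no fine-grained labels are observed.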