In text classification tasks, useful information is encoded in the label names. Label semantic aware systems have leveraged this information for improved text classification performance during fine-tuning and prediction. However, the use of label semantics during pre-training has not been extensively explored. We therefore propose Label Semantic Aware Pre-training (LSAP) to improve the generalization and data efficiency of text classification systems. LSAP incorporates label semantics into pre-trained generative models (T5 in our case) by performing secondary pre-training on labeled sentences from a variety of domains. As domain-general pre-training requires large amounts of data, we develop a filtering and labeling pipeline to automatically create sentence-label pairs from unlabeled text. We perform experiments on intent classification (ATIS, Snips, TOPv2) and topic classification (AG News, Yahoo! Answers). LSAP obtains significant accuracy improvements over state-of-the-art models for few-shot text classification while maintaining performance comparable to the state of the art in high-resource settings.
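To make the secondary pre-training step concrete, the sketch below casts sentence-label pairs as a sequence-to-sequence task in which T5 generates the natural-language label for an input sentence. This is a minimal illustration using the HuggingFace `transformers` API; the example pairs, input/output format, and hyperparameters are assumptions for illustration, not the authors' released configuration or data pipeline.

```python
# Minimal sketch of LSAP-style secondary pre-training with T5:
# the model learns to generate a natural-language label for each sentence,
# exposing it to label semantics before downstream fine-tuning.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5TokenizerFast.from_pretrained("t5-base")

# Hypothetical sentence-label pairs, standing in for the output of the
# automatic filtering/labeling pipeline over domain-general unlabeled text.
pairs = [
    ("play the new album by the beatles", "play music"),
    ("what is the weather in boston tomorrow", "get weather"),
    ("book a table for two at an italian place", "book restaurant"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
for sentence, label in pairs:
    inputs = tokenizer(sentence, return_tensors="pt")
    targets = tokenizer(label, return_tensors="pt")
    # T5 computes cross-entropy loss over the generated label tokens.
    loss = model(**inputs, labels=targets.input_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because the label is generated as free text rather than predicted as a class index, the same pre-trained checkpoint can later be fine-tuned on intent or topic classification tasks whose label sets were never seen during pre-training, which is what makes the approach attractive in few-shot settings.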