Self-supervised representation learning has proved to be a valuable component for out-of-distribution (OoD) detection when only the texts of in-distribution (ID) examples are available. These approaches either train a language model from scratch or fine-tune a pre-trained language model on the ID examples, and then use the perplexity output by the language model as the OoD score. In this paper, we analyse the complementary characteristics of both OoD detection methods and propose a multi-level knowledge distillation approach that integrates their strengths while mitigating their limitations. Specifically, we use a fine-tuned model as the teacher to teach a randomly initialized student model on the ID examples. Besides the prediction-layer distillation, we present a similarity-based intermediate-layer distillation method to facilitate the student's awareness of the information flow inside the teacher's layers. In this way, the derived student model gains the teacher's rich knowledge about the ID data manifold acquired through pre-training, while benefiting from seeing only ID examples during parameter learning, which promotes more distinguishable features for OoD detection. We conduct extensive experiments over multiple benchmark datasets, i.e., CLINC150, SST, 20 NewsGroups, and AG News, showing that the proposed method yields new state-of-the-art performance.
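To make the two distillation levels and the perplexity-based scoring concrete, the following is a minimal sketch, assuming PyTorch-style teacher and student language models that expose logits and per-layer hidden states (e.g., Hugging Face transformers with output_hidden_states=True). The layer pairing, loss weighting, and helper names are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (not the authors' code) of the two distillation terms and
# the perplexity OoD score described in the abstract.
import torch
import torch.nn.functional as F

def prediction_layer_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student next-token distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def similarity_matrix(hidden):
    """Pairwise token-token cosine similarities within each sequence.
    hidden: (batch, seq_len, dim) -> (batch, seq_len, seq_len)."""
    h = F.normalize(hidden, dim=-1)
    return h @ h.transpose(1, 2)

def intermediate_layer_loss(student_hiddens, teacher_hiddens):
    """Similarity-based intermediate-layer distillation: match the token
    similarity structure of paired teacher/student layers (the layer
    mapping itself is an assumption of this sketch)."""
    loss = 0.0
    for hs, ht in zip(student_hiddens, teacher_hiddens):
        loss = loss + F.mse_loss(similarity_matrix(hs), similarity_matrix(ht))
    return loss / len(student_hiddens)

def ood_score(logits, labels):
    """Perplexity of a sequence under the student LM, used as the OoD score;
    labels are the next-token targets aligned with the logits."""
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    return torch.exp(nll)
```

During training on ID examples, the total objective would combine the two terms, e.g. prediction_layer_loss(...) + lambda * intermediate_layer_loss(...); at test time, a higher ood_score indicates a more likely OoD input.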