We consider language modelling (LM) as a multi-label structured prediction task by re-framing training from solely predicting a single ground-truth word to ranking a set of words which could continue a given context. To avoid annotating top-$k$ ranks, we generate them using pre-trained LMs: GPT-2, BERT, and Born-Again models. This leads to a rank-based form of knowledge distillation (KD). We also develop a method which uses $N$-grams to create a non-probabilistic teacher that generates the ranks without the need for a pre-trained LM. We confirm the hypotheses that we can treat LMing as a ranking task and that we can do so without a pre-trained LM. We show that rank-based KD generally improves perplexity (PPL), often with statistical significance, compared to Kullback-Leibler-based KD. Surprisingly, given the simplicity of the method, $N$-grams act as competitive teachers, achieving performance similar to either BERT or Born-Again model teachers. GPT-2, however, always acts as the best teacher; with it and a Transformer-XL student on Wiki-02, rank-based KD reduces a cross-entropy baseline from 65.27 to 55.94 PPL, compared to 56.70 for KL-based KD.
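The $N$-gram teacher described above can be sketched in a few lines: count continuations of each context in a corpus and return the most frequent next words as a rank list. This is a minimal illustration under assumed details (the function name, the simple frequency ranking, and the toy corpus are hypothetical, not the paper's exact construction):

```python
from collections import Counter, defaultdict

def ngram_rank_teacher(corpus, n=2, k=3):
    """Hypothetical count-based n-gram 'teacher': for each (n-1)-gram
    context, rank observed next words by frequency, yielding top-k rank
    targets without any pre-trained LM."""
    counts = defaultdict(Counter)
    for i in range(len(corpus) - n + 1):
        context = tuple(corpus[i:i + n - 1])
        counts[context][corpus[i + n - 1]] += 1
    # Per context, keep the k most frequent continuations as the rank list.
    return {ctx: [w for w, _ in c.most_common(k)] for ctx, c in counts.items()}

corpus = "the cat sat on the mat and the cat ran".split()
ranks = ngram_rank_teacher(corpus, n=2, k=2)
# Context ("the",): "cat" occurs twice, "mat" once, so "cat" is ranked first.
```

A student LM could then be trained to reproduce these orderings with a ranking loss, in place of the probability-matching objective of KL-based KD.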