Current state-of-the-art cross-lingual summarization models employ a multi-task learning paradigm, which operates on a shared vocabulary module and relies on the self-attention mechanism to attend over tokens in two languages. However, the correlation learned by self-attention is often loose and implicit, and is inefficient at capturing crucial cross-lingual representations between languages. The problem worsens for languages with divergent morphological or structural features, which makes cross-lingual alignment more challenging and leads to a drop in performance. To overcome this problem, we propose a novel Knowledge-Distillation-based framework for Cross-Lingual Summarization, which seeks to explicitly construct cross-lingual correlation by distilling the knowledge of a monolingual summarization teacher into a cross-lingual summarization student. Since the representations of the teacher and the student lie in two different vector spaces, we further propose a Knowledge Distillation loss based on Sinkhorn Divergence, an Optimal-Transport distance, to estimate the discrepancy between the teacher and student representations. Owing to the intuitive geometric nature of Sinkhorn Divergence, the student model can effectively learn to align its cross-lingual hidden states with the teacher's monolingual hidden states, leading to a strong correlation between distant languages. Experiments on cross-lingual summarization datasets for pairs of distant languages demonstrate that our method outperforms state-of-the-art models in both high- and low-resource settings.
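To make the proposed distillation objective concrete, the following is a minimal sketch of a Sinkhorn-divergence knowledge-distillation loss between teacher and student hidden states. It is written in plain PyTorch with uniform weights over tokens and log-domain Sinkhorn iterations; the function names (`sinkhorn_divergence`, `_entropic_ot`) and hyperparameters (`eps`, `n_iters`) are illustrative assumptions, not the paper's actual implementation.

```python
import math
import torch


def _entropic_ot(x, y, eps=0.1, n_iters=50):
    """Entropy-regularized OT cost between two point clouds with uniform weights.

    x: (n, d) tensor, y: (m, d) tensor. Log-domain Sinkhorn iterations for stability.
    """
    n, m = x.size(0), y.size(0)
    # Squared Euclidean cost matrix C_ij = ||x_i - y_j||^2
    cost = torch.cdist(x, y, p=2) ** 2
    log_a = torch.full((n,), -math.log(n), device=x.device)  # uniform source weights
    log_b = torch.full((m,), -math.log(m), device=x.device)  # uniform target weights
    f = torch.zeros(n, device=x.device)  # dual potential for x
    g = torch.zeros(m, device=x.device)  # dual potential for y
    for _ in range(n_iters):
        # Block-coordinate updates of the dual potentials (log-domain Sinkhorn)
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_a[:, None], dim=0)
    # Transport plan P and the resulting regularized OT cost <P, C>
    log_p = (f[:, None] + g[None, :] - cost) / eps + log_a[:, None] + log_b[None, :]
    return (log_p.exp() * cost).sum()


def sinkhorn_divergence(x, y, eps=0.1, n_iters=50):
    """Debiased Sinkhorn divergence S(x, y) = OT(x, y) - (OT(x, x) + OT(y, y)) / 2."""
    return (_entropic_ot(x, y, eps, n_iters)
            - 0.5 * (_entropic_ot(x, x, eps, n_iters) + _entropic_ot(y, y, eps, n_iters)))


# Usage sketch: align cross-lingual student states with monolingual teacher states.
teacher_hidden = torch.randn(32, 768)                        # frozen teacher representations
student_hidden = torch.randn(40, 768, requires_grad=True)    # student representations
kd_loss = sinkhorn_divergence(student_hidden, teacher_hidden.detach())
kd_loss.backward()  # gradients flow only into the student
```

In practice, such a term would be added to the usual summarization cross-entropy loss with a weighting coefficient; the debiasing step (subtracting the self-transport costs) keeps the divergence non-negative and zero when the two sets of representations coincide.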