We propose a new approach, Knowledge Distillation using Optimal Transport (KNOT), to distill natural language semantic knowledge from multiple teacher networks into a student network. KNOT aims to train a (global) student model by minimizing the optimal transport cost between its predicted probability distribution over the labels and the weighted sum of the probability distributions predicted by the (local) teacher models, under the constraint that the student model has no access to the teacher models' parameters or training data. To evaluate the quality of knowledge transfer, we introduce a new metric, Semantic Distance (SD), that measures the semantic closeness between the predicted and ground-truth label distributions. The proposed method improves the global model's SD performance over the baseline across three NLP tasks, while performing on par with entropy-based distillation on standard accuracy and F1 metrics. The implementation pertaining to this work is publicly available at: https://github.com/declare-lab/KNOT.
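As a minimal formal sketch of the objective described above (the notation here is ours and illustrative, not necessarily the paper's): let $p_S(\cdot \mid x;\theta)$ denote the student's predicted label distribution, $p_{T_k}(\cdot \mid x)$ the $k$-th teacher's, and $\alpha_k$ the teacher weights with $\sum_k \alpha_k = 1$. The student is trained to minimize an optimal transport cost between $p_S$ and the teachers' weighted mixture,
\[
\mathcal{L}_{\mathrm{OT}}(\theta) \;=\; \min_{\gamma \,\in\, \Pi\!\left(p_S(\cdot \mid x;\theta),\; \sum_{k} \alpha_k\, p_{T_k}(\cdot \mid x)\right)} \;\sum_{i,j} \gamma_{ij}\, C_{ij},
\]
where $\Pi(\cdot,\cdot)$ is the set of transport plans (joint distributions) whose marginals equal the two label distributions, and $C_{ij}$ is a ground cost encoding the semantic distance between labels $i$ and $j$ (for instance, a distance between label embeddings).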