We study the problem of incorporating prior knowledge into a deep Transformer-based model, i.e., Bidirectional Encoder Representations from Transformers (BERT), to enhance its performance on semantic textual matching tasks. By probing and analyzing what BERT already knows when solving this task, we obtain a better understanding of what task-specific knowledge BERT needs most and where it is needed most. The analysis further motivates us to take a different approach from most existing works. Instead of using prior knowledge to create a new training task for fine-tuning BERT, we directly inject knowledge into BERT's multi-head attention mechanism. This leads to a simple yet effective approach that enjoys a fast training stage, as it spares the model from training on additional data or tasks beyond the main task. Extensive experiments demonstrate that the proposed knowledge-enhanced BERT consistently improves semantic textual matching performance over the original BERT model, and that the performance benefit is most salient when training data is scarce.
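To make the idea of injecting knowledge into multi-head attention concrete, the following is a minimal sketch in PyTorch. It assumes the prior knowledge takes the form of a token-pair similarity matrix that is added to the scaled dot-product attention logits before the softmax; the function name, the mixing weight alpha, and the exact injection point are illustrative assumptions, not the paper's exact specification.

    # Hypothetical sketch: mixing a prior similarity matrix into attention logits.
    import math
    import torch
    import torch.nn.functional as F

    def attention_with_prior(q, k, v, prior, alpha=1.0):
        """q, k, v: (batch, heads, seq_len, d_head)
        prior:     (batch, seq_len, seq_len) token-pair similarity scores.
        """
        d_head = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_head)
        # Broadcast the prior over attention heads and add it to the raw scores.
        scores = scores + alpha * prior.unsqueeze(1)
        weights = F.softmax(scores, dim=-1)
        return torch.matmul(weights, v)

    # Toy usage: 1 sentence pair, 2 heads, 4 tokens, 8-dim heads.
    q = k = v = torch.randn(1, 2, 4, 8)
    prior = torch.eye(4).unsqueeze(0)   # e.g., an identical-token indicator
    out = attention_with_prior(q, k, v, prior, alpha=0.5)
    print(out.shape)                    # torch.Size([1, 2, 4, 8])

Because the prior only biases the attention distribution, no new training task or extra data is introduced, which is consistent with the fast training stage claimed above.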