Transformer-based models such as BERT achieve high accuracy when producing sentence embeddings for Natural Language Inference (NLI) tasks, but they require hundreds of millions of parameters. These models take sentences as sequences of tokens and learn to encode the meaning of each sequence into embeddings that can be used reliably for NLI. Essentially, every word is considered against every other word in the sequence, and the transformer learns the relationships between them entirely from scratch. A model that instead accepts explicit linguistic structures, such as dependency parse trees, may be able to leverage this prior relational information without having to learn it from scratch, thereby improving learning efficiency. To investigate this, we adapt Graph Matching Networks (GMN) to operate on dependency parse trees, creating Tree Matching Networks (TMN). We compare TMN to a BERT-based model on the SNLI entailment task and the SemEval similarity task. On SNLI, TMN achieves significantly better results than the BERT-based model with a much smaller memory footprint and far less training time, while both models struggle to perform well on SemEval. Explicit structural representations thus significantly outperform sequence-based models at comparable scales, but current aggregation methods limit scalability; we propose multi-headed attention aggregation to address this limitation.
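To make the structured input concrete, the following sketch (an illustrative assumption, not the released TMN code) shows how a premise/hypothesis pair could be converted into dependency parse trees with spaCy; the helper name `sentence_to_tree` and the node/edge output format are hypothetical.

```python
# Minimal sketch: turning a sentence pair into dependency-parse graphs.
# Assumes spaCy and the en_core_web_sm model are installed; the output
# schema (token list plus labelled head->dependent edges) is an
# illustrative assumption, not the exact input format used by TMN.
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_to_tree(sentence: str):
    """Parse a sentence and return its tokens and dependency edges."""
    doc = nlp(sentence)
    nodes = [token.text for token in doc]
    # Each edge connects a head token to its dependent, labelled with the
    # dependency relation (e.g. nsubj, dobj); the root is skipped because
    # spaCy sets the root token's head to itself.
    edges = [(token.head.i, token.i, token.dep_)
             for token in doc if token.head.i != token.i]
    return nodes, edges

premise_graph = sentence_to_tree("A man is playing a guitar on stage.")
hypothesis_graph = sentence_to_tree("A musician is performing.")
print(premise_graph)
print(hypothesis_graph)
```

A tree matching model would then propagate information along these labelled edges for each sentence and compare the two resulting graph representations, rather than learning token-to-token relationships purely from attention over the raw sequence.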