Semantic text matching is a critical problem in information retrieval. Recently, deep learning techniques have been widely used in this area and have yielded significant performance improvements. However, most of these models are black boxes, and the poor interpretability of deep learning makes it hard to understand what happens in the matching process. This paper aims to tackle this problem. The key idea is to test whether existing deep text matching methods satisfy some fundamental heuristics in information retrieval. Specifically, four heuristics are used in our study: the term frequency constraint, the term discrimination constraint, the length normalization constraints, and the TF-length constraint. Since deep matching models usually contain many parameters, it is difficult to conduct a theoretical study of these complicated functions. In this paper, we propose an empirical testing method instead. Specifically, we first construct queries and documents that satisfy the assumption of a constraint, and then test to what extent a deep text matching model trained on the original dataset satisfies the corresponding constraint. In addition, a well-known attribution-based interpretation method, integrated gradients, is adopted to conduct a detailed analysis and to guide feasible improvements. Experimental results on LETOR 4.0 and MS Marco show that all the investigated deep text matching methods, both representation-based and interaction-based, satisfy the above constraints with high probability. We further extend these constraints to semantic settings and find that all the deep text matching models satisfy the semantic versions to an even greater degree. These empirical findings help explain why deep text matching models usually perform well in information retrieval. We believe the proposed evaluation methodology will be useful for testing future deep text matching models.
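As an illustration of the empirical testing idea, the following is a minimal sketch for the term frequency constraint only. It assumes a single-term query and builds each document pair by replacing one non-query token with the query term, so that both documents have the same length and differ by one occurrence of the term. The names `make_tf_pair`, `tf_constraint_rate`, and `toy_tf_score` are hypothetical, the pair construction is an assumption rather than the procedure used in the paper, and the toy scorer merely stands in for a trained deep matching model.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# A scorer maps (query tokens, document tokens) to a relevance score.
# In practice this would wrap a trained matching model (assumption).
Scorer = Callable[[Sequence[str], Sequence[str]], float]


def make_tf_pair(
    query_term: str, doc: Sequence[str], rng: random.Random
) -> Optional[Tuple[List[str], List[str]]]:
    """Return (d_more, d_less) with equal length, where d_more contains
    exactly one more occurrence of query_term than d_less."""
    positions = [i for i, tok in enumerate(doc) if tok != query_term]
    if not positions:
        return None  # every token is already the query term
    d_more = list(doc)
    d_more[rng.choice(positions)] = query_term
    return d_more, list(doc)


def tf_constraint_rate(
    score: Scorer, query_term: str, docs: Sequence[Sequence[str]], seed: int = 0
) -> float:
    """Fraction of constructed pairs for which the scorer ranks the
    document with the higher term frequency strictly higher."""
    rng = random.Random(seed)
    satisfied = 0
    total = 0
    for doc in docs:
        pair = make_tf_pair(query_term, doc, rng)
        if pair is None:
            continue
        d_more, d_less = pair
        total += 1
        if score([query_term], d_more) > score([query_term], d_less):
            satisfied += 1
    return satisfied / total if total else float("nan")


def toy_tf_score(query: Sequence[str], doc: Sequence[str]) -> float:
    """Stand-in scorer: counts query-term occurrences in the document."""
    return float(sum(list(doc).count(t) for t in query))


if __name__ == "__main__":
    docs = [
        ["deep", "text", "matching"],
        ["retrieval", "models", "rank", "documents"],
    ]
    print(tf_constraint_rate(toy_tf_score, "retrieval", docs))
```

Reporting a satisfaction rate rather than a yes/no answer mirrors the statistical framing above: a learned model may violate a constraint on some pairs while still satisfying it with high probability.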