IR models using a pretrained language model significantly outperform lexical approaches such as BM25. In particular, SPLADE, which encodes texts into sparse vectors, is an effective model for practical use because it is robust to out-of-domain datasets. However, SPLADE still struggles with exact matching of words that are low-frequency in the training data. In addition, domain shifts in vocabulary and word frequencies degrade the IR performance of SPLADE. Because supervision data are scarce in the target domain, addressing these domain shifts without supervision data is necessary. This paper proposes an unsupervised domain adaptation method that fills vocabulary and word-frequency gaps. First, we expand the vocabulary and perform continual pretraining with a masked language model on a corpus of the target domain. Then, we multiply SPLADE-encoded sparse vectors by inverse document frequency weights to account for the importance of documents containing low-frequency words. We conducted experiments with our method on datasets that have a large vocabulary gap from the source domain. We show that our method outperforms the present state-of-the-art domain adaptation method. In addition, our method achieves state-of-the-art results when combined with BM25.
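The following is a minimal sketch, not the authors' released code, of the two steps the abstract describes: (1) expanding the tokenizer vocabulary before continual masked-language-model pretraining on a target-domain corpus, and (2) rescaling SPLADE term weights by inverse document frequency. The base checkpoint, the `new_domain_terms` list, and the exact IDF formula are illustrative assumptions.

```python
import math
from collections import Counter

from transformers import AutoTokenizer, AutoModelForMaskedLM

# --- Step 1: vocabulary expansion before continual pretraining ---
# Base checkpoint and domain terms are placeholders, not the paper's actual setup.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

new_domain_terms = ["angiotensin", "transcriptomics"]   # hypothetical target-domain words
tokenizer.add_tokens(new_domain_terms)
model.resize_token_embeddings(len(tokenizer))           # new embedding rows are randomly initialized
# ... continual MLM pretraining on the target-domain corpus would follow here ...

# --- Step 2: IDF weighting of SPLADE-style sparse vectors ---
def idf_weights(corpus_token_ids):
    """Smoothed IDF per vocabulary id, computed from target-domain documents."""
    n_docs = len(corpus_token_ids)
    df = Counter(tok for doc in corpus_token_ids for tok in set(doc))
    return {t: math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0) for t in df}

def reweight(sparse_vec, idf):
    """Multiply each SPLADE term weight by its IDF so low-frequency terms count more."""
    return {tok: w * idf.get(tok, 1.0) for tok, w in sparse_vec.items()}
```

Here `sparse_vec` stands for a SPLADE-encoded document as a mapping from vocabulary id to term weight; applying `reweight` before indexing boosts terms that are rare in the target-domain corpus.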