In the domain of question answering in NLP, the retrieval of Frequently Asked Questions (FAQ) is an important, well-researched sub-area that has been studied for many languages. In response to a user query, a retrieval system typically returns the relevant FAQs from a knowledge base. The efficacy of such a system depends on its ability to establish a semantic match between the query and the FAQs in real time. The task is challenging due to the inherent lexical gap between queries and FAQs, the lack of sufficient context in FAQ titles, the scarcity of labeled data, and high retrieval latency. In this work, we propose a bi-encoder-based query-FAQ matching model that leverages multiple combinations of FAQ fields (such as question, answer, and category) both during model training and inference. Our proposed Multi-Field Bi-Encoder (MFBE) model benefits from the additional context provided by multiple FAQ fields and performs well even with minimal labeled data. We empirically support this claim through experiments on proprietary as well as open-source public datasets in both unsupervised and supervised settings. Our model achieves around 27% and 20% better top-1 accuracy for the FAQ retrieval task on internal and open datasets, respectively, over the best-performing baseline.
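The sketch below illustrates the general idea of bi-encoder FAQ retrieval over multiple field combinations described above. It is only a minimal illustration, not the paper's exact method: the specific field combinations, the encoder (`all-MiniLM-L6-v2` is a stand-in), and the mean aggregation of per-view scores are all assumptions made for this example.

```python
# Minimal sketch: score a query against several textual "views" of each FAQ
# (question, question+answer, question+category) with a shared bi-encoder,
# then aggregate the per-view scores. The aggregation and views are
# illustrative assumptions, not the MFBE training/inference procedure.
from sentence_transformers import SentenceTransformer, util

# Any off-the-shelf bi-encoder can stand in for a trained multi-field encoder.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

faqs = [
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the login page.",
     "category": "Account"},
    {"question": "How can I track my order?",
     "answer": "Open 'My Orders' and select the shipment for live status.",
     "category": "Delivery"},
]

def faq_views(faq):
    """Multiple textual views of one FAQ, each encoded with the same encoder."""
    return [
        faq["question"],
        faq["question"] + " " + faq["answer"],
        faq["question"] + " " + faq["category"],
    ]

N_VIEWS = 3
# Pre-compute embeddings for every view of every FAQ (done offline in practice,
# which keeps retrieval latency low at query time).
view_texts = [view for faq in faqs for view in faq_views(faq)]
view_embs = encoder.encode(view_texts, convert_to_tensor=True,
                           normalize_embeddings=True)

def retrieve(query, top_k=1):
    # Encode the query once and compare it against all FAQ views.
    q_emb = encoder.encode(query, convert_to_tensor=True,
                           normalize_embeddings=True)
    sims = util.cos_sim(q_emb, view_embs)[0]
    # Aggregate the per-view similarities of each FAQ (here: simple mean).
    scores = [sims[i * N_VIEWS:(i + 1) * N_VIEWS].mean().item()
              for i in range(len(faqs))]
    ranked = sorted(range(len(faqs)), key=lambda i: scores[i], reverse=True)
    return [(faqs[i]["question"], scores[i]) for i in ranked[:top_k]]

print(retrieve("I forgot my login password"))
```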