Within question answering in NLP, the retrieval of Frequently Asked Questions (FAQs) is an important, well-researched sub-area that has been studied for many languages. In response to a user query, a retrieval system typically returns the relevant FAQs from a knowledge base. The efficacy of such a system depends on its ability to establish a semantic match between the query and the FAQs in real time. The task is challenging because of the inherent lexical gap between queries and FAQs, the lack of sufficient context in FAQ titles, the scarcity of labeled data, and high retrieval latency. In this work, we propose a bi-encoder-based query-FAQ matching model that leverages multiple combinations of FAQ fields (such as question, answer, and category) during both model training and inference. Our proposed Multi-Field Bi-Encoder (MFBE) model benefits from the additional context provided by multiple FAQ fields and performs well even with minimal labeled data. We support this claim empirically through experiments on proprietary as well as open-source public datasets, in both unsupervised and supervised settings. Our model achieves around 27% and 20% better top-1 accuracy on the FAQ retrieval task on internal and open datasets, respectively, over the best-performing baseline.
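To make the multi-field idea concrete, the following is a minimal sketch rather than the MFBE implementation. It assumes a bag-of-words stand-in for the learned encoder, a hypothetical set of field combinations, and mean aggregation of the per-combination similarities; none of these choices is specified in the abstract, and all function and variable names are illustrative.

```python
import math
import re
from collections import Counter
from statistics import mean

# Hypothetical field combinations; the abstract does not list the ones MFBE uses.
FIELD_COMBINATIONS = [
    ("question",),
    ("question", "answer"),
    ("question", "category"),
    ("question", "answer", "category"),
]


def encode(text):
    """Stand-in encoder: sparse token counts instead of a learned neural encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[token] for token, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def score_faq(query, faq, combinations=FIELD_COMBINATIONS):
    """Encode the query and each field combination separately, then average.

    Averaging is an assumption made for this sketch, not the paper's stated method.
    """
    query_vec = encode(query)
    scores = [
        cosine(query_vec, encode(" ".join(faq[field] for field in combo)))
        for combo in combinations
    ]
    return mean(scores)


def rank_faqs(query, faqs, combinations=FIELD_COMBINATIONS):
    """Return (faq_id, score) pairs sorted from best to worst match."""
    scored = [(faq["id"], score_faq(query, faq, combinations)) for faq in faqs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative data only; not taken from the paper's datasets.
EXAMPLE_FAQS = [
    {
        "id": "faq-login",
        "question": "How do I change my login credentials?",
        "answer": "Open settings and choose reset password.",
        "category": "account",
    },
    {
        "id": "faq-order",
        "question": "Where is my order?",
        "answer": "Track the shipment from the orders page.",
        "category": "shipping",
    },
]

ranking = rank_faqs("reset password", EXAMPLE_FAQS)
```

In this toy data, the query shares no tokens with the title of `faq-login`, so a title-only match scores zero; the combinations that include the answer field are what recover the match. Because the query and the FAQ are encoded independently, the FAQ-side vectors could be computed once offline, which is the usual reason a bi-encoder suits low-latency retrieval.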