Duplicate question detection (DQD) is important for increasing the efficiency of community and automatic question answering systems. Unfortunately, gathering supervised data in a domain is time-consuming and expensive, and our ability to leverage annotations across domains is minimal. In this work, we leverage neural representations and study nearest neighbors for cross-domain generalization in DQD. We first encode question pairs from the source and target domains in a rich representation space; then, using a k-nearest neighbor retrieval-based method, we aggregate the neighbors' labels and distances to rank pairs. We observe robust performance of this method in different cross-domain scenarios on the StackExchange, Spring and Quora datasets, outperforming cross-entropy classification in multiple cases.
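The retrieval step described above can be illustrated with a minimal sketch. It assumes the question pairs have already been encoded as fixed-size vectors, and it uses Euclidean distance with inverse-distance weighting of the neighbors' labels. The abstract does not specify the distance or the aggregation rule, so both are assumptions, as are the function name `knn_duplicate_scores` and its parameters.

```python
import numpy as np


def knn_duplicate_scores(source_vecs, source_labels, target_vecs, k=3, eps=1e-8):
    """Score target pairs by aggregating labels of their k nearest source pairs.

    Hypothetical helper: inverse-distance weighting is an assumed aggregation
    rule, not necessarily the one used in the paper.
    """
    source_vecs = np.asarray(source_vecs, dtype=float)
    source_labels = np.asarray(source_labels, dtype=float)
    target_vecs = np.asarray(target_vecs, dtype=float)

    # Pairwise Euclidean distances, shape (n_target, n_source).
    diffs = target_vecs[:, None, :] - source_vecs[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    # Indices and distances of the k nearest labelled source pairs.
    k = min(k, source_vecs.shape[0])
    nn_idx = np.argsort(dists, axis=1)[:, :k]
    nn_dists = np.take_along_axis(dists, nn_idx, axis=1)

    # Closer neighbors get larger weights; weights sum to 1 per target pair.
    weights = 1.0 / (nn_dists + eps)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted vote over neighbor labels (1 = duplicate, 0 = not duplicate).
    return (weights * source_labels[nn_idx]).sum(axis=1)


if __name__ == "__main__":
    # Toy 2-D "encodings" of labelled source pairs (made-up values).
    source_vecs = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
    source_labels = [1, 1, 0, 0]
    target_vecs = [[0.0, 0.5], [5.0, 5.5]]

    scores = knn_duplicate_scores(source_vecs, source_labels, target_vecs, k=2)
    ranking = np.argsort(-scores)  # target pairs, most likely duplicates first
    print(scores, ranking)
```

Higher scores mean the nearest labelled neighbors are mostly duplicates and close by, so sorting target pairs by score yields the ranking. In practice the vectors would come from a neural encoder rather than the toy values used here.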