We study the coarse-grained selection module in retrieval-based chatbots. Coarse-grained selection is a basic module of a retrieval-based chatbot: it constructs a rough candidate set from the whole database to speed up interaction with customers. So far, there are two kinds of approaches to coarse-grained selection: (1) sparse representation and (2) dense representation. To the best of our knowledge, there has been no systematic comparison of these two approaches in retrieval-based chatbots, and which one is better in real scenarios remains an open question. In this paper, we first systematically compare the two approaches along four aspects: (1) effectiveness; (2) index storage; (3) search time cost; (4) human evaluation. Extensive experimental results demonstrate that the dense representation method significantly outperforms the sparse representation method, but costs more search time and index storage. To overcome these weaknesses of the dense representation method, we propose an ultra-fast, low-storage, and highly effective Deep Semantic Hashing Coarse-grained selection method, called the DSHC model. Specifically, in the DSHC model, a hashing optimizing module consisting of two autoencoder models is stacked on a trained dense representation model, and three loss functions are designed to optimize it. The hash codes produced by the hashing optimizing module effectively preserve the rich semantic and similarity information in the dense vectors. Extensive experimental results show that the DSHC model achieves much faster search and lower storage than the sparse representation method, with limited performance loss compared with the dense representation method. Our source code has been publicly released for future research.
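To make the storage and speed argument concrete, the following is a minimal sketch of coarse-grained selection over binary hash codes. It is not the DSHC model: the learned hashing optimizing module (two autoencoders trained with three loss functions) is replaced here by a random projection followed by a sign threshold, purely as a stand-in. The vector dimension (768), the code length (128 bits), the database size, and the function names `to_hash_codes` and `hamming_search` are all assumptions made for illustration.

```python
import numpy as np


def to_hash_codes(dense, projection):
    """Project dense vectors, binarize by sign, and pack the bits into bytes.

    NOTE: a random projection is only a stand-in for a learned hashing module.
    """
    bits = (dense @ projection > 0).astype(np.uint8)  # shape (n, n_bits)
    return np.packbits(bits, axis=1)                  # shape (n, n_bits // 8)


def hamming_search(query_code, db_codes, k):
    """Return the indices of the k nearest codes and all Hamming distances."""
    xor = np.bitwise_xor(db_codes, query_code)        # differing bits, per byte
    dist = np.unpackbits(xor, axis=1).sum(axis=1)     # count differing bits
    return np.argsort(dist, kind="stable")[:k], dist


rng = np.random.default_rng(0)
dim, n_bits, n_items = 768, 128, 1000                 # assumed sizes

projection = rng.standard_normal((dim, n_bits))       # hypothetical hashing layer
db_dense = rng.standard_normal((n_items, dim)).astype(np.float32)
db_codes = to_hash_codes(db_dense, projection)

# Use a stored item as the query, so its own code is at Hamming distance 0.
query_code = db_codes[42]
top_ids, dist = hamming_search(query_code, db_codes, k=5)

print("dense index bytes:", db_dense.nbytes)
print("hash index bytes: ", db_codes.nbytes)
print("top candidates:   ", top_ids)
```

Under these assumed sizes, each item needs 16 bytes as a hash code instead of 3072 bytes as a float32 vector, and comparison reduces to XOR plus a bit count. How much retrieval quality survives the binarization depends on how the codes are learned, which is the part this sketch does not model.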