Semantic Hashing is a popular family of methods for efficient similarity search in large-scale datasets. In Semantic Hashing, documents are encoded as short binary vectors (i.e., hash codes), such that semantic similarity can be computed efficiently using the Hamming distance. Recent state-of-the-art approaches have used weak supervision to train better-performing hashing models. Inspired by this, we present Semantic Hashing with Pairwise Reconstruction (PairRec), a hashing model based on a discrete variational autoencoder. PairRec first encodes each weakly supervised training pair (a query document and a semantically similar document) into two hash codes, and then learns to reconstruct the same query document from both of these hash codes (i.e., pairwise reconstruction). This pairwise reconstruction enables our model to encode local neighbourhood structures within the hash code directly through the decoder. We experimentally compare PairRec to traditional and state-of-the-art approaches, and obtain significant performance improvements in the task of document similarity search.
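To illustrate why Hamming distance over hash codes is cheap to compute, here is a minimal sketch in Python. It is not PairRec itself: the helper names (`pack_bits`, `hamming_distance`, `nearest_codes`) and the 4-bit codes are hypothetical and chosen only for illustration, and the hash codes are assumed to have already been produced by some trained encoder.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer (first bit is most significant)."""
    code = 0
    for bit in bits:
        code = (code << 1) | (1 if bit else 0)
    return code


def hamming_distance(a, b):
    """Count the bit positions in which two packed hash codes differ."""
    # XOR leaves a 1 exactly where the codes disagree; counting the 1s gives the distance.
    return bin(a ^ b).count("1")


def nearest_codes(query, codes, k):
    """Return the indices of the k codes closest to `query` in Hamming distance.

    Ties are broken by index. This is a linear scan, used here only as an
    illustrative assumption; it is not an indexing scheme from the paper.
    """
    ranked = sorted(
        range(len(codes)),
        key=lambda i: (hamming_distance(query, codes[i]), i),
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical 4-bit hash codes for a query and four documents.
    query = pack_bits([1, 0, 1, 1])
    documents = [0b0000, 0b1010, 0b1011, 0b0100]
    print(nearest_codes(query, documents, k=2))
```

Because each comparison is one XOR followed by a bit count, ranking a collection by Hamming distance needs no floating-point arithmetic, which is what makes short binary codes attractive for large-scale search.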