Semantic hashing has become a crucial component of fast similarity search in many large-scale information retrieval systems, particularly for text data. Variational autoencoders (VAEs) with binary latent variables as hashing codes provide state-of-the-art precision for document retrieval. We propose a pairwise loss function with a discrete-latent VAE that rewards within-class similarity and between-class dissimilarity for supervised hashing. Rather than relying on existing biased gradient estimators, we adopt an unbiased, low-variance gradient estimator that optimizes the hashing function by evaluating the non-differentiable loss over two correlated sets of binary hashing codes, which controls the variance of the gradient estimates. This new semantic hashing framework achieves superior performance compared to state-of-the-art methods, as demonstrated by our comprehensive experiments.
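The key idea of evaluating a non-differentiable loss on two correlated sets of binary codes can be sketched as follows. This is a minimal NumPy illustration in the style of the ARM (Augment-REINFORCE-Merge) estimator, not the paper's exact method: `arm_gradient` and the toy loss are hypothetical names, and the paper's actual loss is a pairwise supervised-hashing loss rather than the simple function used here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def arm_gradient(f, phi, num_samples=1000, rng=None):
    """Unbiased estimate of d/dphi E_{b ~ Bernoulli(sigmoid(phi))}[f(b)].

    Each uniform draw u produces two correlated (antithetic) binary code
    vectors; f is evaluated on both, and their difference weighted by
    (u - 1/2) gives a low-variance REINFORCE-style gradient signal even
    when f itself is non-differentiable.
    """
    rng = np.random.default_rng(rng)
    grads = np.zeros_like(phi, dtype=float)
    for _ in range(num_samples):
        u = rng.uniform(size=phi.shape)
        b1 = (u > sigmoid(-phi)).astype(float)  # first correlated code set
        b2 = (u < sigmoid(phi)).astype(float)   # antithetic code set
        grads += (f(b1) - f(b2)) * (u - 0.5)
    return grads / num_samples
```

Because the two code sets share the same uniform noise, the difference `f(b1) - f(b2)` cancels much of the per-sample noise, which is what keeps the variance of the estimate low compared to a single-sample score-function estimator.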