Recurrent sequence models have recently been successful at many tasks, especially language modeling and machine translation. Nevertheless, it remains challenging to extract good representations from these models. For instance, even though language has a clear hierarchical structure going from characters through words to sentences, this structure is not apparent in current language models. We propose to improve the representations in sequence models by augmenting current approaches with an autoencoder that is forced to compress the sequence through an intermediate discrete latent space. In order to propagate gradients through this discrete representation, we introduce an improved semantic hashing technique. We show that this technique performs well on a newly proposed quantitative efficiency measure. We also analyze the latent codes produced by the model, showing how they correspond to words and phrases. Finally, we present an application of the autoencoder-augmented model to generating diverse translations.
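As a rough illustration of the kind of discretization bottleneck described above, the sketch below shows one common recipe for semantic hashing: a saturating sigmoid squashes logits toward {0, 1}, noise is added during training to encourage robust codes, and the result is binarized. All function names here are hypothetical, and the straight-through gradient trick is indicated only in a comment, since it requires an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def saturating_sigmoid(x):
    # Unlike a plain sigmoid, this saturates to exactly 0 or 1
    # outside a narrow band, which makes binarization easier.
    return np.clip(1.2 / (1.0 + np.exp(-x)) - 0.1, 0.0, 1.0)

def semantic_hash(logits, training=True):
    # Hypothetical sketch: add Gaussian noise to the logits during
    # training, squash with the saturating sigmoid, then threshold
    # to obtain discrete {0, 1} codes.
    if training:
        logits = logits + rng.normal(size=logits.shape)
    soft = saturating_sigmoid(logits)
    hard = (soft > 0.5).astype(soft.dtype)
    # Straight-through trick (in an autodiff framework): the forward
    # pass uses the hard bits while gradients flow through the soft
    # values, e.g.  out = soft + stop_gradient(hard - soft)
    return hard

bits = semantic_hash(np.array([3.0, -2.5, 0.2]), training=False)
```

In a full model, the encoder would emit such bit vectors as the discrete latent code, and the decoder would reconstruct the sequence from them; the straight-through estimator is what lets the encoder receive gradients despite the hard thresholding.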