Learning latent representations from long text sequences is an important first step in many natural language processing applications. Recurrent Neural Networks (RNNs) have become a cornerstone for this challenging task. However, the quality of sentences produced during RNN-based decoding (reconstruction) decreases as the text grows longer. We propose a sequence-to-sequence, purely convolutional and deconvolutional autoencoding framework that is free of this issue, while also being computationally efficient. The proposed method is simple, easy to implement, and can be leveraged as a building block for many applications. We show empirically that, compared to RNNs, our framework is better at reconstructing and correcting long paragraphs. Quantitative evaluation on semi-supervised text classification and summarization tasks demonstrates the potential for better utilization of long unlabeled text data.
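To make the described architecture concrete, below is a minimal sketch (not the authors' released implementation) of a purely convolutional encoder paired with a deconvolutional (transposed-convolution) decoder over token embeddings, written in PyTorch. All hyperparameters here (vocabulary size, 300-d embeddings, 500-d latent vector, 60-token paragraphs) and the per-position linear output layer are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvDeconvTextAE(nn.Module):
    """Sequence-to-sequence autoencoder with a convolutional encoder and a
    deconvolutional (transposed-convolution) decoder; no recurrence."""

    def __init__(self, vocab_size=20000, emb_dim=300, hidden=600, latent=500, seq_len=60):
        super().__init__()
        assert seq_len % 4 == 0, "this sketch assumes seq_len divisible by 4"
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Encoder: strided 1-D convolutions shrink the paragraph, then a final
        # convolution spanning the remaining positions yields one latent vector.
        self.encoder = nn.Sequential(
            nn.Conv1d(emb_dim, hidden, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, latent, seq_len // 4), nn.ReLU(),   # -> (batch, latent, 1)
        )
        # Decoder mirrors the encoder with transposed convolutions, expanding
        # the latent vector back to one feature vector per token position.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent, hidden, seq_len // 4), nn.ReLU(),
            nn.ConvTranspose1d(hidden, hidden, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(hidden, emb_dim, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
        )
        self.to_vocab = nn.Linear(emb_dim, vocab_size)  # per-position word logits (assumption)

    def forward(self, tokens):                        # tokens: (batch, seq_len) int64
        x = self.embed(tokens).transpose(1, 2)        # (batch, emb_dim, seq_len)
        z = self.encoder(x)                           # (batch, latent, 1)
        h = self.decoder(z).transpose(1, 2)           # (batch, seq_len, emb_dim)
        return self.to_vocab(h), z.squeeze(-1)

# Toy usage: reconstruct fixed-length (padded/truncated) paragraphs with a
# word-level cross-entropy loss.
model = ConvDeconvTextAE()
tokens = torch.randint(0, 20000, (8, 60))
logits, z = model(tokens)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
loss.backward()
```

Because every decoding position is produced in parallel from the latent vector rather than conditioned on previously generated words, reconstruction quality in this setup does not degrade step by step with paragraph length, which is the property the abstract highlights over RNN decoders.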