Recent progress in pretrained Transformer-based language models has shown great success in learning contextual representations of text. However, due to the quadratic complexity of self-attention, most pretrained Transformer models can only handle relatively short text, and modeling very long documents remains a challenge. In this work, we propose to use a graph attention network on top of an available pretrained Transformer model to learn document embeddings. This graph attention network allows us to leverage the high-level semantic structure of the document. In addition, based on our graph document model, we design a simple contrastive learning strategy to pretrain our models on a large unlabeled corpus. Empirically, we demonstrate the effectiveness of our approach on document classification and document retrieval tasks.
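To make the high-level idea concrete, below is a minimal sketch (not the authors' code) of one way such a model could be wired up: a long document is split into segments, each segment is encoded with a pretrained Transformer, a graph attention layer aggregates the segment nodes into a single document embedding, and an InfoNCE-style contrastive loss is applied over two views of the same documents. The choice of "bert-base-uncased", sentence-based segmentation, a fully-connected segment graph, mean pooling, and the InfoNCE objective are all illustrative assumptions, not details taken from the paper.

```python
# Sketch of a graph-attention document encoder over pretrained Transformer segments.
# All architectural choices below (segmentation, graph, pooling, loss) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


class GraphAttentionLayer(nn.Module):
    """Single-head, GAT-style attention over a dense adjacency matrix."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, dim), adj: (num_nodes, num_nodes), 1 where an edge exists.
        h = self.proj(x)
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)
        return F.elu(alpha @ h)


class GraphDocEncoder(nn.Module):
    """Encode segments with a pretrained Transformer, then aggregate the
    segment nodes with graph attention into one document embedding."""

    def __init__(self, model_name: str = "bert-base-uncased", max_len: int = 128):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        self.max_len = max_len
        self.gat = GraphAttentionLayer(self.encoder.config.hidden_size)

    def forward(self, document: str) -> torch.Tensor:
        # Naive sentence segmentation; the paper may use a different scheme.
        segments = [s.strip() for s in document.split(".") if s.strip()]
        batch = self.tokenizer(segments, padding=True, truncation=True,
                               max_length=self.max_len, return_tensors="pt")
        # Use each segment's [CLS] vector as its node feature.
        node_feats = self.encoder(**batch).last_hidden_state[:, 0]
        n = node_feats.size(0)
        adj = torch.ones(n, n)  # fully-connected segment graph (assumption)
        node_feats = self.gat(node_feats, adj)
        return node_feats.mean(dim=0)  # mean-pool nodes into a document vector


def info_nce(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.1):
    """InfoNCE-style contrastive loss over two views of the same documents.
    Row i of emb_a and emb_b form a positive pair; other rows are negatives."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```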