Document summarization aims to create a precise and coherent summary of a text document. Most deep learning summarization models are developed for English and typically require a large training corpus as well as efficient pre-trained language models and tools. Summarization models for low-resource Indian languages, however, are often limited by rich morphological variation and by syntactic and semantic differences. In this paper, we propose GAE-ISumm, an unsupervised Indic summarization model that extracts summaries from text documents. In particular, GAE-ISumm uses a Graph Autoencoder (GAE) to jointly learn text representations and a document summary. We also provide TELSUM, a manually annotated Telugu summarization dataset, to experiment with GAE-ISumm. Further, we experiment with the most publicly available Indian-language summarization datasets to investigate the effectiveness of GAE-ISumm on other Indian languages. Our experiments with GAE-ISumm across seven languages yield the following observations: (i) it is competitive with or better than state-of-the-art results on all datasets, (ii) it establishes benchmark results on TELSUM, and (iii) including positional and cluster information in the proposed model improves summary quality.