Despite the success of neural sequence-to-sequence models for abstractive text summarization, they have a few shortcomings: they tend to reproduce factual details inaccurately and to repeat themselves. We propose a hybrid pointer-generator network to address these shortcomings of inadequate factual reproduction and phrase repetition. We augment the attention-based sequence-to-sequence model with a hybrid pointer-generator network that can produce out-of-vocabulary words, improving the accuracy of reproduced factual details, and with a coverage mechanism that discourages repetition. The model produces a reasonably sized output text that preserves the conceptual integrity and factual information of the input article. For evaluation, we primarily employed "BANSData", a widely adopted, publicly available Bengali dataset. Additionally, we prepared a large-scale dataset called "BANS-133", which consists of 133k Bangla news articles paired with human-written summaries. With the proposed model, we achieved ROUGE-1 and ROUGE-2 scores of 0.66 and 0.41 on the "BANSData" dataset and 0.67 and 0.42 on the "BANS-133" dataset, respectively. We demonstrated that the proposed system surpasses previous state-of-the-art Bengali abstractive summarization techniques and remains stable on a larger dataset. The "BANS-133" dataset and code base will be made publicly available for research.
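The two mechanisms the abstract names, the pointer-generator's mixed copy/generate distribution and the coverage penalty, can be sketched in a few lines. This is a minimal illustrative NumPy sketch of the standard formulation (See et al., 2017); the function names and shapes are assumptions, not the authors' code.

```python
import numpy as np

def final_distribution(p_gen, vocab_dist, attention, src_ids, extended_vocab_size):
    """Pointer-generator output: mix the decoder's vocabulary distribution
    (weight p_gen) with copy probabilities taken from the attention over
    source positions (weight 1 - p_gen). Copying lets the model emit
    out-of-vocabulary source tokens via the extended vocabulary."""
    final = np.zeros(extended_vocab_size)
    final[: len(vocab_dist)] = p_gen * vocab_dist        # generate from vocabulary
    for pos, tok in enumerate(src_ids):                  # copy from the source text
        final[tok] += (1.0 - p_gen) * attention[pos]
    return final

def coverage_loss(attention, coverage):
    """Coverage penalty sum_i min(a_i, c_i), where c is the running sum of
    past attention; penalizing re-attended positions discourages repetition."""
    return np.minimum(attention, coverage).sum()
```

If `vocab_dist` and `attention` each sum to one, the mixed distribution also sums to one, since `p_gen * 1 + (1 - p_gen) * 1 = 1`.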