Research on text summarization for low-resource Indian languages has been limited by the scarcity of relevant datasets. This paper presents a summary of various deep-learning approaches applied to the ILSUM 2022 Indic language summarization datasets. The ILSUM 2022 dataset consists of news articles written in Indian English, Hindi, and Gujarati, together with their ground-truth summaries. In our work, we explore different pre-trained seq2seq models and fine-tune them on the ILSUM 2022 datasets. In our case, the fine-tuned state-of-the-art PEGASUS model performed best for English, the fine-tuned IndicBART model with augmented data performed best for Hindi, and the fine-tuned PEGASUS model combined with a translation mapping-based approach performed best for Gujarati. The generated summaries were evaluated using ROUGE-1, ROUGE-2, and ROUGE-4 as the evaluation metrics.
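To make the evaluation metrics concrete, below is a minimal pure-Python sketch of ROUGE-N F1, the n-gram-overlap score family (ROUGE-1, ROUGE-2, ROUGE-4) used above. This is an illustrative simplification: production evaluations typically use a standard ROUGE implementation with language-specific tokenization and stemming, which this sketch omits.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n):
    """ROUGE-N F1 between a candidate summary and a reference summary.

    Simplified sketch: whitespace tokenization, lowercasing, no stemming.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 5 of 6 unigrams match in each direction, so P = R = F1 = 5/6.
score = rouge_n_f1("the cat sat on the mat", "the cat lay on the mat", 1)
```

ROUGE-2 and ROUGE-4 are the same computation with `n=2` and `n=4`; longer n-grams reward matching longer contiguous phrases from the reference.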