This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.
翻译:这项工作介绍了Itihasa,这是一个大型翻译数据集,包含93 000对梵文shlokas及其英文译文,从两个印度史诗《Ramayana》和《Mahabharata》中提取。我们首先描述了整理这样一个数据集背后的动机,接着进行了经验分析,以找出其细微之处。然后,我们将标准翻译模型的性能以该文集为基准,并表明甚至最先进的变压器结构都表现不佳,强调了数据集的复杂性。