Machine translation (MT) has attracted substantial research attention in recent years. However, Persian MT remains underexplored compared with the extensive work on high-resource languages such as English. Moreover, although statistical machine translation has been studied on several Persian datasets, there is currently no standard baseline for transformer-based text-to-text models on each corpus. In this study, we collected and analysed the most popular and valuable parallel corpora used for Persian–English translation. We then fine-tuned and evaluated two state-of-the-art attention-based seq2seq models on each dataset separately (48 results in total). We hope this paper will assist researchers in comparing their Persian-to-English and English-to-Persian machine translation results against a standard baseline.
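The abstract does not name the evaluation metric, but BLEU is the de facto standard for comparing MT systems against a baseline. The following is a minimal, self-contained sketch of sentence-level BLEU (modified n-gram precision with a brevity penalty), assuming whitespace-tokenized input; published baselines would typically be computed with a standard toolkit such as sacreBLEU rather than a hand-rolled implementation like this one.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU for a single reference (illustrative sketch only)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clipped (modified) n-gram matches.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # geometric mean collapses if any precision is zero
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = (1.0 if len(candidate) >= len(reference)
          else math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


# A perfect hypothesis scores 1.0; a fully disjoint one scores 0.0.
ref = "the cat sat on the mat".split()
print(bleu(ref, ref))
print(bleu("dogs run fast outside today always".split(), ref))
```

In practice, corpus-level BLEU (aggregating clipped counts over all sentences before taking the geometric mean) is preferred over averaging sentence scores, since short sentences otherwise dominate the result.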