Community Question Answering (CQA) forums provide answers to many real-life questions. Thanks to their large size, these forums are very popular among machine learning researchers. Automatic answer selection, answer ranking, question retrieval, expert finding, and fact-checking are examples of learning tasks performed using CQA data. In this paper, we present PerCQA, the first Persian dataset for CQA. This dataset contains questions and answers crawled from the most well-known Persian forum. After data acquisition, we develop rigorous annotation guidelines through an iterative process and then annotate the question-answer pairs in SemEvalCQA format. PerCQA contains 989 questions and 21,915 annotated answers. We make PerCQA publicly available to encourage more research in Persian CQA. We also build strong benchmarks for the task of answer selection in PerCQA using mono- and multi-lingual pre-trained language models.