In this paper, we present the first publicly available multilingual FAQ dataset. We collected around 6M FAQ pairs from the web, in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its own challenges: duplication of content and uneven distribution of topics. We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset. Our experiments reveal that a multilingual model based on XLM-RoBERTa achieves the best results, except for English. Lower-resource languages seem to learn from one another, as a multilingual model achieves a higher MRR than language-specific ones. Our qualitative analysis reveals the brittleness of the model to simple word changes. We publicly release our dataset, model and training script.
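To make the evaluation setup concrete, the sketch below shows a DPR-style bi-encoder built on XLM-RoBERTa scoring FAQ questions against answers, with MRR computed over in-batch gold pairings. It is an illustrative minimal example under assumed choices (mean pooling, dot-product scoring, the `xlm-roberta-base` checkpoint), not the exact training or evaluation code released with the paper.

```python
# Minimal sketch of a DPR-style bi-encoder over FAQ pairs, assuming
# XLM-RoBERTa as a shared encoder with mean pooling and dot-product
# scoring; illustrative only, not the paper's released training script.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def embed(texts):
    """Mean-pool the last hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)              # (B, H)

# Toy multilingual FAQ pairs (hypothetical examples).
questions = ["How do I reset my password?",
             "Comment changer ma photo de profil ?"]
answers = ["Click 'Forgot password' on the login page.",
           "Allez dans les paramètres puis cliquez sur votre avatar."]

# In-batch scoring: every question against every answer, as in DPR.
scores = embed(questions) @ embed(answers).T                 # (B, B)

# Mean reciprocal rank, treating the diagonal as the gold pairing.
gold = torch.arange(len(questions)).unsqueeze(1)
ranks = (scores.argsort(dim=1, descending=True) == gold).nonzero()[:, 1] + 1
mrr = (1.0 / ranks.float()).mean()
print(f"MRR: {mrr:.3f}")
```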