In this paper, we present the first publicly available multilingual FAQ dataset. We collected around 6M FAQ pairs from the web, in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its own challenges: duplication of content and uneven distribution of topics. We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset. Our experiments reveal that a multilingual model based on XLM-RoBERTa achieves the best results, except for English. Lower-resource languages seem to learn from one another, as a multilingual model achieves a higher MRR than language-specific ones. Our qualitative analysis reveals the brittleness of the model to simple word changes. We publicly release our dataset, model and training script.
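To make the evaluation setup concrete, the sketch below shows a DPR-style bi-encoder built on XLM-RoBERTa scoring FAQ questions against answers, with MRR computed over in-batch gold pairings. It is an illustrative minimal example under assumed choices (mean pooling, dot-product scoring, the `xlm-roberta-base` checkpoint), not the exact training or evaluation code released with the paper.

```python
# Minimal sketch of a DPR-style bi-encoder over FAQ pairs, assuming
# XLM-RoBERTa as a shared encoder with mean pooling and dot-product
# scoring; illustrative only, not the paper's released training script.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def embed(texts):
    """Mean-pool the last hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)              # (B, H)

# Toy multilingual FAQ pairs (hypothetical examples).
questions = ["How do I reset my password?",
             "Comment changer ma photo de profil ?"]
answers = ["Click 'Forgot password' on the login page.",
           "Allez dans les paramètres puis cliquez sur votre avatar."]

# In-batch scoring: every question against every answer, as in DPR.
scores = embed(questions) @ embed(answers).T                 # (B, B)

# Mean reciprocal rank, treating the diagonal as the gold pairing.
gold = torch.arange(len(questions)).unsqueeze(1)
ranks = (scores.argsort(dim=1, descending=True) == gold).nonzero()[:, 1] + 1
mrr = (1.0 / ranks.float()).mean()
print(f"MRR: {mrr:.3f}")
```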