The availability of large, high-quality datasets has been one of the main drivers of recent progress in question answering (QA). Such annotated datasets, however, are difficult and costly to collect, and rarely exist in languages other than English, rendering QA technology inaccessible to underrepresented languages. An alternative to building large monolingual training datasets is to leverage pre-trained language models (PLMs) in a few-shot learning setting. Our approach, QAmeleon, uses a PLM to automatically generate multilingual data on which QA models are trained, thus avoiding costly annotation. Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines, bridges nearly 60% of the gap between an English-only baseline and a fully supervised upper bound trained on almost 50,000 hand-labeled examples, and always leads to substantial improvements over fine-tuning a QA model directly on labeled examples in low-resource settings. Experiments on the TyDiQA-GoldP and MLQA benchmarks show that few-shot prompt tuning for data synthesis scales across languages and is a viable alternative to large-scale annotation.
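To make the pipeline concrete, the following is a minimal sketch of the synthesis loop the abstract describes: five labeled examples per language seed a few-shot prompt, a PLM completes a QA pair for each unlabeled passage, and the resulting synthetic data is what a downstream QA model would be fine-tuned on. The helpers `call_plm`, `build_prompt`, and `synthesize_dataset` are hypothetical stand-ins for illustration; the paper itself prompt-tunes soft prompts on a large PLM rather than using plain in-context prompting.

```python
# Illustrative sketch of a QAmeleon-style data-synthesis loop.
# `call_plm` is a hypothetical placeholder, NOT the paper's implementation:
# the actual work prompt-tunes a large PLM for generation.

from dataclasses import dataclass


@dataclass
class QAExample:
    passage: str
    question: str
    answer: str


def build_prompt(seed_examples: list[QAExample], new_passage: str) -> str:
    """Assemble a few-shot prompt: the 5 labeled seed examples for one
    language, followed by an unlabeled passage for which the PLM should
    synthesize a question-answer pair."""
    parts = [
        f"Passage: {ex.passage}\nQuestion: {ex.question}\nAnswer: {ex.answer}\n"
        for ex in seed_examples
    ]
    parts.append(f"Passage: {new_passage}\nQuestion:")
    return "\n".join(parts)


def call_plm(prompt: str) -> str:
    """Hypothetical PLM call. Returns a fixed placeholder completion so the
    sketch runs end to end; a real system would query a (prompt-tuned) LM."""
    return " What is discussed here?\nAnswer: a placeholder span"


def synthesize_dataset(
    seed_examples: list[QAExample], unlabeled_passages: list[str]
) -> list[QAExample]:
    """Generate one synthetic QA pair per unlabeled passage; the output is
    the training set for a downstream QA model."""
    synthetic = []
    for passage in unlabeled_passages:
        completion = call_plm(build_prompt(seed_examples, passage))
        question, _, answer = completion.partition("\nAnswer:")
        synthetic.append(QAExample(passage, question.strip(), answer.strip()))
    return synthetic


if __name__ == "__main__":
    # Five seed examples per language, as in the abstract.
    seeds = [
        QAExample(f"passage {i}", f"question {i}?", f"answer {i}") for i in range(5)
    ]
    data = synthesize_dataset(seeds, ["An unlabeled passage in a target language."])
    print(data[0])
```

A real instantiation would repeat this per target language and filter low-quality generations before fine-tuning; the sketch only shows the data flow, not those quality-control steps.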