Many datasets have been created for training reading comprehension models, and a natural question is whether we can combine them to build models that (1) perform better on all of the training datasets and (2) generalize and transfer better to new datasets. Prior work has addressed this goal by training one network simultaneously on multiple datasets, which works well on average but is prone to over- or under-fitting different sub-distributions and might transfer worse compared to source models with more overlap with the target dataset. Our approach is to model multi-dataset question answering with a collection of single-dataset experts, by training a collection of lightweight, dataset-specific adapter modules (Houlsby et al., 2019) that share an underlying Transformer model. We find that these Multi-Adapter Dataset Experts (MADE) outperform all our baselines in terms of in-distribution accuracy, and simple methods based on parameter-averaging lead to better zero-shot generalization and few-shot transfer performance, offering a strong and versatile starting point for building new reading comprehension systems.
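To make the architecture concrete, below is a minimal sketch (in PyTorch) of a bottleneck adapter in the style of Houlsby et al. (2019) and of parameter averaging across dataset-specific adapters. The class and function names, the bottleneck dimension, and the use of GELU are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter (Houlsby et al., 2019, style): down-projection,
    nonlinearity, up-projection, and a residual connection. Illustrative
    hyperparameters; not the paper's exact setup."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the shared Transformer's representation.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


def average_adapters(adapters):
    """Average the parameters of several dataset-specific adapters to obtain
    a single adapter, e.g., as an initialization for transfer to a new dataset."""
    avg = Adapter(adapters[0].down.in_features, adapters[0].down.out_features)
    avg_state = avg.state_dict()
    for name in avg_state:
        avg_state[name] = torch.stack(
            [a.state_dict()[name] for a in adapters]
        ).mean(dim=0)
    avg.load_state_dict(avg_state)
    return avg
```

In this sketch, each training dataset would get its own `Adapter` inserted into a frozen, shared Transformer; `average_adapters` illustrates one simple parameter-averaging scheme of the kind the abstract refers to for zero-shot and few-shot transfer.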