A deployed question answering (QA) model can easily fail when the test data exhibits a distribution shift relative to the training data. Robustness tuning (RT) methods have been widely studied to improve model robustness against distribution shifts before deployment. However, can we improve a model after deployment? To answer this question, we evaluate test-time adaptation (TTA), which adapts a model to the test distribution after deployment. We first introduce COLDQA, a unified evaluation benchmark for robust QA against text corruption and changes in language and domain. We then evaluate existing TTA methods on COLDQA and compare them to RT methods. We also propose a novel TTA method called online imitation learning (OIL). Through extensive experiments, we find that TTA is comparable to RT methods, and that applying TTA after RT can significantly boost performance on COLDQA. Our proposed OIL makes TTA more robust to variations in hyper-parameters and to test distributions that change over time.