The dominant NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications (e.g., sentiment classification, span-prediction-based question answering, or machine translation). However, it builds upon the assumption that the data distribution is stationary, i.e., that the data is sampled from a fixed distribution both at training and test time. This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information. Moreover, it is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime. The first goal of this thesis is to characterize the different forms this shift can take in the context of natural language processing, and to propose benchmarks and evaluation metrics to measure its effect on current deep learning architectures. We then proceed to take steps to mitigate the effect of distributional shift on NLP models. To this end, we develop methods based on parametric reformulations of the distributionally robust optimization framework. Empirically, we demonstrate that these approaches yield more robust models, as shown on a selection of realistic problems. In the third and final part of this thesis, we explore ways of efficiently adapting existing models to new domains or tasks. Our contribution to this topic takes inspiration from information geometry to derive a new gradient update rule which alleviates catastrophic forgetting issues during adaptation.