Natural language processing (NLP) algorithms are rapidly improving but often struggle when applied to out-of-distribution examples. A prominent approach to mitigating the domain gap is domain adaptation, where a model trained on a source domain is adapted to a new target domain. We present a new learning setup, ``domain adaptation from scratch'', which we believe to be crucial for extending the reach of NLP to sensitive domains in a privacy-preserving manner. In this setup, we aim to efficiently annotate data from a set of source domains such that the trained model performs well on a sensitive target domain, from which no data is available for annotation. Our study compares several approaches for this challenging setup, ranging from data selection and domain adaptation algorithms to active learning paradigms, on two NLP tasks: sentiment analysis and named entity recognition. Our results suggest that each of these approaches narrows the domain gap, and that combining them further improves performance.