Domain adaptation of neural networks commonly relies on three training phases: pretraining, training on selected data, and then fine-tuning. Data selection improves target-domain generalization by continuing training on pretraining data identified with the help of a small sample of target-domain data. This work examines the benefit of data selection for language modeling and machine translation. Our experiments assess the complementarity of selection with fine-tuning and yield practical recommendations: (i) selected data must be similar to the fine-tuning domain, but not so similar that it erodes the complementary effect of fine-tuning; (ii) there is a trade-off between selecting little data for fast but limited progress and selecting much data for slow but long-lasting progress; (iii) data selection can be applied early during pretraining, with performance gains comparable to those of a long pretraining session; (iv) data selection from domain classifiers is often more effective than the popular contrastive data selection method.
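To make the two selection strategies concrete, the sketch below contrasts them under deliberately simplified assumptions: the domain classifier is approximated by a bag-of-words logistic regression trained on the small target-domain sample (positives) against random pretraining data (negatives), and contrastive data selection, which ranks examples by the difference in log-probability between an in-domain and a general language model, is approximated with add-alpha smoothed unigram models. All function and variable names are illustrative, not from the paper, and the paper's actual methods operate on neural models rather than these lightweight stand-ins.

```python
# Minimal sketch of classifier-based vs. contrastive data selection.
# Assumptions: sklearn available; texts are whitespace-tokenizable strings.
import math
import random
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def classifier_scores(pretrain_texts, target_sample, negatives_per_positive=4):
    """Score pretraining examples with a domain classifier trained on the
    small target-domain sample (positives) vs. random pretraining data
    (negatives); a higher score means more target-like."""
    negatives = random.sample(
        pretrain_texts,
        min(len(pretrain_texts), negatives_per_positive * len(target_sample)))
    texts = list(target_sample) + negatives
    labels = [1] * len(target_sample) + [0] * len(negatives)
    vec = CountVectorizer(ngram_range=(1, 2), min_df=1)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(texts), labels)
    return clf.predict_proba(vec.transform(pretrain_texts))[:, 1]


def _avg_unigram_logprob(tokens, counts, total, vocab, alpha=1.0):
    # Add-alpha smoothed per-token unigram log-probability.
    return sum(math.log((counts[t] + alpha) / (total + alpha * vocab))
               for t in tokens) / max(len(tokens), 1)


def contrastive_scores(pretrain_texts, target_sample):
    """Contrastive scores: per-token log-probability under an in-domain
    model minus that under a general model fit on the pretraining pool."""
    in_counts = Counter(t for s in target_sample for t in s.split())
    gen_counts = Counter(t for s in pretrain_texts for t in s.split())
    vocab = len(set(in_counts) | set(gen_counts))
    in_total, gen_total = sum(in_counts.values()), sum(gen_counts.values())
    scores = []
    for s in pretrain_texts:
        toks = s.split()
        scores.append(
            _avg_unigram_logprob(toks, in_counts, in_total, vocab)
            - _avg_unigram_logprob(toks, gen_counts, gen_total, vocab))
    return scores


def select_top(pretrain_texts, scores, fraction=0.1):
    """Keep the top-scoring fraction of the pretraining pool for the
    selected-data training phase; fine-tuning on the target sample follows."""
    k = max(1, int(fraction * len(pretrain_texts)))
    ranked = sorted(zip(scores, pretrain_texts), reverse=True)
    return [text for _, text in ranked[:k]]
```

The `fraction` knob in `select_top` is where recommendation (ii) bites: a small fraction yields fast but limited progress, a large one slow but longer-lasting progress.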