Large NLP models have recently shown impressive performance in language understanding tasks, typically evaluated by their fine-tuned performance on downstream tasks. Alternatively, probing has received increasing attention as a lightweight method for interpreting the intrinsic mechanisms of large NLP models. In probing, post-hoc classifiers are trained on "out-of-domain" datasets that diagnose specific abilities. While probing language models has led to insightful findings, these findings appear disjoint from the development of the models themselves. This paper explores the utility of probing deep NLP models to extract a proxy signal widely used in model development -- the fine-tuning performance. We find that the accuracies of only three probing tests suffice to predict fine-tuning performance with errors $40\%$--$80\%$ smaller than baselines. We further discuss possible avenues where probing can empower the development of deep NLP models.
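As a minimal illustration of the idea (not the paper's actual pipeline), one could fit a simple regressor that maps the accuracies of a few probing tests to fine-tuned performance across model checkpoints. The probing tasks, checkpoints, and numbers in the sketch below are hypothetical, and the linear model is only one possible choice of predictor.

```python
# Hypothetical sketch: predict fine-tuning accuracy from three probing-test
# accuracies with a simple linear regressor. All data below are made up
# for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Rows = model checkpoints; columns = accuracies on three probing tests
# (the choice of probing tasks is an assumption, not the paper's).
probe_acc = np.array([
    [0.81, 0.62, 0.55],
    [0.84, 0.66, 0.58],
    [0.79, 0.60, 0.52],
    [0.88, 0.71, 0.63],
    [0.90, 0.74, 0.66],
])
# Observed fine-tuning accuracy on a downstream task for the same checkpoints.
finetune_acc = np.array([0.71, 0.74, 0.69, 0.79, 0.82])

# Fit on the first four checkpoints, predict the held-out one.
reg = LinearRegression().fit(probe_acc[:4], finetune_acc[:4])
pred = reg.predict(probe_acc[4:])
print("predicted:", pred, "actual:", finetune_acc[4:])
print("MAE:", mean_absolute_error(finetune_acc[4:], pred))
```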