With the increase in availability of large pre-trained language models (LMs) in Natural Language Processing (NLP), it becomes critical to assess their fit for a specific target task a priori, as fine-tuning the entire space of available LMs is computationally prohibitive and unsustainable. However, encoder transferability estimation has received little to no attention in NLP. In this paper, we propose to generate quantitative evidence to predict which LM, out of a pool of models, will perform best on a target task without having to fine-tune all candidates. We provide a comprehensive study on LM ranking for 10 NLP tasks spanning the two fundamental problem types of classification and structured prediction. We adopt the state-of-the-art Logarithm of Maximum Evidence (LogME) measure from Computer Vision (CV) and find that it positively correlates with final LM performance in 94% of the setups. In the first study of its kind, we further compare transferability measures with the de facto standard of human practitioner ranking, finding that evidence from quantitative metrics is more robust than pure intuition and can help identify unexpected LM candidates.
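For illustration, the following is a minimal sketch of how a LogME score could be computed for one candidate encoder and then used to rank a pool of models without fine-tuning. It assumes a `features` array of frozen-encoder representations with shape (N, D) and a `labels` array of N class ids; the function and variable names are illustrative, not taken from the paper, and the fixed-point update follows the procedure described by You et al. (2021), with per-class evidences averaged over one-vs-rest targets.

import numpy as np

def logme_score(features: np.ndarray, labels: np.ndarray) -> float:
    """Approximate LogME for a classification task; higher suggests a better fit."""
    f = features.astype(np.float64)                 # (N, D) frozen-encoder features
    N, D = f.shape
    # Eigendecomposition of f^T f yields the squared singular spectrum used below.
    v, s, vh = np.linalg.svd(f.T @ f, full_matrices=True)   # (D, D), (D,), (D, D)
    evidences = []
    for c in np.unique(labels):
        y = (labels == c).astype(np.float64)        # one-vs-rest target vector
        alpha, beta = 1.0, 1.0
        tmp = vh @ (f.T @ y)
        for _ in range(11):                         # fixed-point iteration on (alpha, beta)
            lam = alpha / beta
            gamma = (s / (s + lam)).sum()
            m = v @ (tmp * beta / (alpha + beta * s))   # posterior mean of linear weights
            alpha_de = (m * m).sum()
            beta_de = ((y - f @ m) ** 2).sum()
            alpha = gamma / (alpha_de + 1e-10)
            beta = (N - gamma) / (beta_de + 1e-10)
        # Log marginal evidence of the Bayesian linear model, normalized per example.
        evidence = (D / 2.0 * np.log(alpha)
                    + N / 2.0 * np.log(beta)
                    - 0.5 * np.sum(np.log(alpha + beta * s))
                    - beta / 2.0 * beta_de
                    - alpha / 2.0 * alpha_de
                    - N / 2.0 * np.log(2 * np.pi))
        evidences.append(evidence / N)
    return float(np.mean(evidences))

# Hypothetical usage: extract features once per candidate encoder, score, and rank.
# scores = {name: logme_score(extract_features(name, texts), labels) for name in candidates}
# ranking = sorted(scores, key=scores.get, reverse=True)

In this setup the only per-candidate cost is a single forward pass to extract features plus the closed-form evidence computation, which is what makes ranking a pool of LMs tractable compared with fine-tuning each one.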