Large pre-trained language models (PLMs) have shown promising in-context learning abilities. However, due to the backbone transformer architecture, existing PLMs are bottlenecked by memory and computational cost when scaling up to a large context size, leaving instruction tuning and in-context learning with many demonstration examples, as well as long-range language modeling, under-explored. In this study, we propose EVALM, a long-range language model based on an efficient transformer mechanism. EVALM is trained with 8k tokens per batch line and can be evaluated on contexts of up to 256k tokens via extrapolation, 128 times the limit of existing PLMs (e.g., GPT-3). Based on EVALM, we efficiently scale up the number of examples in both instruction tuning and in-context learning to explore the limits of the benefits from more annotated data. Experimental results on a diverse set of tasks show that EVALM achieves 4.1% higher accuracy on average, and that the average context length at which tasks reach their best accuracy is around 12k tokens. We find that in-context learning achieves higher performance with more demonstrations under many-shot instruction tuning (8k), and that further extending the instruction length (16k) raises the upper bound of scaling in-context learning.
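To make the "scaling up the number of examples" setting concrete, below is a minimal sketch (not the authors' code) of how a many-shot in-context learning prompt can be packed into a long context budget such as 8k or 16k tokens. The function names and the whitespace-based token count are illustrative assumptions; a real tokenizer and the paper's own prompt format would be used in practice.

```python
# Minimal sketch: pack as many demonstrations as fit into a long context
# budget, then append the test query. The token count is approximated by
# whitespace splitting purely for illustration.

from typing import List, Tuple


def approx_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer's length function."""
    return len(text.split())


def build_many_shot_prompt(
    demonstrations: List[Tuple[str, str]],  # (input, label) pairs
    query: str,
    context_budget: int = 8192,             # tokens reserved for demonstrations
) -> str:
    """Concatenate demonstrations until the budget is exhausted.

    Growing the budget (8k -> 16k -> ...) admits more demonstrations,
    which is the scaling axis the abstract refers to.
    """
    parts, used = [], 0
    for x, y in demonstrations:
        block = f"Input: {x}\nOutput: {y}\n\n"
        cost = approx_token_count(block)
        if used + cost > context_budget:
            break
        parts.append(block)
        used += cost
    parts.append(f"Input: {query}\nOutput:")
    return "".join(parts)


if __name__ == "__main__":
    demos = [("great movie", "positive"), ("boring plot", "negative")] * 1000
    prompt = build_many_shot_prompt(demos, "a delightful surprise", context_budget=8192)
    print(approx_token_count(prompt), "approx tokens in prompt")
```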