In this paper, we introduce multi-scale structure as prior knowledge into self-attention modules. We propose a Multi-Scale Transformer, which uses multi-scale multi-head self-attention to capture features at different scales. Based on linguistic considerations and an analysis of a Transformer (BERT) pre-trained on a large corpus, we further design a strategy to control the scale distribution of each layer. Results on three kinds of tasks (21 datasets) show that our Multi-Scale Transformer consistently and significantly outperforms the standard Transformer on small and moderate-sized datasets.
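To make the idea of scale-restricted heads concrete, the following is a minimal PyTorch sketch of multi-head self-attention in which each head only attends within a fixed local window (its "scale"). The class name, the particular scale values, the interpretation of a scale as a half-window radius, and the absence of a global head are illustrative assumptions for exposition, not the paper's released implementation or its layer-wise scale-distribution strategy.

```python
import math
import torch
import torch.nn as nn

class MultiScaleSelfAttention(nn.Module):
    """Illustrative multi-scale multi-head self-attention:
    head h may attend only to positions within distance scales[h]."""

    def __init__(self, d_model, scales=(1, 3, 5, 7)):  # scales are assumed values
        super().__init__()
        self.n_heads = len(scales)
        assert d_model % self.n_heads == 0
        self.d_head = d_model // self.n_heads
        self.scales = scales
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # -> (batch, heads, seq_len, d_head)
            return t.view(B, L, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (B, H, L, L)

        # Band mask per head: position i may attend to j iff |i - j| <= scale of that head.
        pos = torch.arange(L, device=x.device)
        dist = (pos[None, :] - pos[:, None]).abs()                  # (L, L)
        masks = torch.stack([dist <= s for s in self.scales])       # (H, L, L)
        scores = scores.masked_fill(~masks.unsqueeze(0), float("-inf"))

        attn = scores.softmax(dim=-1)
        ctx = (attn @ v).transpose(1, 2).contiguous().view(B, L, -1)
        return self.out(ctx)
```

In this sketch, small-scale heads model local n-gram-like patterns while larger-scale heads see broader context; the paper's scale-distribution strategy would correspond to choosing different `scales` tuples for different layers.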