Multiscale feature hierarchies have witnessed great success in computer vision. This motivates researchers to design multiscale Transformers for natural language processing, mostly based on the self-attention mechanism, for example, by restricting the receptive field across heads or extracting local fine-grained features via convolutions. However, most existing works directly model local features while ignoring word-boundary information. This results in redundant and ambiguous attention distributions that lack interpretability. In this work, we define scales in terms of linguistic units, including sub-words, words, and phrases. We build a multiscale Transformer model by establishing relationships among these scales based on word-boundary information and phrase-level prior knowledge. The proposed \textbf{U}niversal \textbf{M}ulti\textbf{S}cale \textbf{T}ransformer, namely \textsc{Umst}, is evaluated on two sequence generation tasks. Notably, it yields consistent performance gains over strong baselines on several test sets without sacrificing efficiency.