Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.
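To make the synthetic setup concrete, below is a minimal sketch (not the authors' code) of the kind of experiment the abstract describes: a single self-attention head trained to learn a k-sparse Boolean function of a length-T bit sequence. The architecture, the target function (a majority vote over k hidden coordinates), and all hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T, k, d = 64, 3, 32                   # context length, sparsity, embedding width
S = torch.randperm(T)[:k]             # hidden subset of relevant positions

def sparse_majority(x):
    # Target: majority vote over the k (odd) relevant bits -- a simple
    # k-sparse Boolean function with balanced classes for fair coin flips.
    return (x[:, S].sum(dim=1) * 2 > k).float()

class OneHeadModel(nn.Module):
    # A single self-attention head over token + position embeddings,
    # mean-pooled and passed through a linear readout.
    def __init__(self, T, d):
        super().__init__()
        self.tok = nn.Embedding(2, d)                 # embeddings for bits 0/1
        self.pos = nn.Parameter(0.02 * torch.randn(T, d))
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.readout = nn.Linear(d, 1)

    def forward(self, x):             # x: (B, T) tensor of {0, 1} tokens
        h = self.tok(x) + self.pos
        out, _ = self.attn(h, h, h)   # the single self-attention head
        return self.readout(out.mean(dim=1)).squeeze(-1)

model = OneHeadModel(T, d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):              # fresh samples each step
    x = torch.randint(0, 2, (256, T))
    loss = loss_fn(model(x), sparse_majority(x))
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                 # held-out accuracy at this T
    x = torch.randint(0, 2, (4096, T))
    acc = ((model(x) > 0).float() == sparse_majority(x)).float().mean()
print(f"T={T}, test accuracy: {acc:.3f}")
```

Sweeping T while tracking how many samples are needed to reach a fixed test accuracy would be one way to probe the claimed logarithmic dependence of sample complexity on context length.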