A butterfly network consists of logarithmically many layers, each with a linear number of non-zero weights (pre-specified). The fast Johnson-Lindenstrauss transform (FJLT) can be represented as a butterfly network followed by a projection onto a random subset of the coordinates. Moreover, a random matrix based on the FJLT approximates, with high probability, the action of any fixed matrix on a vector. Motivated by these facts, we propose to replace a dense linear layer in any neural network with an architecture based on the butterfly network. The proposed architecture reduces the quadratic number of weights required in a standard dense layer to nearly linear, with little compromise in the expressiveness of the resulting operator. In a wide variety of experiments, including supervised prediction on both NLP and vision data, we show that this not only produces results that match, and at times outperform, existing well-known architectures, but also offers faster training and prediction in deployment. To understand the optimization problems posed by neural networks containing a butterfly network, we also study the optimization landscape of an encoder-decoder network in which the encoder is replaced by a butterfly network followed by a dense linear layer in a smaller dimension. The theoretical results presented in the paper explain why the training speed and outcome are not compromised by our proposed approach.
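To make the structure concrete, here is a minimal sketch of how a butterfly network acts on a vector and how the FJLT arises as a special case. The function and parameter names (`butterfly_apply`, `twiddles`) are illustrative, not from the paper; the sketch assumes the input dimension is a power of two and that each layer mixes coordinate pairs at doubling strides, so every layer carries only a linear number of non-zero weights.

```python
import numpy as np

def butterfly_apply(x, twiddles):
    """Apply an L-layer butterfly network to x (length n = 2**L).

    twiddles: list of L arrays, each of shape (n // 2, 2, 2). Layer l
    mixes coordinate pairs at stride 2**l with a learned 2x2 block per
    pair, so each layer has 2n non-zero weights -- linear in n.
    """
    n = x.shape[0]
    L = int(np.log2(n))
    y = x.astype(float).copy()
    for l in range(L):
        stride = 1 << l
        p = 0  # running pair index into twiddles[l]
        for start in range(0, n, 2 * stride):
            for k in range(stride):
                i, j = start + k, start + k + stride
                (a, b), (c, d) = twiddles[l][p]
                y[i], y[j] = a * y[i] + b * y[j], c * y[i] + d * y[j]
                p += 1
    return y

# The normalized Hadamard transform is the special case where every 2x2
# block equals [[1, 1], [1, -1]] / sqrt(2); the FJLT then pre-multiplies
# by a random diagonal sign matrix D and keeps a random coordinate
# subset P, i.e. it computes P @ H @ D @ x.
rng = np.random.default_rng(0)
n, L = 8, 3
hadamard_block = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
twiddles = [np.tile(hadamard_block, (n // 2, 1, 1)) for _ in range(L)]
x = rng.standard_normal(n)                       # input vector
d = rng.choice([-1.0, 1.0], size=n)              # random diagonal D
keep = rng.choice(n, size=4, replace=False)      # random coordinate subset P
sketch = butterfly_apply(d * x, twiddles)[keep]
```

In the proposed architecture, the fixed Hadamard blocks above are replaced by trainable 2x2 blocks, so the whole stack stores O(n log n) parameters instead of the O(n^2) of a dense layer while retaining the FJLT-style mixing that motivates the construction.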