The conventional lottery ticket hypothesis (LTH) claims that a dense neural network contains a sparse subnetwork and a proper random initialization, together called the winning ticket, such that the subnetwork can be trained from scratch to nearly the accuracy of its dense counterpart. Meanwhile, the LTH has scarcely been evaluated for vision transformers (ViTs). In this paper, we first show that the conventional winning ticket is hard to find at the weight level of ViTs with existing methods. Then, inspired by the input dependence of ViTs, we generalize the LTH for ViTs to the input level: images consisting of image patches. That is, there exists a subset of input image patches such that a ViT can be trained from scratch on this subset alone and achieve accuracy similar to that of a ViT trained on all image patches. We call this subset of input patches the winning tickets; they carry a significant amount of the information in the input. Furthermore, we present a simple yet effective method to find the winning tickets in input patches for various types of ViT, including DeiT, LV-ViT, and Swin Transformers. More specifically, we use a ticket selector to generate the winning tickets based on the informativeness of patches. For comparison, we also build randomly selected subsets of patches, and our experiments show a clear performance gap between models trained with winning tickets and those trained with randomly selected subsets.
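The sketch below illustrates the core idea of patch-level winning tickets and the random baseline, not the paper's actual ticket selector (which is learned). The functions `patchify`, `select_winning_tickets`, and `select_random_patches` are hypothetical names, and the L2 norm of each patch vector is used as a stand-in informativeness score purely for illustration.

```python
# Minimal sketch, assuming 224x224 RGB images split into 16x16 patches
# (196 tokens). The L2-norm score is a placeholder for the paper's
# learned ticket selector; the random subset serves as the baseline.
import torch


def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into (B, N, C * patch_size**2) patch vectors."""
    b, c, _, _ = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (B, C, H/p, W/p, p, p) -> (B, N, C*p*p)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)


def select_winning_tickets(patches: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top-k patches ranked by a simple informativeness proxy (L2 norm)."""
    k = max(1, int(patches.size(1) * keep_ratio))
    scores = patches.norm(dim=-1)                # (B, N) informativeness proxy
    idx = scores.topk(k, dim=1).indices          # indices of the "winning" patches
    return torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))


def select_random_patches(patches: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Baseline: keep the same number of patches, chosen uniformly at random."""
    b, n, d = patches.shape
    k = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n, device=patches.device).topk(k, dim=1).indices
    return torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d))


if __name__ == "__main__":
    imgs = torch.randn(2, 3, 224, 224)
    p = patchify(imgs)                           # (2, 196, 768)
    print(select_winning_tickets(p).shape)       # (2, 98, 768)
    print(select_random_patches(p).shape)        # (2, 98, 768)
```

Under this setup, both subsets feed the same ViT training pipeline; the paper's claim is that the informativeness-ranked subset trains to markedly higher accuracy than the random one.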