Adversarial robustness is a key desirable property of neural networks. It has been empirically shown to be affected by their size, with larger networks typically being more robust. Recently, Bubeck and Sellke proved a lower bound on the Lipschitz constant of functions that fit the training data in terms of their number of parameters. This raises an interesting open question: do, and can, functions with more parameters, but not necessarily higher computational cost, have better robustness? We study this question for sparse Mixture of Experts models (MoEs), which make it possible to scale up the model size at a roughly constant computational cost. We theoretically show that, under certain conditions on the routing and the structure of the data, MoEs can have significantly smaller Lipschitz constants than their dense counterparts. The robustness of MoEs can suffer when the highest-weighted experts for an input implement sufficiently different functions. We next empirically evaluate the robustness of MoEs on ImageNet using adversarial attacks and show that they are indeed more robust than dense models with the same computational cost. We also make key observations on the robustness of MoEs to the choice of experts, highlighting the redundancy of experts in models trained in practice.
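For context, the Bubeck–Sellke result referred to above can be restated informally as follows; this is a hedged paraphrase of their "law of robustness", not a statement taken from this paper, and the symbols (n training points, input dimension d, p parameters) are the usual ones from their setting.

```latex
% Informal restatement (under Bubeck--Sellke's assumptions: n training points
% in dimension d with label noise, and an isoperimetric input distribution):
% any p-parameter model f that fits the data below the noise level satisfies,
% up to logarithmic factors,
\[
  \mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{n d}{p}} .
\]
% In particular, pushing the Lipschitz constant down to O(1) requires
% p on the order of nd, i.e. far more parameters than needed to interpolate.
```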
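To make concrete why MoEs scale parameters at roughly constant compute, here is a minimal NumPy sketch of a sparse MoE layer with top-k routing. It is illustrative only, not the paper's implementation; all names, shapes, and the choice k=2 are assumptions.

```python
import numpy as np

def sparse_moe_layer(x, gate_w, expert_ws, k=2):
    """Minimal sketch of a sparse Mixture-of-Experts layer with top-k routing.

    x:         (d,) input representation
    gate_w:    (d, E) router weights producing one logit per expert
    expert_ws: list of E weight matrices, each (d, d) -- the experts
    k:         number of experts evaluated per input (k << E)

    Total parameters grow linearly with the number of experts E, but
    per-input compute depends only on k, which is the scaling property
    the abstract refers to.
    """
    logits = x @ gate_w                     # router scores, shape (E,)
    topk = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                # softmax over the selected experts only
    # Combine the outputs of just the k selected experts.
    return sum(w * (x @ expert_ws[i]) for w, i in zip(weights, topk))

# Toy usage: 8 experts, only 2 evaluated per input.
rng = np.random.default_rng(0)
d, E = 16, 8
out = sparse_moe_layer(rng.normal(size=d),
                       rng.normal(size=(d, E)),
                       [rng.normal(size=(d, d)) for _ in range(E)],
                       k=2)
print(out.shape)  # (16,)
```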