Transformers are widely used to solve tasks in the natural language processing, computer vision, speech, and music domains. In this paper, we discuss the efficiency of transformers in terms of memory (the number of parameters), computation cost (the number of floating-point operations), and model performance, including accuracy, robustness, and fair \& bias-free behavior. We focus mainly on the vision transformer for the image classification task. Our contribution is an efficient 360 framework that covers the various aspects of the vision transformer needed to make it more efficient for industrial applications. With these applications in mind, we categorize the relevant concerns into multiple dimensions: privacy, robustness, transparency, fairness, inclusiveness, continual learning, probabilistic models, approximation, computational complexity, and spectral complexity. Finally, we compare various vision transformer models on multiple datasets in terms of their performance, number of parameters, and number of floating-point operations (FLOPs).