Network architecture plays a key role in deep learning-based computer vision systems. The widely used convolutional neural network and transformer treat the image as a grid or a sequence, respectively, which is inflexible for capturing irregular and complex objects. In this paper, we propose to represent the image as a graph and introduce a new Vision GNN (ViG) architecture to extract graph-level features for visual tasks. We first split the image into a number of patches, which are viewed as nodes, and construct a graph by connecting each node to its nearest neighbors. Based on this graph representation of images, we build the ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: a Grapher module with graph convolution for aggregating and updating graph information, and an FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built with different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNNs on general visual tasks will provide useful inspiration and experience for future research. The PyTorch code is available at https://github.com/huawei-noah/Efficient-AI-Backbones and the MindSpore code is available at https://gitee.com/mindspore/models.
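The pipeline described above (patches as nodes, a nearest-neighbor graph, a Grapher step, and a two-layer FFN) can be illustrated with a minimal NumPy sketch. This is not the released ViG implementation: the function names, the max-relative aggregation, the residual connections, the ReLU activation, and the random weights are all assumptions chosen only to make the data flow concrete.

```python
import numpy as np

def image_to_patches(image, patch):
    """Split an (H, W, C) image into non-overlapping patches, one flat vector per node."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def knn_graph(x, k):
    """Return an (N, k) array with the indices of each node's k nearest neighbors."""
    sq = (x ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)  # pairwise squared distances
    np.fill_diagonal(dist, np.inf)                       # exclude self-loops
    return np.argsort(dist, axis=1)[:, :k]

def grapher(x, nbrs, w):
    """Toy graph convolution (assumed form): aggregate neighbors, then update each node."""
    rel = x[nbrs] - x[:, None, :]        # (N, k, D) differences to each neighbor
    agg = rel.max(axis=1)                # (N, D) max over the neighborhood
    return x + np.concatenate([x, agg], axis=1) @ w  # residual is an assumption

def ffn(x, w1, w2):
    """Two linear layers with a ReLU in between (activation is an assumption)."""
    return x + np.maximum(x @ w1, 0.0) @ w2

# Hypothetical toy input: a random 32x32 RGB image, 8x8 patches, 4 neighbors per node.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
nodes = image_to_patches(image, patch=8)          # 16 nodes, 192 features each
nbrs = knn_graph(nodes, k=4)
d = nodes.shape[1]
w = rng.standard_normal((2 * d, d)) * 0.01        # random stand-ins for learned weights
w1 = rng.standard_normal((d, 4 * d)) * 0.01
w2 = rng.standard_normal((4 * d, d)) * 0.01
out = ffn(grapher(nodes, nbrs, w), w1, w2)        # one Grapher + FFN block
```

In this sketch the graph is built once from raw patch vectors; a real model would typically embed the patches first and stack many such blocks, which is left out here for brevity.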