In this paper, we propose a method that fuses CNN and transformer structures to improve image classification performance. A CNN extracts information about local regions of an image well, but is limited in its ability to extract global information. A transformer, on the other hand, is comparatively strong at extracting global information, but requires a large amount of memory to extract local features. In our approach, the image is converted into a feature map by a CNN, and each pixel of the feature map is treated as a token. In parallel, the image is divided into patches, which are treated as tokens in the manner of a transformer, and the two sets of tokens are fused. To fuse tokens with these two different characteristics, we propose three methods: (1) late token fusion with a parallel structure, (2) early token fusion, and (3) layer-by-layer token fusion. In experiments on ImageNet-1k, the proposed method shows the best classification performance.
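The two token types and the three fusion orders can be illustrated with a minimal sketch. It is not the implementation used in this work: the abstract does not specify the architecture, so every name below (`project`, `identity_encoder`, `layerwise_fusion`, and so on) is hypothetical, the encoders are identity stand-ins for real CNN and transformer blocks, the projection is a fixed stand-in for a learned embedding, and the layer-by-layer variant shows only one plausible reading of method (3).

```python
from typing import Callable, List

Token = List[float]
Encoder = Callable[[List[Token]], List[Token]]


def feature_map_to_tokens(fmap: List[List[Token]]) -> List[Token]:
    """Treat each pixel of an H x W x C feature map as one token of size C."""
    return [list(pixel) for row in fmap for pixel in row]


def image_to_patch_tokens(image: List[List[Token]], patch: int) -> List[Token]:
    """Flatten each non-overlapping patch x patch block of the image into one token."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token: Token = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(image[y][x])
            tokens.append(token)
    return tokens


def project(tokens: List[Token], dim: int) -> List[Token]:
    """Hypothetical stand-in for a learned embedding: map every token to size dim."""
    projected = []
    for token in tokens:
        buckets: List[List[float]] = [[] for _ in range(dim)]
        for j, value in enumerate(token):
            buckets[j % dim].append(value)
        projected.append([sum(b) / len(b) if b else 0.0 for b in buckets])
    return projected


def identity_encoder(tokens: List[Token]) -> List[Token]:
    """Placeholder for a real CNN or transformer block (assumption)."""
    return [list(t) for t in tokens]


def late_fusion(cnn_tokens, patch_tokens, cnn_branch: Encoder, vit_branch: Encoder):
    """(1) Run the two branches in parallel, then join their output tokens."""
    return cnn_branch(cnn_tokens) + vit_branch(patch_tokens)


def early_fusion(cnn_tokens, patch_tokens, encoder: Encoder):
    """(2) Join the tokens first, then run a single shared encoder."""
    return encoder(cnn_tokens + patch_tokens)


def layerwise_fusion(cnn_tokens, patch_tokens, layers: List[Encoder]):
    """(3) One assumed reading: join, process, and split again at every layer."""
    n_cnn = len(cnn_tokens)
    for layer in layers:
        fused = layer(cnn_tokens + patch_tokens)
        cnn_tokens, patch_tokens = fused[:n_cnn], fused[n_cnn:]
    return cnn_tokens + patch_tokens


# Toy inputs: a 4 x 4 RGB image and a 2 x 2 feature map with 8 channels.
image = [[[float(y + x + c) for c in range(3)] for x in range(4)] for y in range(4)]
fmap = [[[float(y * 2 + x)] * 8 for x in range(2)] for y in range(2)]

cnn_tokens = project(feature_map_to_tokens(fmap), dim=4)
patch_tokens = project(image_to_patch_tokens(image, patch=2), dim=4)

fused_late = late_fusion(cnn_tokens, patch_tokens, identity_encoder, identity_encoder)
fused_early = early_fusion(cnn_tokens, patch_tokens, identity_encoder)
fused_layer = layerwise_fusion(cnn_tokens, patch_tokens, [identity_encoder] * 2)
print(len(fused_early), len(fused_early[0]))
```

With identity encoders the three variants return the same tokens, so the sketch shows only where the concatenation happens relative to the encoders; with real blocks, that placement determines which layers can mix information between CNN tokens and patch tokens.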