Although convolutional neural networks can classify objects in images very accurately, it is well known that a network's attention is not always on the semantically important regions of the scene. Networks often learn background textures that are irrelevant to the object of interest, which makes them susceptible to variations and changes in the background that degrade their performance. We propose a new two-step training procedure, called split training, to reduce this bias in CNNs on both infrared imagery and RGB data. Split training proceeds in two steps: first, using an MSE loss, train the early layers of the network on images with background so that their activations match those of the same network trained on images without background; then, with these layers frozen, train the rest of the network with a cross-entropy loss to classify the objects. Our method outperforms the traditional training procedure, both on a simple CNN architecture and on deep, resource-intensive CNNs such as VGG and DenseNet, and achieves higher accuracy by learning to mimic human vision, which attends more to shape and structure than to background.
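To make the two-step procedure concrete, below is a minimal PyTorch sketch. It assumes a reference copy of the network has already been trained on background-free images and that paired with/without-background images of the same scenes are available; the class name SimpleCNN, the loader names, and the split point between feature layers and classifier are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of split training, assuming a pre-trained background-free
# reference network and paired (with-background, without-background) images.
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Toy CNN split into feature layers and a classifier head (hypothetical)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 4 * 4, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

def split_training(reference, model, paired_loader, labeled_loader, epochs=5):
    """reference: same architecture, already trained on background-free images.
    paired_loader yields (img_with_bg, img_without_bg) pairs of the same scene;
    labeled_loader yields (img_with_bg, label)."""
    mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()
    reference.eval()

    # Step 1: train the feature layers on images with background so their
    # activations match the reference network's background-free activations.
    opt = torch.optim.Adam(model.features.parameters(), lr=1e-3)
    for _ in range(epochs):
        for with_bg, without_bg in paired_loader:
            with torch.no_grad():
                target = reference.features(without_bg)
            loss = mse(model.features(with_bg), target)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Step 2: freeze the matched layers and train the rest of the network
    # with cross-entropy loss to classify the objects.
    for p in model.features.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
    for _ in range(epochs):
        for with_bg, label in labeled_loader:
            loss = ce(model(with_bg), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```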