We propose and demonstrate a representation learning approach that maximizes the mutual information between local features of images and text. The goal of this approach is to learn useful image representations by exploiting the rich information contained in the free text that describes the findings in the image. Our method learns image and text encoders by encouraging the resulting representations to exhibit high local mutual information, building on recent advances in mutual information estimation with neural network discriminators. We argue that, under typical conditions, the sum of local mutual information terms is a lower bound on the global mutual information. Our experimental results on downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning.
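The summed local objective can be sketched with an InfoNCE-style estimator, one of the standard neural lower bounds on mutual information. This is a minimal illustration, not the paper's implementation: the function name, the cosine-similarity critic, the temperature value, and the assumption that image and text features come pre-aligned per local position are all choices made here for clarity.

```python
import numpy as np

def local_mi_lower_bound(img_feats, txt_feats, temperature=0.1):
    """Sum of per-position InfoNCE lower bounds on local mutual information.

    img_feats, txt_feats: arrays of shape (batch, positions, dim).
    Row i of each tensor is a positive image-text pair; the other rows
    in the batch act as negatives. All names here are illustrative.
    """
    b, p, _ = img_feats.shape
    total = 0.0
    for k in range(p):
        x = img_feats[:, k, :]  # local image features at position k
        y = txt_feats[:, k, :]  # local text features at position k
        # Cosine-similarity critic: normalize, then scaled dot product.
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        y = y / np.linalg.norm(y, axis=1, keepdims=True)
        logits = x @ y.T / temperature  # (b, b) pairwise scores
        # Log-softmax over each row; the diagonal holds positive pairs.
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        # InfoNCE bound for this position: log(b) + mean positive log-prob.
        total += np.log(b) + np.mean(np.diag(log_probs))
    return total
```

Each per-position term is bounded above by log of the batch size, so the estimate is conservative for small batches; summing over positions yields the bound on the global mutual information referred to in the abstract.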